Ordinance Survey OpenData (Part 2 – Importing The Data)

December 10, 2010 – 15:26

All of the data is in different files; SSIS is capable of extracting data from multiple files however for the purposes of this article I am going to stick to the Import Export Wizard.

To combine all of the files into one (big) file a quick switch to the command prompt is required:

type data\*.csv > .\CodePointOpenCombined.csv

Because none of the data files have headers this works fine, if they did have headers some work would be needed to strip those out.

Create a new database in SQL Server then follow these steps:

  1. Right Click the Database select “Tasks” – “Import Data”.
  2. In the Data Source step change the drop down to “Flat File Source”.
  3. Select the combined file we created above (you may have to change the filter).
  4. Check the Columns page if Quotation Marks (“) appear in some of the columns change the Text qualifier field on the General Page to a “.
  5. On the Advanced page click Suggest Types.
  6. Set the number of rows to 1000 (the maximum), then click OK.
  7. Go through each column and update the name and DataType to match those we discovered in the previous post.
  8. Check the correct database and table are selected on the next two steps.
  9. Click Next then Next again, then check over the data type mappings.
  10. Click Next then ensure Run immediately is checked then click finish.
  11. All being well, all of the data will be imported successfully.

If there are problems importing the data you can go back and make changes to the configuration, typically the issue is incorrect data types (too small) or incorrect text delimiters.

You may be asking why we went to tall that trouble, and time, only to let the Import Data Wizard suggest the data types. The reason I wrote the script was the wizard is limited to checking the first 1,000 lines; even if you set the value to 2,000,000 it will default down to 1000 after you move your focus away.

The result being if your data is naturally sorted on a specific column as some of the Ordinance Survey data appears to be the import will fail. Running the schema scanner allows you to scan through all of the data so that you can modify the suggested data types to match the maximum values.

Ordinance Survey OpenData (Part 1 – Schema Scanner)

December 3, 2010 – 16:49

In April 2010 the Ordinance Survey released certain parts of their data under special licence which allows for commercial use without cost. All the types of data made available are outside the scope of this post although I hope that the techniques described could be applied to any data set not limited to Ordinance Survey data.

In this post I am going to look at Code-Point Open, a list of all UK postcodes with their corresponding spatial positions. Unlike many other OS OpenData downloads the ZIP file does not contain the User Guide or the Schema Data, this can be found on the website, I spent a good 10 minutes searching for this data.

The term for what we are doing in this post is Extract-Transform -Load (ETL), a process in which we take data in one format and covert it for use in another format. Generally ETL is used to take a flat file format and load it for use in a relational database, although technically any format or database could be used. SQL Server offers two built-in mechanisms to perform ETL; the “Import Export Wizard” and SQL Server Integration Services (SSIS). The “Import Export Wizard” actually creates a SSIS package in the background and is available to all versions of SQL Server, SSIS  is not available in SQL Express.

Before we create a table in a SQL Server Database we need to know something about the data we are importing, the documentation for Code-Point Open tells us the data contains the following fields:

Postcode, Quality, Unused1, Unused2, Unused3, Unused4, Unused5, Unused6, Unused7, Unused8, Eastings, Northings, CountryCode, RegionalHealthAuthority, HealthAuthority, AdminCounty, AdminDistrict, AdminWard, Unused10

A number of the fields are not used, the fields and the dummy data held within them will be weeded out at a later date, we know the fields but we don’t know the format of the data it contains, it could be numeric, strings, decimals, telephone numbers? I created a PowerShell script which scans through all of these files to work out what type of field it is and the range of data held within it, be warned it will take a few hours to run!

# Schema Scanner v1.0
# ©2010 Richard Slater
 
# Create an empty hash table
$columns = @{}
 
# Loop through every file that matches this pattern
foreach ($file in Get-ChildItem -Path "D:\OSOpenData\Code-Point Open\data\ze.csv")
{
	Write-Host "Processing $file"
 
	# PowerShell Import-Csv cmdlet is pretty powerful, but if there is no header row you must feed it in
	$PostCodeData = Import-Csv $file -Header "Postcode","Quality","Unused1","Unused2","Unused3","Unused4","Unused5","Unused6","Unused7","Unused8","Eastings","Northings","CountryCode","RegionalHealthAuthority","HealthAuthority","AdminCounty","AdminDistrict","AdminWard","Unused10"
 
	# Go through each row in the file
	foreach($row in $PostCodeData)
    {
		# Go through each column in the row
		foreach ($attr in (Get-Member -InputObject $PostCodeData[0] -MemberType NoteProperty))
		{
			$key = $attr.Name
 
			# Ignore unused columns
			if ($key.StartsWith("Unused"))
				{ continue }
 
			# Construct an object to store the meta data, store it in the hash table to retreive next loop
			$column = New-Object PSObject
			if (!$columns.ContainsKey($key))
			{
				$column | Add-Member -Type NoteProperty -Name StringLength -Value 0
				$column | Add-Member -Type NoteProperty -Name MaxValue -Value ([System.Int32]::MinValue)
				$column | Add-Member -Type NoteProperty -Name MinValue -Value ([System.Int32]::MaxValue)
				$columns.Add($key, $column)
			}
			else
				{ $column = $columns.Get_Item($key) }
 
			$isInt = $false
			$value = 0;
 
			# Work out if this is an integer type
			if ([System.Int32]::TryParse($row.($key), [ref] $value))
            	{ $isInt = $true }
 
			if (!$isInt)
            {
				# it is not an integer how many characters is the string
            	if (($row.($key)).Length -gt $column.StringLength)
                	{ $column.StringLength = ($row.($key)).Length }
 
				continue
            }
 
			# it is an integer start working out the maximum and minimum values
			if ( $value -gt $column.MaxValue ) { $column.MaxValue = $value }
			if ( $value -lt $column.MinValue ) { $column.MinValue = $value }
 
			$columns.Set_Item($key, $column)
		}
	}
}
 
# Print a report of all of the fields
foreach ($field in $columns.Keys)
{
	$stringLength = $columns[$field].StringLength
	$numericMax = $columns[$field].MaxValue
	$numericMin = $columns[$field].MinValue
 
	if ($stringLength -gt 0)
	{
		Write-Host "$field (String) : Length =" $columns[$field].StringLength
	}
	elseif (($numericMax -gt ([System.Int32]::MinValue)) -and ($numericMin -lt ([System.Int32]::MaxValue)))
	{
		Write-Host "$field (Numeric) : MaxValue =" $numericMax ", MinValue =" $numericMin
	}
	else
	{
		Write-Host "$field (Empty)"
	}
}

The output from the script should give you enough information to construct a nice tight schema to import the data:

AdminWard (String) : Length = 2
AdminDistrict (String) : Length = 2
AdminCounty (Numeric) : MinValue = 0 , MaxValue = 47
Quality (Numeric) :  MinValue = 10 , MaxValue = 90
RegionalHealthAuthority (String) : Length = 3
Postcode (String) : Length = 7
Eastings (Numeric) : MinValue = 0 , MaxValue = 655448
Northings (Numeric) : MinValue = 0 , MaxValue = 1213660
CountryCode (Numeric) : = 64 ,  MaxValue   = 220
HealthAuthority (String) : Length = 3

In a future post I am going to take it to the next stage; create a table and complete the import with the Import Export Wizard. I would also like to improve the performance of the schema scanner by converting the code into C#.

SchemaScanner

Things to listen to on the way to work

November 19, 2010 – 21:23

I spend about an hour and a half on the bus every day commuting to and from work. It was a huge boon for me when I got my HTC Desire as it opened up the world of Google Listen.

I wanted to go through some of the Podcasts that I listen to in the hope that someone else in my position can enjoy them and hopefuly learn as much from them as I have:

Software Development / Programming

  • DotNetRocks – one of the best produced .NET podcasts out there, Carl Franklin and Richard Campbell. All topics related to the .NET Platform and software development in general.
  • Hanselminutes – Scott Hanselman is a Microsoft Developer, Family Man and general Geek, podcasts are of a concise and casual nature covering mainly programming matters, but sometimes branching out into other areas of Scott’s interests.
  • Herding Code – kind of reminds me of the Spanish Inquisition (which I remember well…) many hosts one topic or guest. The shows can be a little long.
  • SitePoint Podcast – a fairly new addition to my listening list, I am not convinced on the format but there have been some good shows.
  • Sparkling Client – is short, really short. Which is great as the information is presented clearly and quickly, very programmer like. Makes me happy to see it appear on my List.

Entertainment

  • This Developers Life - modelled after This American Life some podcast I have never heard of, hosted by Scott Hanselman and Rob Conery; it is different, it’s not information its about experience it is also fairly new and has music.
  • Escape Pod – Sci-Fi podcast and blog, what can I say… I love Sci-Fi. Have a listen some of the stories are absolutely fantastic, well worth a listen.

So if you have the time to try some of these give them a go, let me know what you think and what else you find.

SteamBirds

November 19, 2010 – 13:16

Danc gave those who read LostGarden a heads up that Spry Fox was releasing SteamBirds for the Android platform. I have been watching Spry Fox since Dan mentioned it on Lost Garden. Due to being plain busy (no pun intended) I never got a chance to try SteamBirds on the web. Now it is on my phone I get a chance to play it while on the bus!

The premise is you are flying a nuclear-powered steam driven air plane. I encourage you to read the embeded back-story in the initial levels, it is great to see good writing has gone in with the game.

The game is turn based, you control your planes by selecting the type of manoeuvre then dragging navigation icon to the location you wish the plane to be at the next turn; the black line shows the path your plane will take.

If you tap your plane you see the name, health, stats and the manoeuvres it is capable of, as you have probably gathered by now the concept is to shoot down the other teams.

The healthier you keep your planes the more stars you get when you are victorious, the more stars you get the higher the rank you achieve.

I am really pleased I took the time to download this game, and even more pleased that Danc and SpryFox have been able to break away from the mould and start creating well designed, fun and highly polished games without loosing the indie edge.

Encapsulating Alpha Fade in Unity3d

November 12, 2010 – 21:09

Several days into my late Unity3d project I realised that the was a bulk of code designed solely to make an object invisible by fading it out of the scene. The code was not complex although because of the way it was all in one class it appeared complex.

After doing some research into the best way to go about making this change, realised it was both convenient and logical to extract the code into a separate script and attaching that to the object that I wished to apply the effect to.

This meant that instead of nested if-statements for state management in a script attached to the Main Camera I was able to make declarative statements:

GameObject.Find(“TargetObject”).GetComponent<SmoothAlpha>().MakeVisible();

I have named the script SmoothAlpha only by my convention no actual smoothing or damping of the alpha value, it is simply a linear reduction of the materials alpha value.

There are many improvements that could be made to the script, some of which I may well do over the coming weeks:

  • Should include a delegate call back to signal when the fade is complete.
  • Should include methods to instantly make an object (in)visible.
  • Should support changing the alpha of child GameObjects in unison with the parent.

In have included the full script below the cut.

Unity3d plus three weeks

November 6, 2010 – 2:07

It is about 3 weeks on since I started learning Unity3d, and today Makemedia delivered the product we were working on using it. I have thoroughly enjoyed the process and the experience, there is still much to learn however I am much more confident experimenting with Unity3d to see what I can come up with.

As I am not going out to see the fireworks tonight, I put together a 3d scene to demonstrate some of Unity Basics particle effects in the form of a personal fireworks display.

Move around in the scene using W, A, S, D or the Arrow Keys you can move your head by moving the mouse.

Some of the rockets go a bit crazy some time and fire off into the distance at low speeds which is quite peculiar when it is towards the camera.

Source Code

Unity 3D

October 16, 2010 – 21:18

I am starting a project next week using Unity3D, I have known about this for a while and have poked around a bit to try and figure out how to do various things. I have been really impressed as to how fast you can get something done with what is basically a free product.

There are some superb videos on Unity3d Student which have been invaluable. I think I have got my head around the controls, UI, paradigms and scripting, although not all in one scene. As evidence of my work I have made three balls bounce around the screen, something akin to a carpenter making a door stop perhaps, but I am sure I will learn.

Changing code accessibility modifiers quickly

October 12, 2010 – 21:20

I have two hints today both of them involving changing accessibility modifiers. The first is a feature of CodeRush that I accidentally discovered while testing out the Visual Studio 2010 Productivity Power Tools the second is a great new feature of Visual Studio 2010.

For those who don’t know what I mean by accessibility modifiers, those are the keywords you put before blocks of code that define how that code can be accessed, this is all enforced by the compiler giving you nice compiler error messages if you violate these rules. To provide an example the following auto-properties are all have the accessibility modifier “public”:

[Column] public string Name { get; set; }
[Column] public string Description { get; set; }
[Column] public decimal Price { get; set; }
[Column] public string Category { get; set; }

CodeRush gives you a handy way to change the User interface by clicking the icon to the left of the code block, you then get a set of actions you can perform upon the that block of code.

Code Rush Code Context Menu

As with most everything in CodeRush you can access this functionality from the keyboard anywhere within the scope of that code block by pressing Alt + Up or Alt + Down to cycle through the five possibilities:

  • public
  • internal
  • protected internal
  • protected
  • private

This means that if you need to change access modifiers, either to tighten up or relax the access points into your code you can do it by pressing a hand full of keys (between two and six to be exact), much faster than navigating to the top of the method highlighting and replacing the keyword.

My second hint is something that I have been using quite regularly, lets take the above example again:

[Column] public string Name { get; set; }
[Column] public string Description { get; set; }
[Column] public decimal Price { get; set; }
[Column] public string Category { get; set; }

I want to change all four fields to “be internal”, I could go in and change each one manually or better still use the above keyboard shortcuts (Alt + Down, Down, Alt + Down, etc.) the faster alternative would be to hold down the Alt key then select from the first character of the first “public” to the last character of the last “public” keyword.

While holding down Alt traditional selection behaviour is not followed and only the “public” keywords are selected on all four lines:

VS2010 Line Select

You can then type your new visibility modifier overwriting the selected components of all four lines in one fell swoop:

[Column] internal string Name { get; set; }
[Column] internal string Description { get; set; }
[Column] internal decimal Price { get; set; }
[Column] internal string Category { get; set; }

I think that is neat!

Elevated Command Prompt on Vista and Windows 7

October 7, 2010 – 13:32

I explained how to get an elevated Command Prompt to perform system tasks in the comments of my post about setting the MTU in Windows 7, I am writing the up a bit clearer and linking it from that post. I have a blog post for Windows 8 if you have already made the move.

In Vista and Windows 7 applications don’t automatically get administrator privilege, they either need to request it or the user needs to explicitly start the application as an Administrator. The way to do this with the Command Prompt is as follows:

  1. Press the “Start” button.
  2. Type “Command”.
  3. “Command Prompt” will be shown in the search results.
  4. Right Click “Command Prompt” and select “Run as Administrator” (it will have a blue and yellow shield beside it).
  5. When prompted click “Yes” to allow Command Prompt to start as Administrator.
  6. You will know it has worked because the title bar will start with “Administrator:”

Finding Code Issues

October 1, 2010 – 22:02

As programmers we are well known for writing bug free perfectly formed code first time round, as such I don’t really understand why DevExpress implemented the code issues feature… no wait… that should be the other way round.

CodeRush has several ways to access the Code Issues user interface, first off there is the thin file overview down the right hand side of the viewport. In the code itself there are handy contextual hints too.

Code Rush Issue In Context

In this example, “this.” is redundant and has been greyed out, if you hover over it it will tell you more about the issue and how to resolve it. Different issues have different notation so code that is to be transformed will have a coloured underline.

In the process of going though a file created by someone else the code style was a bit off – I was able to quickly bring it in line with the project preferences by using the shortcut keys Alt + PageDown to skip between the issues then Ctrl + ‘ to fix the issues quickly.

Great feature for spotting mistakes as you code, and also for learning new language features as they are introduced into the specification.