Ordnance Survey OpenData (Part 3 – Cleaning Up)

Friday, December 17th, 2010

If you look through the schema of the table we imported in Part 2, you will see a number of unused fields, and some of the data appears to be missing.

Cleaning up the Schema

  1. You can go right ahead and remove the fields that start with “Unused”; as far as I can tell, only the full version of Code-Point uses these fields.
  2. Remove the nullable attribute from all of the fields. This will prevent us from doing something silly at a later date, and will stop Object Relational Mappers such as Entity Framework from creating nullable data types.
  3. Many of the fields contain not the data itself but codes that describe other data, so let's append “Code” to the end of those field names for the time being.
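
As a rough illustration of step 1, the sketch below generates one T-SQL `ALTER TABLE ... DROP COLUMN` statement per unused field. The field list is the one from the Code-Point Open documentation quoted in Part 1 and the table name is the one used in the queries below; the helper name `drop_unused_statements` is my own invention. Treat it as a starting point and review the generated statements before running them.

```python
# Field names as listed in the Code-Point Open documentation (see Part 1).
FIELDS = [
    "Postcode", "Quality", "Unused1", "Unused2", "Unused3", "Unused4",
    "Unused5", "Unused6", "Unused7", "Unused8", "Eastings", "Northings",
    "CountryCode", "RegionalHealthAuthority", "HealthAuthority",
    "AdminCounty", "AdminDistrict", "AdminWard", "Unused10",
]


def drop_unused_statements(table, fields):
    """Return one ALTER TABLE ... DROP COLUMN statement per 'Unused' field.

    Hypothetical helper: it only builds strings, it does not touch a database.
    """
    return [
        f"ALTER TABLE [{table}] DROP COLUMN [{name}];"
        for name in fields
        if name.startswith("Unused")
    ]


if __name__ == "__main__":
    for statement in drop_unused_statements("CodePointOpenCombined", FIELDS):
        print(statement)
```

Generating the statements rather than typing them keeps the list of dropped columns tied to the documented field list.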

Cleaning up the Data

The Quality column in Code-Point Open describes the source and reliability of the data. It ranges from 10 (the most accurate) through to 90 (no data). When building a system around this data you need to decide what level of quality is acceptable for your use case. The following query will give you an idea of the quality of the dataset as a whole; I have annotated the results based upon the OS Code-Point documentation.

SELECT [Quality], COUNT(*) AS [Count]
FROM [OSOpenData].[dbo].[CodePointOpenCombined]
GROUP BY [Quality]
ORDER BY [Quality]

Quality  Count    Description
10       1683975  Within the building of the matched address closest to the postcode mean, determined automatically by Ordnance Survey.
20       73       As above, but determined by visual inspection by GROS (General Register Office for Scotland).
30       1086     Approximate to within 50 m of true position.
40       52       The mean of the positions of addresses previously matched in ADDRESS-POINT but which have subsequently been deleted or recoded.
50       4395     Estimated position based on surrounding postcode coordinates, usually to 100 m resolution, but 10 m in Scotland.
60       93       Postcode sector mean (direct copy from ADDRESS-POINT).
90       6361     No coordinates available.
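
To help decide where to draw the line, here is a small sketch that works out what fraction of the dataset survives a given quality cut-off. The counts are copied from the query results above; the function name `share_at_or_below` is purely illustrative.

```python
# Row counts per Quality code, copied from the query results above.
QUALITY_COUNTS = {
    10: 1683975,
    20: 73,
    30: 1086,
    40: 52,
    50: 4395,
    60: 93,
    90: 6361,
}


def share_at_or_below(counts, max_quality):
    """Fraction of postcodes whose Quality code is <= max_quality.

    Lower codes are more accurate, so this is the share of rows that
    would be kept if everything worse than max_quality were discarded.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(n for quality, n in counts.items() if quality <= max_quality)
    return kept / total


if __name__ == "__main__":
    for cutoff in sorted(QUALITY_COUNTS):
        share = share_at_or_below(QUALITY_COUNTS, cutoff)
        print(f"Quality <= {cutoff}: {share:.4%} of postcodes kept")
```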

For my purposes I want to use the coordinate data stored in the Eastings and Northings columns, which makes postcodes with no coordinates useless to me. I can remove these with the following SQL script:

DELETE FROM [CodePointOpenCombined]
WHERE [Quality] = 90

Ordnance Survey OpenData (Part 2 – Importing The Data)

Friday, December 10th, 2010

The data is spread across many files. SSIS is capable of extracting data from multiple files; however, for the purposes of this article I am going to stick to the Import Export Wizard.

To combine all of the files into one (big) file a quick switch to the command prompt is required:

type data\*.csv > .\CodePointOpenCombined.csv

Because none of the data files have headers this works fine. If they did have headers, some work would be needed to strip those out.
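
If the files did have headers, something like the following sketch could do the concatenation instead. It is not what I used for this import; the function name `combine_csv` and its `skip_header` flag are assumptions made for illustration. It works on raw bytes so that it makes no assumption about the files' text encoding.

```python
from pathlib import Path


def combine_csv(source_dir, target, skip_header=False):
    """Concatenate every .csv file in source_dir into target.

    Returns the number of lines written. When skip_header is True the
    first line of each source file is dropped.
    """
    target = Path(target)
    # Build the list first and exclude the target itself, in case it
    # lives in the same folder as the source files.
    sources = sorted(
        p for p in Path(source_dir).glob("*.csv")
        if p.resolve() != target.resolve()
    )
    written = 0
    with open(target, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                for index, line in enumerate(src):
                    if skip_header and index == 0:
                        continue
                    # Make sure the last line of one file does not run
                    # into the first line of the next.
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    out.write(line)
                    written += 1
    return written


if __name__ == "__main__":
    # Same layout as the command above: sources in .\data, output alongside.
    count = combine_csv("data", "CodePointOpenCombined.csv")
    print(f"Wrote {count} lines")
```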

Create a new database in SQL Server then follow these steps:

  1. Right-click the database and select “Tasks” – “Import Data”.
  2. In the Data Source step change the drop down to “Flat File Source”.
  3. Select the combined file we created above (you may have to change the filter).
  4. Check the Columns page; if quotation marks (") appear in some of the columns, change the Text qualifier field on the General page to ".
  5. On the Advanced page click Suggest Types.
  6. Set the number of rows to 1000 (the maximum), then click OK.
  7. Go through each column and update the name and DataType to match those we discovered in the previous post.
  8. Check the correct database and table are selected on the next two steps.
  9. Click Next then Next again, then check over the data type mappings.
  10. Click Next, ensure Run immediately is checked, then click Finish.
  11. All being well, all of the data will be imported successfully.

If there are problems importing the data you can go back and make changes to the configuration. Typically the issue is incorrect data types (too small) or incorrect text delimiters.

You may be asking why we went to all that trouble, and time, in Part 1 only to let the Import Data Wizard suggest the data types. The reason I wrote the script is that the wizard is limited to checking the first 1,000 lines; even if you set the value to 2,000,000 it will drop back to 1,000 after you move your focus away.

The result is that if your data is naturally sorted on a specific column, as some of the Ordnance Survey data appears to be, the import will fail, because values beyond the sampled rows can be larger than the suggested types allow. Running the schema scanner lets you scan all of the data so that you can modify the suggested data types to match the maximum values.
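
The effect is easy to reproduce. The sketch below is a toy illustration, not part of the import: it measures the longest value per column, either over a limited sample (as the wizard does) or over every row. The rows are synthetic placeholders rather than real postcodes, and `max_lengths` is a made-up helper name.

```python
from itertools import islice


def max_lengths(rows, limit=None):
    """Longest value seen in each column.

    If limit is given only the first `limit` rows are inspected, which
    mimics a wizard that samples the top of the file.
    """
    widths = []
    for row in islice(rows, limit):
        for i, value in enumerate(row):
            if i == len(widths):
                widths.append(0)
            widths[i] = max(widths[i], len(value))
    return widths


if __name__ == "__main__":
    # Synthetic, sorted data: 1,000 short values followed by one longer one.
    rows = [["x" * 6, "10"]] * 1000 + [["x" * 7, "10"]]
    print("First 1,000 rows:", max_lengths(rows, limit=1000))
    print("All rows:        ", max_lengths(rows))
```

A column sized from the sample alone would be one character too narrow for the final row, which is the kind of mismatch that makes the import fail.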

Ordnance Survey OpenData (Part 1 – Schema Scanner)

Friday, December 3rd, 2010

In April 2010 the Ordnance Survey released certain parts of their data under a special licence which allows for commercial use without cost. The full range of data made available is outside the scope of this post, although I hope that the techniques described could be applied to any data set, not just Ordnance Survey data.

In this post I am going to look at Code-Point Open, a list of all UK postcodes with their corresponding spatial positions. Unlike many other OS OpenData downloads, the ZIP file does not contain the User Guide or the schema data; these can be found on the website. I spent a good 10 minutes searching for them.

The term for what we are doing in this post is Extract-Transform-Load (ETL), a process in which we take data in one format and convert it for use in another format. Generally ETL is used to take a flat file format and load it for use in a relational database, although technically any format or database could be used. SQL Server offers two built-in mechanisms to perform ETL: the “Import Export Wizard” and SQL Server Integration Services (SSIS). The “Import Export Wizard” actually creates an SSIS package in the background and is available in all versions of SQL Server, whereas SSIS itself is not available in SQL Express.

Before we create a table in a SQL Server database we need to know something about the data we are importing. The documentation for Code-Point Open tells us the data contains the following fields:

Postcode, Quality, Unused1, Unused2, Unused3, Unused4, Unused5, Unused6, Unused7, Unused8, Eastings, Northings, CountryCode, RegionalHealthAuthority, HealthAuthority, AdminCounty, AdminDistrict, AdminWard, Unused10

A number of the fields are not used; these fields and the dummy data held within them will be weeded out at a later date. We know the fields but we don’t know the format of the data they contain: it could be numeric, strings, decimals or even telephone numbers. I created a PowerShell script which scans through all of these files to work out what type each field is and the range of data held within it. Be warned, it will take a few hours to run!

# Schema Scanner v1.0
# ©2010 Richard Slater
# Create an empty hash table
$columns = @{}
# Loop through every file that matches this pattern (replace ze.csv with *.csv to scan every file)
foreach ($file in Get-ChildItem -Path "D:\OSOpenData\Code-Point Open\data\ze.csv")
{
	Write-Host "Processing $file"
	# PowerShell Import-Csv cmdlet is pretty powerful, but if there is no header row you must feed it in
	$PostCodeData = Import-Csv $file.FullName -Header "Postcode","Quality","Unused1","Unused2","Unused3","Unused4","Unused5","Unused6","Unused7","Unused8","Eastings","Northings","CountryCode","RegionalHealthAuthority","HealthAuthority","AdminCounty","AdminDistrict","AdminWard","Unused10"
	# Go through each row in the file
	foreach ($row in $PostCodeData)
	{
		# Go through each column in the row
		foreach ($attr in (Get-Member -InputObject $PostCodeData[0] -MemberType NoteProperty))
		{
			$key = $attr.Name
			# Ignore unused columns
			if ($key.StartsWith("Unused"))
				{ continue }
			# Construct an object to store the meta data, store it in the hash table to retrieve next loop
			$column = New-Object PSObject
			if (!$columns.ContainsKey($key))
			{
				$column | Add-Member -Type NoteProperty -Name StringLength -Value 0
				$column | Add-Member -Type NoteProperty -Name MaxValue -Value ([System.Int32]::MinValue)
				$column | Add-Member -Type NoteProperty -Name MinValue -Value ([System.Int32]::MaxValue)
				$columns.Add($key, $column)
			}
			else
				{ $column = $columns.Get_Item($key) }
			$isInt = $false
			$value = 0
			# Work out if this is an integer type
			if ([System.Int32]::TryParse($row.($key), [ref] $value))
				{ $isInt = $true }
			if (!$isInt)
			{
				# it is not an integer how many characters is the string
				if (($row.($key)).Length -gt $column.StringLength)
					{ $column.StringLength = ($row.($key)).Length }
			}
			else
			{
				# it is an integer start working out the maximum and minimum values
				if ( $value -gt $column.MaxValue ) { $column.MaxValue = $value }
				if ( $value -lt $column.MinValue ) { $column.MinValue = $value }
			}
			$columns.Set_Item($key, $column)
		}
	}
}
# Print a report of all of the fields
foreach ($field in $columns.Keys)
{
	$stringLength = $columns[$field].StringLength
	$numericMax = $columns[$field].MaxValue
	$numericMin = $columns[$field].MinValue
	if ($stringLength -gt 0)
		{ Write-Host "$field (String) : Length =" $stringLength }
	elseif (($numericMax -gt ([System.Int32]::MinValue)) -and ($numericMin -lt ([System.Int32]::MaxValue)))
		{ Write-Host "$field (Numeric) : MinValue =" $numericMin ", MaxValue =" $numericMax }
	else
		{ Write-Host "$field (Empty)" }
}

The output from the script should give you enough information to construct a nice tight schema to import the data:

AdminWard (String) : Length = 2
AdminDistrict (String) : Length = 2
AdminCounty (Numeric) : MinValue = 0 , MaxValue = 47
Quality (Numeric) : MinValue = 10 , MaxValue = 90
RegionalHealthAuthority (String) : Length = 3
Postcode (String) : Length = 7
Eastings (Numeric) : MinValue = 0 , MaxValue = 655448
Northings (Numeric) : MinValue = 0 , MaxValue = 1213660
CountryCode (Numeric) : MinValue = 64 , MaxValue = 220
HealthAuthority (String) : Length = 3
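
To turn a report like this into column types, one option is to pick the smallest SQL Server integer type that covers each numeric range and a `varchar` sized to the longest string. The sketch below does that for the figures printed above. The helper names are invented for illustration, and `varchar` is only one reasonable choice (a fixed-width `char` might suit some of the code columns better).

```python
def sql_integer_type(min_value, max_value):
    """Smallest SQL Server integer type that holds the inclusive range."""
    if 0 <= min_value and max_value <= 255:
        return "tinyint"  # 0 to 255
    if -32768 <= min_value and max_value <= 32767:
        return "smallint"
    if -2**31 <= min_value and max_value <= 2**31 - 1:
        return "int"
    return "bigint"


def sql_string_type(length):
    """A varchar just wide enough for the longest value seen."""
    return f"varchar({length})"


# Ranges and lengths copied from the scanner output above.
NUMERIC = {
    "AdminCounty": (0, 47),
    "Quality": (10, 90),
    "Eastings": (0, 655448),
    "Northings": (0, 1213660),
    "CountryCode": (64, 220),
}
STRINGS = {
    "AdminWard": 2,
    "AdminDistrict": 2,
    "RegionalHealthAuthority": 3,
    "Postcode": 7,
    "HealthAuthority": 3,
}

if __name__ == "__main__":
    for name, (low, high) in NUMERIC.items():
        print(f"[{name}] {sql_integer_type(low, high)} NOT NULL")
    for name, length in STRINGS.items():
        print(f"[{name}] {sql_string_type(length)} NOT NULL")
```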

In a future post I am going to take it to the next stage; create a table and complete the import with the Import Export Wizard. I would also like to improve the performance of the schema scanner by converting the code into C#.