Processing UK 2021 census open data with python tools |

Introduction

The census provides a rich set of demographic information that could be useful for various data science tasks, such as geomarketing, house price analysis etc. With the results of the 2021 census being published recently, this notebook demonstrates how to manipulate this data with FOSS python tools to build a dataset that can be used for downstream tasks.

For this demo, I will be focusing on output areas (OA) of London:

download the boundaries for these OA
download demographics datasets (population, household size, gender, education level)
convert the irregular geometries of the OA into 100m by 100m cells
calculate the avgerages for the demographics per bin
convert the cell data into raster

These are the packages we will need:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
import numpy as np
from tqdm import tqdm
import fiona
import rasterio as rio
from geocube.api.core import make_geocube

Datasets

Population data

for Census 2021 according to Output Area (https://www.ons.gov.uk/filters/b9532b29-299e-4a23-9fc8-b99d68e172b9/dimensions)

Title: Population density Description: This dataset provides Census 2021 estimates that classify usual residents in England and Wales by population density (number of usual residents per square kilometre). The estimates are as at Census Day, 21 March 2021.

Output Areas (OAs) are the lowest level of geographical area for census statistics and were first created following the 2001 Census. Each OA is made up of between 40 and 250 households and a usually resident population of between 100 and 625 persons and may change after each census.

population_oa = pd.read_csv('data/uk-census-2021-oa.csv')
population_oa.head()

|Output Areas Code | Output Areas | Observation | |————————-|————–| | E00000001 | E00000001 | 26268.7 | | E00000003 | E00000003 | 60952.4 | | E00000005 | E00000005 | 12873.6 | | E00000007 | E00000007 | 1959.2 | | E00000010 | E00000010 | 71200.0 |

Geographic manipulation– OA boundaries to cells

To match the above OA codes with actual geography, we have to obtain the boundaries of each output area– these can be found on UK’s goverment geoportal– note that I am using the ‘generalised’ version instead of the full version.

boundaries = gpd.read_file('data/Output_Areas_(Dec_2021)_Boundaries_Generalised_Clipped_EW_(BGC)/Output_Areas_(Dec_2021)_Boundaries_Generalised_Clipped_EW_(BGC).shp')

boundaries.head()

OBJECTID	OA21CD	GlobalID	Shape__Are	Shape__Len	geometry
1	E00000001	bc5eb21b-d42b-4715-a771-2c27575a08f0	6949.151482	421.166161	POLYGON ((532303.492 181814.110, 532213.378 18…
2	E00000003	a1a2b34f-320e-4bb8-acb4-7ca7ca16ef9c	4492.411072	307.714653	POLYGON ((532213.378 181846.192, 532190.539 18…
3	E00000005	9337da1a-fe0f-4210-9c95-ed2d20fd6287	8565.514214	385.204781	POLYGON ((532180.131 181763.020, 532219.161 18…
4	E00000007	b336e11f-af26-48a6-ac67-44f5b8b8840a	75994.829704	1408.607657	POLYGON ((532201.292 181668.180, 532267.728 18…
5	E00000010	ca8f9874-cdf5-4c1a-9d39-f74a410dae44	2102.876602	215.271975	POLYGON ((532127.958 182133.192, 532089.264 18…

In the boundaries geodataframe, the columns we’ll need is that ‘OA21CD’, ‘Shape_Are’ and ‘geometry columns’: ‘OA21CD’ can join to the population dataset from above, Shape__Are gives the area of the OA in m^2 (note that this is different from the poulation measurement unit, which is in km^2). We can in fact cross check it against the geometry column

# check that the projection is indeed british national grid eastings and northings
print(f"geodata has projection of {boundaries.crs}") # this gives: epsg:27700
# check that the geometry gives ~the same area as what shape_are gives to a 10m^2 tolerance
assert ((boundaries.area.astype(int)- boundaries.Shape__Are.astype(int)).abs() < 10).sum() == len(boundaries)

Since we are only interested in Greater London, we can filter out rows from the boundaries dataframe that are outside of the london bounding box.

We can get the bounding box of london using the OSM query https://nominatim.openstreetmap.org/search.php?city=london&country=uk&format=jsonv2 (the first entry is the correct one), giving a boundingbox of ymin, ymax, xmin, xmax of “51.2867602”,“51.6918741”,"-0.5103751",“0.3340155”

ldn_bounding_box = Polygon.from_bounds( -0.5103751, 51.2867602,0.3340155, 51.691874 )
ldn_bounding_box = gpd.GeoSeries([ldn_bounding_box], crs='epsg:4326').to_crs(boundaries.crs)
ldn_bounding_box.geometry

    0    POLYGON ((503976.311 155234.131, 503059.759 20...
    dtype: geometry

Convert the geodataframe into a grid for downstream tasks

Ordnance survey actually has a file of british grids at various resolutions prebuilt (https://github.com/OrdnanceSurvey/OS-British-National-Grids), – the smallest is 1km by 1km, so we will construct our own 100m x 100m grid for london (We can do it for the whole uk, but we will have to iterate through the dataset as the number of cells will become very large)

diversion

This demos how to use the grid gpkg file from OS– the gpkg is a vector file with multiple ’layers’ the same way a raster file can have bands

# get the layer names
list(fiona.listlayers('data/os_bng_grids.gpkg'))

['100km_grid', '50km_grid', '20km_grid', '10km_grid', '5km_grid', '1km_grid']

layername = '1km_grid'
bsng_grid_1km = gpd.read_file('data/os_bng_grids.gpkg', layer=layername)
bsng_grid_1km.head()

tile_name	geometry
HL0000	POLYGON ((0.000 1200000.000, 1000.000 1200000….
HL0001	POLYGON ((0.000 1201000.000, 1000.000 1201000….
HL0002	POLYGON ((0.000 1202000.000, 1000.000 1202000….
HL0003	POLYGON ((0.000 1203000.000, 1000.000 1203000….
HL0004	POLYGON ((0.000 1204000.000, 1000.000 1204000….

Back to our task, we filter out the OA that intersects the london bounding box:

ldn_boundaries = boundaries[['OA21CD', 'geometry']].sjoin(gpd.GeoDataFrame(ldn_bounding_box.reset_index()), how='inner')[['OA21CD', 'geometry']]
ldn_boundaries.head()

OA21CD	geometry
E00000001	POLYGON ((532303.492 181814.110, 532213.378 18…
E00000003	POLYGON ((532213.378 181846.192, 532190.539 18…
E00000005	POLYGON ((532180.131 181763.020, 532219.161 18…
E00000007	POLYGON ((532201.292 181668.180, 532267.728 18…
E00000010	POLYGON ((532127.958 182133.192, 532089.264 18…

ldn_bounding_box.geometry[0].bounds

(503059.75944664044, 155234.13110756868, 562854.9611320551, 201813.5769950911)

# create a 100m by 100m grid for london
cellsize=100
# British National Grid extent (0,0 700000,1300000)

xmin, ymin,  xmax, ymax = [100 * (x//100 )for x in ldn_bounding_box.geometry[0].bounds] 
matrix = np.mgrid[int(xmin):int(xmax + cellsize):cellsize, int(ymin):int(ymax+ cellsize):cellsize]

xcoor = matrix[0].flatten()
ycoor = matrix[1].flatten()

xcoor_max = xcoor +cellsize
ycoor_max = ycoor + cellsize

polygons = []
for x, y, xmax, ymax in tqdm(zip (xcoor, ycoor, xcoor_max, ycoor_max), total=len(ycoor)):
    polygons.append(Polygon.from_bounds(x, y, xmax, ymax))
grid = gpd.GeoDataFrame(data={'cellid': np.arange(len(polygons))},geometry=polygons, crs=boundaries.crs)

Join the grid to the OA polygons to get the demographics info per grid cell

print(grid.shape)
print(ldn_boundaries.shape)

ldn_demographics_grid = grid.sjoin(boundaries)

(279733, 2)
(30160, 2)

# as we can see, the OA size span is quite large --luckily most OA are larger than a grid cell. 
ax =ldn_demographics_grid.drop_duplicates(subset='OA21CD').Shape__Are.apply(np.log10).hist(bins = 100)
ax.set_xlabel('OA area m^2')
ax.set_ylabel('number of OA')
ax.vlines(4,0,1800, colors='b', label='10,000 m^2 (Size of cell)')
ax.legend()

png

There are clearly duplicates– e.g. when a grid cell span more than 1 OA, and where 1OA span more than 1cell– this is more likely for the OA to span mutliple cells than the other way round. Base on this assumption, we assign each cell to one OA by dropping duplicadtes

ldn_demographics_grid = ldn_demographics_grid.drop_duplicates(subset='cellid')

# drop all the unneed cols
ldn_demographics_grid = ldn_demographics_grid[['cellid', 'OA21CD', 'geometry']]

Finally, let’s populate these grid cells with demographic data

household size by OA
education
[population density]((https://www.ons.gov.uk/filters/b9532b29-299e-4a23-9fc8-b99d68e172b9/dimensions)
sex

We have discussed the population data above, let’s now tidy up the other datasets

household size

houshold = pd.read_csv('data/household_size_output_area_census2021.csv')
houshold.head()

Output Areas Code	Output Areas	Household size (9 categories) Code	Household size (9 categories)	Observation
E00000001	E00000001	0	0 people in household	0
E00000001	E00000001	1	1 person in household	34
E00000001	E00000001	2	2 people in household	44
E00000001	E00000001	3	3 people in household	10
E00000001	E00000001	4	4 people in household	6

We can use the ‘code’ as a proxy for household size

houshold[['Household size (9 categories)', 'Household size (9 categories) Code']].drop_duplicates()

Household size (9 categories)	Household size (9 categories) Code
0 people in household	0
1 person in household	1
2 people in household	2
3 people in household	3
4 people in household	4
5 people in household	5
6 people in household	6
7 people in household	7

mean_household_size = houshold[['Output Areas Code', 'Household size (9 categories) Code','Observation']]
mean_household_size['counts'] = mean_household_size['Household size (9 categories) Code'] * mean_household_size['Observation']
mean_household_size = mean_household_size.groupby('Output Areas Code')['counts'].sum()
mean_household_size_totals = houshold[['Output Areas Code', 'Household size (9 categories) Code','Observation']].groupby('Output Areas Code')['Observation'].sum()

mean_household_size = mean_household_size.reset_index().merge(mean_household_size_totals.reset_index(), on='Output Areas Code')
mean_household_size['avg_household_size'] = mean_household_size['counts'] /  mean_household_size['Observation']
mean_household_size.head()

Output Areas Code	counts	Observation	avg_household_size
E00000001	176	94	1.872340
E00000003	256	109	2.348624
E00000005	112	63	1.777778
E00000007	145	87	1.666667
E00000010	173	125	1.384000

gender

gender = pd.read_csv('data/uk-census-2021-sex-oa.csv')
females = gender[gender['Sex (2 categories)'] == 'Female']
gender_total = gender.groupby(['Output Areas Code'])['Observation'].sum().reset_index()
percent_female = females[['Output Areas', 'Observation']].merge(gender_total, left_on='Output Areas', right_on='Output Areas Code')
percent_female['pct'] = percent_female['Observation_x'] / percent_female['Observation_y']
percent_female.head()

Output Areas	Observation_x	Output Areas Code	Observation_y	pct
E00000001	75	E00000001	176	0.426136
E00000003	123	E00000003	258	0.476744
E00000005	52	E00000005	112	0.464286
E00000007	67	E00000007	144	0.465278
E00000010	71	E00000010	178	0.398876

education

education = pd.read_csv('data/uk-census-2021-education-oa.csv')
education[['Highest level of qualification (8 categories) Code','Highest level of qualification (8 categories)'	]].drop_duplicates()

Highest level of qualification (8 categories) Code	Highest level of qualification (8 categories)
-8	Does not apply
0	No qualifications
1	Level 1 and entry level qualifications: 1 to 4…
2	Level 2 qualifications: 5 or more GCSEs (A* to…
3	Apprenticeship
4	Level 3 qualifications: 2 or more A levels or …
5	Level 4 qualifications and above: degree (BA, …

here make the simplifying assumption that we ignore the ‘does not apply category’, and the categories code go up in more advanced education

mean_education_level = education[['Output Areas Code', 'Highest level of qualification (8 categories) Code','Observation']]
mean_education_level['counts'] = mean_education_level['Highest level of qualification (8 categories) Code'].clip(lower=0) * mean_education_level['Observation']
mean_education_level = mean_education_level.groupby('Output Areas Code')['counts'].sum()
mean_education_level_totals = education[['Output Areas Code', 'Highest level of qualification (8 categories) Code','Observation']].groupby('Output Areas Code')['Observation'].sum()
mean_education_level = mean_education_level.reset_index().merge(mean_education_level_totals.reset_index(), on='Output Areas Code')
mean_education_level['avg_education_level'] = mean_education_level['counts'] /  mean_education_level['Observation']
mean_education_level.head()

Output Areas Code	counts	Observation	avg_education_level
E00000001	714	175	4.080000
E00000003	1022	257	3.976654
E00000005	485	113	4.292035
E00000007	640	144	4.444444
E00000010	656	179	3.664804

Joining all info together

Finally, we can join the demograhics data onto the geomtry to get gridded demographics data

# add education level info
ldn_demographics_grid_processed = ldn_demographics_grid.merge(mean_education_level[[
    'Output Areas Code', 'avg_education_level']], left_on='OA21CD', right_on='Output Areas Code').drop(columns='Output Areas Code')
# add % female
ldn_demographics_grid_processed = ldn_demographics_grid_processed.merge(percent_female[[
    'Output Areas Code', 'pct']], left_on='OA21CD', right_on='Output Areas Code').drop(columns='Output Areas Code').rename(columns={'pct': 'pct_females'})
# add mean household size
ldn_demographics_grid_processed = ldn_demographics_grid_processed.merge(mean_household_size[[
    'Output Areas Code', 'avg_household_size']], left_on='OA21CD', right_on='Output Areas Code').drop(columns='Output Areas Code')


ldn_demographics_grid_processed = ldn_demographics_grid_processed.merge(population_oa[[
    'Output Areas Code', 'Observation']], left_on='OA21CD', right_on='Output Areas Code').drop(columns='Output Areas Code').rename(
    columns={'Observation': 'pop_density'})
ldn_demographics_grid_processed['popoulation'] = ldn_demographics_grid_processed['pop_density'] * (100 * 100)/(1000 * 1000)

Rasterizing geodataframe

Certain tools work better with raster data, and since we have converted our data from a irregular geometry into cells, we can go all the way and convert the vector data into raster, using geocube.

Capping/binning population values

The population values have a fairly long tail– let’s cap this so it is easier to plot and handle


ldn_demographics_grid_processed['popoulation'].quantile([0.95])

We can see 95% cells have < 150 people

     Name: popoulation, dtype: float64,
     0.95    130.5817

ldn_demographics_grid_processed['popoulation'] = ldn_demographics_grid_processed['popoulation'].clip(lower=0, upper=150)

Now we convert the vector to raster (the resolution we are passing to geocube is what we are using for each of our cells– 100m )

ldn_demographics_raster = make_geocube(
    vector_data=ldn_demographics_grid_processed,
    measurements=["popoulation", "avg_education_level", "pct_females"],
    resolution=(-100, 100),
)

We can also plot our rasters (always a nice sanity check)

ldn_demographics_raster['popoulation'].plot()

png

ldn_demographics_raster['pct_females'].plot()

png

ldn_demographics_raster['avg_education_level'].plot()

png

# save our raster data
ldn_demographics_raster.rio.to_raster("data/ldn_demographics_raster.tiff")

Introduction#

Datasets#

Population data#

Geographic manipulation– OA boundaries to cells#

Convert the geodataframe into a grid for downstream tasks#

household size#

gender#

education#

Joining all info together#

Rasterizing geodataframe#