Tutorial: Exploring raster and vector geographic data with rasterio and geopandas

(Post originally published in steemit)

Repository

https://github.com/mapbox/rasterio

What Will I Learn?

  • You will learn how to manipulate geographic data with geopandas and rasterio.
  • You will learn how to integrate vector geographic data with raster files using rasterio and geopandas.

Requirements

  • Intermediate python knowledge.
  • Basic rasterio and geopandas knowledge.

Difficulty

  • Intermediate

Rationale

This is a small project project of geographic data exploration. The main tools for this task are: Rasterio and Geopandas. This analysis began as an attempt to measure the access to public ways in each of Guatemala municipalities. First I have used Open Street Maps data, obtained from here. I also have at hand Guatemala’s roads official data, obtained from the national institute of geography.. When comparing to official roads data, it can be seen OSM does not have all roads, also, OSM roads data contains streets for some places and has a very confusing set of categories. In general OSM roads data can be misleading. I will show you both and the code I have used to manipulate both datasets. The outcome measurement of this analysis is the proportion of people that live near a road. This outcome may be useful to analyse health seeking beahviour, for instance, or in general to analyse the population access to economical and social activities such as labor, consumption, trading, education and transportation.

First I am going to show the analysis on OSM data and after that I will show the IGN data analysis. Both analysis are the same. The only change is the roads data used.

For population estimations I have used Worldpop project data. I have introduced this before. That post also contains an introduction to rasterio. There is a possible issue with this analysis: Worldpop project used roads data to estimate population density. However, there are lots of other covariates that worldpop uses to model population density, such as poverty, night lights, deforestation, and many others. Even though this analysis is unconclusive and generates more questions than answers. it has been a great exercise to learn more about Fiona, Rasterio and GeoPandas.

Using OSM data:

import fiona
import rasterio
import rasterio.features
from rasterio import mask
import pandas as pd
import geopandas
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mlp
import scipy.ndimage as ndi
import re

%matplotlib inline

Reading the roads shapefile with geopandas is as simple as this:

gtroads_osm = geopandas.read_file("../../../DATOS/OSM/hotosm_gtm_roads_lines.shp/hotosm_gtm_roads_lines.shp")

It uses fiona to read and converts the data to a GeoDataFrame object, which allows the data to be manipulated with Pandas syntax. Let’s take a quick look to the data:

image.png

Now load the Worldpop raster file:

wpgt_file = "../../../DATOS/WorldPop/GTM_ppp_v2b_2015/GTM_ppp_v2b_2015.tif"
wpgt_r = rasterio.open(wpgt_file)
wpgt = wpgt_r.read()

wpgt_r is a RasterReader object, while wpgt is the first band read from the raster file. A raster file can contain more than one band. A band is a 2D array of values. In this case, the WorldPop band contains the estimated population for a ~100m x 100m square.

Now one of the main things I want to show in this tutorial is the way to convert a vector geographical object to a raster:

rasterizado = rasterio.features.rasterize(
    [(x.geometry, 1) for i, x in gtroads_osm.iterrows()],
    out_shape=wpgt_r.shape,
    transform=wpgt_r.transform,
    fill=0,
    all_touched=True,
    dtype=rasterio.uint8, )

Rasterio features module has a rasterize function that allows you to convert a vector object to an image. The first argument of this function is a list of tuples. Each tuple has two members, the first is the geometry to be rasterized and the second member is the value that you want to give to that geometry. Next we define the shape of the output band, the transform (both of these are just taken from the original worldpop raster file), the fill value for pixels that have no rasterized vector, the all_touched option draws in all pixels that are touched by the geometries.

To manipulate this new band, I have had to save it in a raster file. This is because there is no way to put a band in memory in a Raster object in rasterio. So the band is saved in a new raster file, and then loaded again.

# To write the rasterized map to a file:
profile = wpgt_r.profile
profile.update(
    dtype=rasterio.uint8,
    count=1,
    compress='lzw', 
    nodata=0)
rasterizado[rasterizado < 1] = 0
with rasterio.open("GtRoads_OSM_100m_x_100m.tif", 'w', **profile) as out:
    out.write_band(1, rasterizado)

As you can see, you need a profile to make a raster file. We copy the profile from the worldpop raster object and then we use it to oepn a rasterio file writer object which exports our new band.

# I have not found a way to get a raster from in memory data. So I have loaded it from the file written above.
gtroads_osm_raster = rasterio.open("GtRoads_OSM_100m_x_100m.tif", 'r')
gtroads_osm_r = gtroads_osm_raster.read()

To get a mask of the raster pixels that are close to a pixel containing a road, I have used a max filter from scipy’s ndimage library.

gtroads_osm_filtered = ndi.maximum_filter(gtroads_osm_r, size= 5 )

I have used a size of 5, which means that it makes a filter centered in each pixel with size 5×5. This will make a mask of geographical regions that are at between 200m and 300m of distance from a road, because each pixel has been assumed to have this size: 100m x 100m. This is a raw approximation, but let’s stick with it for the sake of simplicity.

Now I use this mask to get the population that lives at ~ 300m or less from a road. Roads band is composed of 1s and 0s. To get the populations of the cells with 1, we may just multiply the worldpop band by the roads filtered band:

gtroads_access_osm = wpgt[0] * gtroads_osm_filtered[0]

plt.rcParams['figure.figsize'] = 14, 14
plt.imshow(np.log10((np.fmax( gtroads_access_osm+1, 1))), vmin = 0, interpolation="bilinear")
bar = plt.colorbar(fraction=0.03)
bar.set_ticks([0, 0.301, 1.041, 1.708, 2.0043])
bar.set_ticklabels([0, 1, 10, 50, 100])

image.png

Note that matplotlib’s imshow function allows us to plot this 2D large array. I prefer to use transform the data with the log10 function because it has a wide range and cells with low population count become invisible without this rescaling. This is the worldpop dataset masked by proximity to guatemalan roads and streets from OSM data. It looks very nice, but it is better to have the actual numbers by municipality, so let’s now import the municipalities geometries and aggregate this raster by them.

# Now let's get Guatemala municipalities shapes
gpmunis_json = fiona.open("../../../DATOS/IGN/GT-IGN-cartografia_basica-Division politica Administrativa (Municipios).geojson", "r", "GeoJSON")
gpmunis = geopandas.GeoDataFrame.from_features(gpmunis_json)

To aggregate the data points that are contained in each municipality polygon shape, we use the mask module. The mask function takes a raster band and a geometry. It generates a masked array subr and the bounds of the geometry within the raster, bounds.

munis_pop = []
for muni in gpmunis_json:
    subr, bounds = mask.mask(wpgt_r, [muni["geometry"]])
    munis_pop.append({
        "road_access": np.sum(gtroads_access_osm[subr.data[0]>0]),
        "total_pop": np.sum(subr.data[subr.data>0]),
        "codigo": muni["properties"]["COD_MUNI__"]
    }) 
    print("processes muni ", muni["properties"]["COD_MUNI__"])

As you can see, I have masked only over the worldpop raster band. To get the population close to roads (road_access) I use the subr mask as a filter (it filters only the positive values greater than 0) over the gtroads_access_osm array, which contains the population counts for cells close to roads. All this data is put in a list of dictionaries and then it is converted to a dataframe and merged with the municipalities geometries dataset.

mpdf = pd.DataFrame(munis_pop)
mpdf["perc_access"] = mpdf.road_access / mpdf.total_pop
gpmunis_osm = gpmunis.copy().merge(mpdf, left_on="COD_MUNI__", right_on="codigo", how="left")

Now we can visualize this in a nice map easily with GeoPandas. Note that GeoPandas takes care of everything, including the legend of the choropleth:

plt.rcParams['figure.figsize'] = 14, 14

plot = gpmunis_osm.plot(column = "perc_access", legend= True)
plot.legend(loc=2, prop={'size': 6})

image.png

Using IGN Data

Now I want to repeat this process with IGN data in order to compare between these two. IGN data has only roads and will be quite different. Let’s see:

gtroads_ign = geopandas.read_file("../../../DATOS/IGN/IGN-cartografia_basica-Red de CArreteras.geojson", 
                                  mode="r", driver="GeoJSON")

print(len(gtroads_ign))
gtroads_ign.head()

image.png

It is unfortunate that IGN data has the properties encoded in raw html. So I made a quick parser to manage this inconvenient:

prev = None
def process_description_in_ugly_ign_format(input_str):
    if not input_str.startswith("Continuation"):
        return pd.Series(index = re.findall("atr-name\"\>([\w ]*)\<", input_str), 
                         data = re.findall("atr-value\"\>([^\<]*)\<", input_str))
    return pd.Series([])
    
properties = \
    gtroads_ign.description.apply(process_description_in_ugly_ign_format).fillna(method="ffill")
gtroads_ign = gtroads_ign.merge(properties, left_index=True, right_index=True)

I make use of regular expressions to find properties names and values. I put them in a pandas DataFrame and then I merge it with the original GeoDataFrame.

image.png

Now I perform the process described above for the OSM data:

rgtroads_ign = rasterio.features.rasterize(
    [(x.geometry, 1) for i,x in gtroads_ign.iterrows()],
    out_shape=wpgt_r.shape,
    transform=wpgt_r.transform,
    fill=0,
    all_touched=True,
    dtype=rasterio.uint8, )

# To write the rasterized map to a file:
profile = wpgt_r.profile
profile.update(
    dtype=rasterio.uint8,
    count=1,
    compress='lzw', 
    nodata=0)
rgtroads_ign[rgtroads_ign < 1] = 0
with rasterio.open("GtRoads_IGN_100m_x_100m.tif", 'w', **profile) as out:
    out.write_band(1, rgtroads_ign)
    
gtroads_raster_ign = rasterio.open("GtRoads_IGN_100m_x_100m.tif", 'r')
gtroads_ign_rdata = gtroads_raster_ign.read()

gtroads_ign_filtered = ndi.maximum_filter(gtroads_ign_rdata, size= 5 )

gtroads_access_ign = wpgt[0] * gtroads_ign_filtered[0]

plt.rcParams['figure.figsize'] = 14, 14
plt.imshow(np.log10((np.fmax( gtroads_access_ign+1, 1))), vmin = 0, interpolation="bilinear")
bar = plt.colorbar(fraction=0.03)
bar.set_ticks([0, 0.301, 1.041, 1.708, 2.0043])
bar.set_ticklabels([0, 1, 10, 50, 100])

image.png

You can see it looks quite different to the OSM rasterized map. Much less streets and roads. However, you can easily find roads that are missing in the OSM dataset (specially in Quiché district).

munis_acc_ign = []
for muni in gpmunis_json:
    subr, bounds = mask.mask(wpgt_r, [muni["geometry"]])
    munis_acc_ign.append({
        "road_access" : np.sum(gtroads_access_ign[subr.data[0]>0]),
        "total_pop"   : np.sum(subr.data[subr.data>0]),
        "Codigo"      : muni["properties"]["COD_MUNI__"]
    })
    print("processes muni ", muni["properties"]["COD_MUNI__"])

mpdf_ign = pd.DataFrame(munis_acc_ign)
mpdf_ign["perc_access"] = np.round(100 * mpdf_ign.road_access / mpdf_ign.total_pop)

gpmunis_ign = gpmunis.copy().merge(mpdf_ign, left_on="COD_MUNI__", right_on="Codigo", how="left")

plt.rcParams['figure.figsize'] = 14, 14

plot = gpmunis_ign.plot(column = "perc_access", legend= True)
plot.legend(loc=2, prop={'size': 6})

image.png

Conclusion

In this tutorial I have shown some of the features in rasterio for masking, rasterizing and manipulation of raster files. I have made minor use of geopandas package to load GIS geometries in GeoJSON and Shapefile formats and to make a choropleth visualization without much effort. It has been shown that these tools , rasterio, fiona, geopandas, numpy, scipy, and others, can be used together to manipulate vector and rasterized geographical data.

Curriculum

Proof of Work Done

https://github.com/Guitlle/Guatemala-roads-access