class: center, middle, inverse, title-slide # Reproducible geo-spatial analysis based on census and other Canadian data ### Jens von Bergmann ### 2019-11-13 --- ## Reproducible geo-spatial analysis based on census and other Canadian data -- * **Reproducible**: can be repeated by others with minimal work (also, can be adapted and tweaked easily) -- * **geo-spatial** analysis: data has a spatial component, which requires appropriate statistical methods -- * **census** and other **Canadian** data: Accessing and pre-processing data is a large part of data analysis, Canadian data infrastructure is still in it's infancy and requires acquisition and processing infrastructure ??? These points push us outside of Desktop GIS software and toward R or Python. For this talk we will work in R --- ## Tools: * [*R* programming language](https://www.r-project.org) * [`cancensus` package](https://mountainmath.github.io/cancensus/) to access census data via the [CensusMapper API](https://censusmapper.ca/api) * [`cansim` package](https://mountainmath.github.io/cansim/) to access census data via the [StatCan NDM API](https://www.statcan.gc.ca/eng/developers/wds) * [`tongfen` package](https://github.com/mountainMath/tongfen) to normalize multi-year census data to a common geography * [`sf` package](https://github.com/r-spatial/sf) for standard spatial operations * [`spatialreg` package](https://github.com/r-spatial/spatialreg) for statistical spatial modelling * [`ei` package](https://cran.r-project.org/web/packages/ei/ei.pdf) for ecological inference modelling --- # CensusMapper CensusMapper is my answer to the inaccessibility of census data by non-experts. It allows instant and flexible mapping of census data. Canada wide. Maps can be narrated, saved and shared. By anyone. --- background-image: url("https://doodles.mountainmath.ca/images/net_van.png") background-position: 50% 50% background-size: 100% class: center, bottom, inverse # <a href="https://censusmapper.ca/maps/731" target="_blank">CensusMapper Demo</a> ??? Lots of hidden features too that aren't accessible to general public. Don't have the resources to make them more user-friendly and release to public free to use. --- # Maps aren't analysis CensusMapper has APIs to facilitate deeper analysis. Open for all to use. [`cancensus`](https://github.com/mountainMath/cancensus) is an R package that seamlessly integrates census data into data analysis in R. Let's try and understand the effects of the net migration patterns by age on the age distribution. ??? While we do need better data, we don't make good use of the data we already have. What's needed most is analysis. --- # Age pyramids How does the net migration effect the age distribution in each municipality? ```r plot_data <- get_age_data('CA16',list(CSD=c("5915022","5915004","5915055"))) ggplot(plot_data, aes(x = Age, y = Population, fill = Gender)) + geom_bar(stat="identity") + age_pyramid_styling + facet_wrap("`Region Name`",nrow=1, scales="free_x") ``` <!-- --> --- ## How does this work? Let's walk through an example. ### Today's toy question: **How has residential mobility changed over the past 10 years?** The census tracks how many people lived at the same residence as they did 5 years prior. We can take the 2001-2006 and 2011-2016 time frames to see how the share of people saying in the same residence has changed. And explore geographic differences. ??? We could expand on this to try and understand what causes differences that we may find. --- background-image: url("/images/api_tool.png") background-position: 50% 50% background-size: 100% class: center, bottom, inverse # How to get the data? ## <a href="https://censusmapper.ca/api" target="_blank">CensusMapper API Demo</a> ??? Getting data is often half the work. I have spent quite a bit of time building tools to make this easier. --- ### Data import ```r mobility_variables <- c(non_movers_CA16="v_CA16_6722",priv_pop_CA16="v_CA16_424", non_movers_CA06="v_CA06_461",priv_pop_CA11="v_CA11F_216", non_movers_CA11="v_CA11N_1747",priv_pop_CA06="v_CA06_134", non_movers_CA01="v_CA01_391",priv_pop_CA01="v_CA01_127") top_cities <- list_census_regions("CA16") %>% filter(level=="CSD",!(name %in% c("Montréal","Québec"))) %>% top_n(10,pop) %>% as_census_region_list() van <- list(CSD=c("5915022","5915803"), CT=c("9330069.00","9330069.01","9330069.02")) regions <- list(CMA="59933") compute_mover_change <- function(data) data %>% mutate_at(vars(matches("movers_|priv_pop")),list(~ifelse(.==0,NA,.))) %>% mutate(`2001-2006`=1-non_movers_CA06/priv_pop_CA01, `2006-2011`=1-non_movers_CA11/priv_pop_CA06, `2011-2016`=1-non_movers_CA16/priv_pop_CA11) %>% mutate(`Change in Movers`=`2011-2016`-`2001-2006`) ``` --- ## City level data ```r city_data <- get_census("CA16",regions=top_cities,vectors=mobility_variables) %>% compute_mover_change() %>% pivot_longer(c("2001-2006","2006-2011","2011-2016")) ggplot(city_data,aes(x=`Region Name`,y=value,fill=name)) + geom_bar(stat="identity",position="dodge") + bar_theme + labs(y="Share of Movers",title="Share of population moving",fill="Period") ``` <!-- --> ??? Note we removed Montréal and Québec, these cities have had significant boundary changes and require more careful attention. --- ## Geographic breakdown Overall mobility declined everywhere 2001-2006 to 2011-2016. The 2006-2011 data is based on the NHS and likely biases low. Focus on Vancouver. Take census tracts as base for *neighbourhoods*. ```r cov_geos <- lapply(seq(2001,2016,5),function(y) get_census(paste0("CA",substr(y,3,4)),regions=van,level="CT",geo_format="sf") %>% select(GeoUID,geometry) %>% mutate(Year=y)) %>% do.call(rbind,.) ggplot(cov_geos) + geom_sf() + facet_wrap("Year",nrow=1) + coord_sf(datum=NA) ``` <!-- --> --- ## Enter TongFen (通分) Need a common geography for 2001 through 2016 data, and aggregate data to this common geography. This is a common obstacle. [TongFen](https://github.com/mountainMath/tongfen) automates this. -- ```r movers_data <- get_tongfen_census_ct_from_da(regions, vectors=mobility_variables, geo_format = 'sf') %>% compute_mover_change() movers_data %>% select(matches("TongfenID|movers_CA")) %>% st_set_geometry(NULL) %>% head(5) %>% knitr::kable("html") ``` <table> <thead> <tr> <th style="text-align:left;"> TongfenID </th> <th style="text-align:right;"> non_movers_CA01 </th> <th style="text-align:right;"> non_movers_CA06 </th> <th style="text-align:right;"> non_movers_CA11 </th> <th style="text-align:right;"> non_movers_CA16 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 9330001.01 </td> <td style="text-align:right;"> 5320 </td> <td style="text-align:right;"> 5200 </td> <td style="text-align:right;"> 5465 </td> <td style="text-align:right;"> 5465 </td> </tr> <tr> <td style="text-align:left;"> 9330001.02 </td> <td style="text-align:right;"> 1865 </td> <td style="text-align:right;"> 1970 </td> <td style="text-align:right;"> 2735 </td> <td style="text-align:right;"> 2750 </td> </tr> <tr> <td style="text-align:left;"> 9330002.01 </td> <td style="text-align:right;"> 2870 </td> <td style="text-align:right;"> 2960 </td> <td style="text-align:right;"> 3295 </td> <td style="text-align:right;"> 3535 </td> </tr> <tr> <td style="text-align:left;"> 9330002.02 </td> <td style="text-align:right;"> 4130 </td> <td style="text-align:right;"> 4280 </td> <td style="text-align:right;"> 5215 </td> <td style="text-align:right;"> 4950 </td> </tr> <tr> <td style="text-align:left;"> 9330003.01 </td> <td style="text-align:right;"> 1715 </td> <td style="text-align:right;"> 1860 </td> <td style="text-align:right;"> 2330 </td> <td style="text-align:right;"> 1985 </td> </tr> </tbody> </table> --- ## Visual inspection ```r map_data <- movers_data %>% gather(key="Period", value, c("2001-2006","2011-2016")) ggplot(map_data) + geom_sf(aes(fill=value),size=0.25) + facet_wrap("Period") + labs(title="Share of movers") + map_theme_m_1 ``` <!-- --> --- ## Geography of change ```r ggplot(movers_data) + geom_sf(aes(fill=`Change in Movers`),size=0.25) + labs(subtitle="Percentage point change 2001-2006 vs 2011-2016") + map_theme_m_2 ``` <!-- --> --- ## City of Vancouver ```r cov_md <- movers_data %>% filter(TongfenID %in% cov_geos$GeoUID) map_data <- cov_md %>% gather(key="Period", value, c("2001-2006","2011-2016")) ggplot(map_data) + geom_sf(aes(fill=value)) + facet_wrap("Period") + map_theme_1 + labs(title="Share of movers") ``` <!-- --> --- ## City of Vancouver change ```r ggplot(cov_md) + geom_sf(aes(fill=`Change in Movers`)) + map_theme_2 + labs(subtitle="Percentage point change in share of movers 2001-2006 vs 2011-2016") ``` <!-- --> --- ## Question What could explain (loss of) residential mobility? -- * tenure * age * housing costs * income -- This is the point when should get a custom tabulation. Before doing that, there are two routes that can help us: * PUMF data (synthetic 1:40 subsample of the census, CMA level only) * Ecological inference (exploit geographic variation to estimate individual level effects) ```r tenur_vars <- c(owner_CA11="v_CA11N_2253",tenure_base_CA11="v_CA11N_2252", owner_CA01="v_CA01_99",tenure_base_CA01="v_CA01_96") tenure_data <- get_tongfen_census_ct_from_da(regions=regions, vectors=tenur_vars) %>% mutate(owner_share_CA11=owner_CA11/tenure_base_CA11, owner_share_CA01=owner_CA01/tenure_base_CA01) %>% select(matches("TongfenID|_share_|owner_")) all_data <- movers_data %>% left_join(tenure_data,by="TongfenID") ``` --- ## Ecological inference ```r ei_data_01_06 <- all_data %>% filter(!is.na(`2001-2006`),!is.na(owner_share_CA01)) ei_01_06 <- ei::ei(formula = formula_01_06, data=ei_data_01_06, id="TongfenID", total="Households_CA01") ei_data_11_16 <- all_data %>% filter(!is.na(`2011-2016`),!is.na(owner_share_CA11)) ei_11_16 <- ei::ei(formula = formula_11_16, data=ei_data_11_16, id="TongfenID", total="Households_CA11") ``` .pull-left[ #### PUMF data <table> <thead> <tr> <th style="text-align:left;"> Tenure </th> <th style="text-align:left;"> 2001-2006 </th> <th style="text-align:left;"> 2011-2016 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Owner </td> <td style="text-align:left;"> 35.8% </td> <td style="text-align:left;"> 34.3% </td> </tr> <tr> <td style="text-align:left;"> Renter </td> <td style="text-align:left;"> 72.0% </td> <td style="text-align:left;"> 64.9% </td> </tr> </tbody> </table> ] .pull-right[ #### Ecological Inference estimates <table> <thead> <tr> <th style="text-align:left;"> Tenure </th> <th style="text-align:left;"> 2001-2006 </th> <th style="text-align:left;"> sd_01 </th> <th style="text-align:left;"> 2011-2016 </th> <th style="text-align:left;"> sd_11 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Owner </td> <td style="text-align:left;"> 34.9% </td> <td style="text-align:left;"> 0.4% </td> <td style="text-align:left;"> 33.5% </td> <td style="text-align:left;"> 0.3% </td> </tr> <tr> <td style="text-align:left;"> Renter </td> <td style="text-align:left;"> 69.7% </td> <td style="text-align:left;"> 0.7% </td> <td style="text-align:left;"> 65.1% </td> <td style="text-align:left;"> 0.6% </td> </tr> </tbody> </table> ] --- ## How does Ecological Inference work? .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] --- ## Mapping the results <!-- --> <table> <thead> <tr> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Owners 01-06 </th> <th style="text-align:left;"> Owners 11-16 </th> <th style="text-align:left;"> Renters 01-06 </th> <th style="text-align:left;"> Renters 11-16 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> CoV </td> <td style="text-align:left;"> 32.3% </td> <td style="text-align:left;"> 32.4% </td> <td style="text-align:left;"> 67.8% </td> <td style="text-align:left;"> 64.5% </td> </tr> <tr> <td style="text-align:left;"> Rest </td> <td style="text-align:left;"> 35.5% </td> <td style="text-align:left;"> 33.8% </td> <td style="text-align:left;"> 71.5% </td> <td style="text-align:left;"> 65.5% </td> </tr> </tbody> </table> --- # Reproducibility, Transparency, Adaptability .pull-left[ ### Notebooks A data Notebook is a document that integrates explanatory text and data analysis. In its crudest form this could be an Excel spreadsheet with embedded comments. At the other end of the spectrum are R or Python Notebooks. In fact, this presentation is an R notebook and [lives on GitHub](https://github.com/mountainMath/presentations/blob/master/UBC_GIS_Day_2019.Rmd). It contains all the code to reproduce the graphs in the presentation. ] .pull-right[ ### APIs In order to be reproducible, any analysis should ship with code and the data. But that's not very adaptable. To be adaptable, the data should come through APIs. That way one can easily make changes that requires slightly different data, e.g. use related census variables, other time frames or geographic regions. ] -- .center[**This greatly accelerates analysis.**] -- I will leave you with a quiz questions. ??? This is key to building an ecosystem of people and groups that collaborate to advance understanding of civic issues. Opening up your analysis for everyone to see and pluck apart might be uncomfortable at first, but it's essential to take the discussion to the next level. It increases transparency and trust, and allows others to build on your work. --- class: center inverse ## What do the colours represent? <!-- -->