An ecosystem of R packages to access and process Canadian data

class: canverse, title-slide

.title[
# An ecosystem of R packages to access and process Canadian data
]
.subtitle[
## CANSSI Conference
]
.author[
### Jens von Bergmann
]
.date[
### 2022-11-10
2:30 - 3:00 pm ET
]

---

<img src="images/canverse.png" width="500px" style="float:right;margin-left:10px;">
# Overview

We have a lot of high-quality data in Canada, and are continuously getting more and better data. What we are lacking is an **ecosystem of data analysis**.

To enable this we require ways to access and import data in a robust, pinpointed, and reproducible way.

--
Different users have different needs, so we need a variety of tools.

* CensusMapper is a versatile mapping tool for Canadian census data that can be used by the general public.
* For deeper analysis, R packages provide a uniform interface for data ingestion and help with common data challenges.

---
background-image: url("https://doodles.mountainmath.ca/images/net_van.png")
background-position: 50% 50%
background-size: 100%
class: center, bottom, inverse

# <a href="https://censusmapper.ca/maps/731" target="_blank">CensusMapper Demo</a>

???
Built in 2015. It also has lots of hidden features that aren't accessible to the general public. We don't have the resources to make them more user-friendly and release them for free public use.

---
class: short-title
# **cancensus**
.pull-left[
Maps aren’t analysis

The [**cancensus** R package](https://mountainmath.github.io/cancensus/) interfaces with the CensusMapper API server. It can be queried for
- census geographies for 1996, 2001, 2006, 2011, 2016, and 2021
- census data for 1996 through 2021
- hierarchical metadata of census variables
- some non-census data that comes on census geographies, e.g. T1FF taxfiler data

A slight complication: the [`cancensus` package](https://mountainmath.github.io/cancensus/) needs an API key, which is freely available at [CensusMapper](https://censusmapper.ca/users/sign_up).

CensusMapper also has an <a href="https://censusmapper.ca/api" target="_blank">API GUI</a> to facilitate selecting data.
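
A minimal sketch of supplying the key for the current session, assuming the package picks it up from the `CM_API_KEY` environment variable. The key string below is a placeholder.

```r
# Placeholder: replace with the key shown in your CensusMapper profile.
Sys.setenv(CM_API_KEY = "<your API key>")

# Assumption: cancensus also ships a helper that can store the key permanently:
# cancensus::set_cancensus_api_key("<your API key>", install = TRUE)
```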
]
.pull-right[
<img src="https://raw.githubusercontent.com/mountainMath/cancensus/master/images/cancensus-sticker.png" alt="cancensus" style="height:500px;margin-top:-80px;">
]

---
class: medium-code short-title
## Census data

```r
poverty_data <- get_census("2021",regions=list(CMA="24462"),vectors=c(lico_at="v_CA21_1085"),
                           geo_format="sf",level="CT")
ggplot(poverty_data,aes(fill=lico_at/100)) +  
  geom_sf() + lico_map_theme +
  labs(title="Montréal share of people in poverty (LICO-AT)") 
```

---
class: medium-code short-title
## Adapting for another region

```r
get_census("2021",regions=list(CSD="3520005"),vectors=c(lico_at="v_CA21_1085"),
                           geo_format="sf",level="CT") |> 
  ggplot(aes(fill=lico_at/100)) +
  geom_sf() +  lico_map_theme + 
  labs(title="Toronto share of people in poverty (LICO-AT)") 
```

---
# cansim
.pull-left[
The [`cansim` R package](https://mountainmath.github.io/cansim/) interfaces with the StatCan NDM that replaces the former CANSIM tables. It can be queried for
- whole tables
- specific vectors
- data discovery searching through tables

It encodes the metadata and makes it possible to work with the internal hierarchical structure of the fields. Data tables can be cached locally in an SQLite database for faster querying, which is especially useful for large tables that are rarely updated.
]
.pull-right[
<img src="https://raw.githubusercontent.com/mountainMath/cansim/master/images/cansim-sticker.png" alt="cansim" style="height:500px;margin-top:-80px;">
]

---
class: medium-code small-table short-title
# Income over time
To look at individual income by age group over time, we first search for a suitable data series.

```r
search_cansim_cubes("Income of individuals by age group") |> select(1,2,6,7) |> knitr::kable()
```

|cansim_table_number |cubeTitleEn                                                                                                         |cubeStartDate |cubeEndDate |
|:-------------------|:-------------------------------------------------------------------------------------------------------------------|:-------------|:-----------|
|11-10-0239          |Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas |1976-01-01    |2020-01-01  |

--
.tiny-output[
Can check metadata with

```r
get_cansim_table_overview("11-10-0239")
```

```
## Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas
## CANSIM Table 11-10-0239
## Start Reference Period: 1976-01-01, End Reference Period: 2020-01-01, Frequency: Annual
## 
## Column Geography (21)
## Atlantic provinces, Quebec, Ontario, Prairie provinces, British Columbia, Québec, Quebec, Montréal, Quebec, Ottawa-Gatineau, Ontario/Quebec, Toronto, Ontario, Winnipeg, Manitoba, ...
## 
## Column Age group (8)
## 16 to 24 years, 25 to 54 years, 55 to 64 years, 65 years and over, 25 to 34 years, 35 to 44 years, 45 to 54 years, 16 years and over
## 
## Column Sex (3)
## Males, Females, Both sexes
## 
## Column Income source (16)
## Market income, Government transfers, Employment income, Investment income, Retirement income, Other income, Wages, salaries and commissions, Self-employment income, COVID-19 benefits, Old Age Security (OAS) and Guaranteed Income Supplement (GIS), ...
## 
## Column Statistics (5)
## Number of persons, Number with income, Aggregate income, Average income (excluding zeros), Median income (excluding zeros)
```
]

---
# Querying data from the StatCan NDM

Load the entire table via `get_cansim`. The data is cached for the current session.

```r
income_data <- get_cansim("11-10-0239") |>
  filter(GEO=="Canada", Sex=="Both sexes",
         Statistics=="Median income (excluding zeros)",
         !(`Age group` %in% c("16 years and over","65 years and over")),
         `Income source`=="Total income")
```

Alternatively, cache the data across sessions in SQLite via `get_cansim_sqlite`. The workflow is almost identical, but we also need to "collect" the data.

```r
income_data <- get_cansim_sqlite("11-10-0239") |>
  filter(GEO=="Canada", Sex=="Both sexes",
         Statistics=="Median income (excluding zeros)",
         !(`Age group` %in% c("16 years and over","65 years and over")),
         `Income source`=="Total income") |>
  collect_and_normalize(disconnect = TRUE)
```

This method performs automatic checks to see if the locally cached SQLite is out of date.

---
class: medium-code short-title
## Income by age groups

```r
ggplot(income_data, aes(x=Date, y=VALUE, color=`Age group`)) + 
  geom_line() + line_theme +
  labs(title="Median income by age group in Canada", y=unique(income_data$UOM))
```

---
class: medium-code
## Mixing census data with StatCan Tables
The packages are designed to easily integrate data from different sources.

```r
get_census("2016", regions=list(CMA="35535"), geo_format = 'sf', level="CT") |>
  left_join(get_cansim("11-10-0074") |> select(GeoUID,`D-index`=VALUE),  by="GeoUID") |>
  ggplot(aes(fill=`D-index`)) + 
  geom_sf() + d_index_theme +
  labs(title="Income divergence index 2017", caption="StatCan Table 11-10-0074")
```

---

# **cmhc**
.pull-left[
CMHC has a wealth of housing data. The [`cmhc` R package](https://mountainmath.github.io/cmhc/) interfaces with the CMHC Housing Market Information Portal (HMIP) to provide programmatic and reproducible access to housing data.

The functionality is limited because of the design of the HMIP, which is more of a web interface than a data portal. The package has an interactive query builder `select_cmhc_table()` that can be run in the console to help build queries for CMHC data. 
]
.pull-right[
<img src="https://raw.githubusercontent.com/mountainMath/cmhc/master/images/cmhc-sticker.png" alt="cmhc" style="height:500px;margin-top:-80px;">
]

---
class: medium-code short-title
# Housing completions in Toronto

```r
completions <- get_cmhc(survey = "Scss", series = "Completions", 
                        dimension = "Intended Market", 
                        breakdown = "Historical Time Periods", 
                        frequency = "Annual", geo_uid = "35535")

ggplot(completions,aes(x=Date,y=Value,colour=`Intended Market`)) +
  geom_point(shape=21) + geom_line() +
  scale_y_continuous(labels=scales::comma) +
  labs(title="Housing completions",y="Annual number of units", 
       x=NULL,caption="CMHC Scss")
```

---
class: medium-code short-title
# Currently under construction

```r
get_census("2016", regions=list(CMA="35535"), geo_format = 'sf', level="CT") |>
  left_join(get_cmhc(survey = "Scss", series = "Under Construction", 
                     dimension = "Intended Market", breakdown = "Census Tracts", 
                     geo_uid = "35535",year="2022",month="09") |>
              filter(`Intended Market`=="All") |> select(GeoUID,Value),  
            by="GeoUID") |>
  ggplot(aes(fill=Value)) +
  geom_sf() + under_construction_theme +
  labs(title="Units under construction September 2022", caption="CMHC Scss")
```

---
# What to do when geographies change?
.pull-left[
One common problem when doing analysis with (Canadian) geographic data is that geographies aren't stable over time. There are three ways to deal with this problem:

1. Order a custom tabulation on a constant geography of choice. This is the best solution, but it is not always possible, and where it is possible (e.g. census data) it costs time and money.
2. Estimate data on a fixed geography, e.g. via areal interpolation or more refined methods like dasymetric approximation. Fine for gimmicky purposes, but not suited for analysis. It is very hard to do this without introducing bias.
3. [**tongfen**](https://mountainmath.github.io/tongfen/): Create a semi-custom tabulation of the data on a slightly coarser least common denominator geography.
]
.pull-right[
<img src="https://raw.githubusercontent.com/mountainMath/tongfen/master/images/tongfen-sticker.png" alt="tongfen" style="height:500px;margin-top:-80px;">
]
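
To see why estimating on a fixed geography (option 2 above) can introduce bias, here is a toy sketch in base R. All numbers are made up: two tracts each overlap a target zone with half of their area, and the "true" shares are a hypothetical scenario where most people in the first tract live outside the overlap.

```r
# Source tracts with known population (made-up numbers).
tracts <- data.frame(tract = c("T1", "T2"), pop = c(1000, 200))

# Share of each tract's AREA that falls inside the target zone.
area_share <- c(T1 = 0.5, T2 = 0.5)

# Areal estimate: assumes people are spread uniformly within each tract.
areal_estimate <- sum(tracts$pop * area_share[tracts$tract])

# Hypothetical truth: only 10% of T1's residents live in the overlapping
# half (the rest of that half is, say, parkland).
true_share <- c(T1 = 0.1, T2 = 0.5)
true_value <- sum(tracts$pop * true_share[tracts$tract])

c(areal_estimate = areal_estimate, true_value = true_value)
```

The gap between the two numbers comes entirely from the uniform-density assumption, and nothing in the tract-level data reveals it.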

---
# TongFen
TongFen (通分) means to convert two fractions to the least common denominator, typically in preparation for further manipulation like addition or subtraction. In English, that’s a mouthful and sounds complicated. But in Chinese there is a word for this, TongFen, which makes this process appear very simple.

--
Analogously, the TongFen package finds the least common denominator geography and aggregates the data. In effect, it produces semi-custom tabulations on the fly on a slightly coarser geography.
Three steps:
1. Generate correspondence file for least common geography.
2. Create metadata to specify how to aggregate up data.
3. Build the aggregated spatial dataframe on common geography.
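
The three steps can be sketched on toy data in base R. This only illustrates the idea and is not the package's implementation. The tract IDs, populations, and the helper names `common_geography()` and `aggregate_additive()` are all made up.

```r
# Step 1: correspondence between 2016 and 2021 tracts (a row means overlap).
# A is unchanged, B was split into b1 and b2, C and D were merged into cd.
correspondence <- data.frame(
  id_2016 = c("A", "B", "B", "C", "D"),
  id_2021 = c("a", "b1", "b2", "cd", "cd")
)

# Group tracts that are linked through the correspondence by propagating
# the smallest group label until nothing changes.
common_geography <- function(corr) {
  group <- as.integer(factor(corr$id_2016))
  repeat {
    previous <- group
    group <- ave(group, corr$id_2021, FUN = min)
    group <- ave(group, corr$id_2016, FUN = min)
    if (identical(group, previous)) break
  }
  corr$tongfen_id <- as.integer(factor(group))
  corr
}
corr <- common_geography(correspondence)

# Step 2: "metadata" here is simply that population is additive, so we sum.
aggregate_additive <- function(data, id_col, corr) {
  lookup <- unique(corr[, c(id_col, "tongfen_id")])
  merged <- merge(data, lookup, by = id_col)
  aggregate(pop ~ tongfen_id, data = merged, FUN = sum)
}

# Step 3: build the aggregated data on the common geography.
pop_2016 <- data.frame(id_2016 = c("A", "B", "C", "D"),
                       pop = c(100, 400, 150, 50))
pop_2021 <- data.frame(id_2021 = c("a", "b1", "b2", "cd"),
                       pop = c(110, 250, 200, 220))
unified <- merge(aggregate_additive(pop_2016, "id_2016", corr),
                 aggregate_additive(pop_2021, "id_2021", corr),
                 by = "tongfen_id", suffixes = c("_2016", "_2021"))
unified
```

The result has three regions instead of four, and both years are now directly comparable without any estimation.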

---
class: small-code short-title
# Change in income in Vancouver (no TongFen)

.pull-left.width60[
We use the [CensusMapper API tool](https://censusmapper.ca/api) to assemble the vectors for average income of economic families.

```r
regions <- list(CSD="5915022")
income_vectors <- c("2021"="v_CA21_990", "2016"="v_CA16_4994",
                    "2011"="v_CA11N_2457", "2006"="v_CA06_1803")
get_income_data <- function(year){
  get_census(year,regions=regions, geo_format="sf",level="CT",
             vectors=c(ef_income=income_vectors[[year]])) |>
    mutate(Year=year)
}
```
]
.pull-right.width40[
The wrapper function makes the data import and mapping easy.

```r
income_data <- seq(2006,2021,5) |> 
  as.character() |>
  map_df(get_income_data)
ggplot(income_data) + 
  geom_sf(aes(fill=ef_income)) + 
  income_map_theme+
  labs(fill="Current dollars")
```
]
<img src="canssi_files/figure-html/unnamed-chunk-18-1.png" width="3900" />

---
class: medium-code short-title
# Adjusting for inflation (still no TongFen)
.pull-left[
Vector `v41693271` gives the annual consumer price index for Canada.

```r
inflation <- get_cansim_vector("v41693271") |>
  mutate(Year=strftime(Date,"%Y")) |>
  select(Year,CPI=val_norm) |>
  filter(Year %in% names(income_vectors)) |>
  mutate(CPI=CPI/last(CPI,order_by = Year))
```
]
.pull-right[
We join the CPI onto the income data and divide by it.

```r
income_data |> 
  left_join(inflation,by="Year") |>
  ggplot() +
  geom_sf(aes(fill=ef_income/CPI)) +
  income_map_theme +
  labs(fill="Constant 2021 dollars")
```

]
<img src="canssi_files/figure-html/unnamed-chunk-21-1.png" width="3900" />
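
The normalization step can be checked on a toy table in base R. The CPI and income values are made up for easy arithmetic and are not StatCan data.

```r
# Made-up CPI values for the four census years.
cpi <- data.frame(Year = c("2006", "2011", "2016", "2021"),
                  CPI  = c(100, 110, 120, 125))

# Rescale so that the most recent year equals 1 (constant 2021 dollars).
cpi$CPI <- cpi$CPI / cpi$CPI[cpi$Year == max(cpi$Year)]

# Made-up incomes in current dollars; dividing by the rescaled CPI
# expresses them in constant 2021 dollars.
income <- data.frame(Year = c("2006", "2021"), ef_income = c(80000, 100000))
income <- merge(income, cpi, by = "Year")
income$real_income <- income$ef_income / income$CPI
income
```

In this toy case the 2006 and 2021 incomes are equal in constant dollars, so the apparent 25% increase is entirely inflation.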

---
class: small-output medium-code
# TongFen to make data comparable
Getting the data on a uniform geography is very easy using the **tongfen** package. Metadata gets built automatically.

```r
meta <- meta_for_ca_census_vectors(income_vectors)
meta
```

```
## # A tibble: 8 × 10
##   variable     label        dataset type     aggregation             units    rule  parent geo_d…¹  year
##   <chr>        <chr>        <chr>   <chr>    <chr>                   <chr>    <chr> <chr>  <chr>   <int>
## 1 v_CA21_990   2021         CA21    Original Average of v_CA21_989   Currency Aver… v_CA2… CA21     2021
## 2 v_CA16_4994  2016         CA16    Original Average of v_CA16_4993  Currency Aver… v_CA1… CA16     2016
## 3 v_CA11N_2457 2011         CA11N   Original Average of v_CA11N_2455 Currency Aver… v_CA1… CA11     2011
## 4 v_CA06_1803  2006         CA06    Original Average of v_CA06_1801  Currency Aver… v_CA0… CA06     2006
## 5 v_CA21_989   v_CA21_989   CA21    Extra    Additive                <NA>     Addi… <NA>   CA21     2021
## 6 v_CA16_4993  v_CA16_4993  CA16    Extra    Additive                <NA>     Addi… <NA>   CA16     2016
## 7 v_CA11N_2455 v_CA11N_2455 CA11N   Extra    Additive                <NA>     Addi… <NA>   CA11     2011
## 8 v_CA06_1801  v_CA06_1801  CA06    Extra    Additive                <NA>     Addi… <NA>   CA06     2006
## # … with abbreviated variable name ¹geo_dataset
```
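
The `parent` column matters because averages cannot simply be averaged when regions are merged. A toy sketch of the idea in base R, with made-up numbers and not necessarily the package's exact implementation: weight each average by its parent count.

```r
# Three tracts; tracts 2 and 3 are merged into one common region.
tracts <- data.frame(
  tongfen_id = c(1, 2, 2),
  families   = c(200, 300, 100),        # parent variable (a count, additive)
  avg_income = c(90000, 80000, 120000)  # average, not additive
)

# Turn averages into aggregates, sum both, then divide again.
tracts$aggregate_income <- tracts$avg_income * tracts$families
totals <- aggregate(cbind(aggregate_income, families) ~ tongfen_id,
                    data = tracts, FUN = sum)
totals$avg_income <- totals$aggregate_income / totals$families
totals
```

For the merged region the weighted average is 90,000, while a naive mean of 80,000 and 120,000 would give 100,000.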

--
Getting data for the four census years on uniform geography is a simple function call.

```r
unified_income_data <- get_tongfen_ca_census(regions,meta)
```

---
class: medium-code
# Mapping TongFen data

```r
ggplot(unified_income_data) +
  geom_sf(aes(fill=`2021`/`2006`*filter(inflation,Year=="2006")$CPI-1)) +
  tongfen_map_theme + 
  labs(title="Change in average income of economic families") 
```

---
class: medium-code short-title
# TongFen for US census data

This builds on the [**tidycensus** package](https://walker-data.com/tidycensus/) to ingest US census data and Census Bureau correspondence files. Metadata has to be assembled by hand.
.pull-left.width60[

```r
meta <- bind_rows(
  meta_for_additive_variables(
    "dec2000",c(pop_2000="H011001",hh_2000="H013001")),
  meta_for_additive_variables(
    "dec2010",c(pop_2010="H011001",hh_2010="H013001"))
)

census_data <- get_tongfen_us_census(
  regions = list(state="CA"), 
  meta=meta, level="tract") |>
  mutate(change=pop_2010/hh_2010-pop_2000/hh_2000) |>
  mutate(c=cut(change, c(-Inf,-0.5,-0.3,-0.2,-0.1,
                         0,0.1,0.2,0.3,0.5,Inf))) 
```

```r
ggplot(census_data) +
  geom_sf(aes(fill=c), linewidth=0.05) +  
  us_map_theme + 
  labs(title="Bay area change in average household size 2000-2010")
```
]
.pull-right.width40[
<img src="canssi_files/figure-html/unnamed-chunk-29-1.png" width="2100" />
]

---
class: short-title
# General TongFen
.pull-left.width70[
![](https://mountainmath.github.io/tongfen/articles/polling_districts_files/figure-html/unnamed-chunk-6-1.png)
![](https://mountainmath.github.io/tongfen/articles/polling_districts_files/figure-html/unnamed-chunk-7-1.png)
]
.pull-right.width30[
TongFen can be applied to any geographic data, for example polling district geographies.

Polling districts change from election to election. But polling districts usually follow streets and can be joined up to form a least common denominator geography.

The usefulness of this approach depends on how extensively the polling districts change. 
]

---
class: medium-code short-title
# Estimating Canadian census data on fixed geographies

TongFen has built-in end-to-end estimation of census data on arbitrary geographies.

.pull-left[

```r
meta <- meta_for_ca_census_vectors(
  c(rent = "v_CA21_4317")) |>
  mutate(downsample="Households")
```
In the metadata we also specify how to downsample the data to census block level.

```r
station_data <- tongfen_estimate_ca_census(
  COV_SKYTRAIN_STATIONS, meta=meta,
  level = "DA", na.rm=TRUE,
  intersection_level = "CT",
  downsample_level = "DB") 
```
We determine the relevant geographic extent of the data at CT level, use DA level rent data, ignore `NA` values (due to no or only a small number of rentals in the area), and downsample the data to DB level based on households to improve estimation precision.
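
A toy sketch in base R of why downsampling helps, with made-up numbers. It is not the package's implementation, and the `in_buffer` flag is a hypothetical stand-in for the spatial intersection with a station area.

```r
# DA-level average rent (made-up).
da_rent <- data.frame(da = c("DA1", "DA2"), rent = c(1500, 2000))

# DB-level household counts, and whether each block is near the station.
blocks <- data.frame(
  da         = c("DA1", "DA1", "DA2", "DA2"),
  households = c(300, 100, 50, 350),
  in_buffer  = c(TRUE, FALSE, TRUE, FALSE)
)

# Each block inherits the rent of its DA, weighted by its households.
blocks <- merge(blocks, da_rent, by = "da")
inside <- blocks[blocks$in_buffer, ]
estimate <- sum(inside$rent * inside$households) / sum(inside$households)

# Compare with ignoring where households actually are within each DA.
c(downsampled = estimate, unweighted = mean(da_rent$rent))
```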

]
.pull-right[

```r
ggplot(station_data) +
  cov_background +
  geom_sf(aes(fill=rent),alpha=0.7) +
  rent_map_theme
```

<img src="canssi_files/figure-html/unnamed-chunk-33-1.png" width="1590" />
]

---
class: short-title
# Expanding the ecosystem
.pull-left[
### More access to StatCan data under development
* [**statcanXtabs**](https://mountainmath.github.io/statcanXtabs/) to help import and process census cross-tabulations
* [**canbank**](https://mountainmath.github.io/canbank/) to access Bank of Canada data
* [**canpumf**](https://mountainmath.github.io/canpumf/) to access and work with StatCan public use micro data (PUMF).

### Selection of other Canadian data packages by other people and groups:
* [**rcanvec**](https://github.com/paleolimbot/rcanvec) NRCan topographic map data
* [**weathercan**](https://docs.ropensci.org/weathercan/) to query Environment Canada climate and weather data
* [**cesR**](https://hodgettsp.github.io/cesR/) Canadian Election Study Datasets
* [**tidyhydat**](https://docs.ropensci.org/tidyhydat/) Access to Canadian hydrometric data
]
.pull-right[
### Provincial and local data
* [**bcdata**](https://bcgov.github.io/bcdata/) Provincially maintained package for accessing BC Open Data catalogue
* [**opendatatoronto**](https://sharlagelfand.github.io/opendatatoronto/) To access City of Toronto Open Data
* [**VancouvR**](https://mountainmath.github.io/VancouvR/) To access City of Vancouver Open Data

### More resources and examples for using packages introduced here
* [Analyzing Canadian Demographic and Housing Data](https://mountainmath.github.io/canadian_data/), an eBook teaching basic descriptive data analysis using Canadian data in R through a problem-based approach.
* [My blog](https://doodles.mountainmath.ca) with lots of examples using Canadian data and links to code
]

---
class: center, middle, inverse
### Thanks for bearing with me.

These slides are online at [https://mountainmath.ca/canssi/](https://mountainmath.ca/canssi/). The R notebook that generated them includes the code that pulls in the data and makes the graphs, and it [lives on GitHub](https://github.com/mountainMath/presentations/blob/master/canssi.Rmd).

<hr>

You can find me on Twitter at [@vb_jens](https://twitter.com/vb_jens), or (occasionally) on [LinkedIn](https://www.linkedin.com/in/vb-jens/), or (possibly) on Mastodon at [@vb_jens@econtwitter.net](https://econtwitter.net/web/@vb_jens).