class: center, middle, inverse, title-slide

# Taking the Pain Out of Census and Other Public Data
## Towards Frictionless, Transparent, Reproducible and Adaptable Analysis
### Jens von Bergmann
### 2018/01/19

---
# Goals for this talk:

1) Get people interested in census and other public data

2) Showcase some tools to work with census and other public data

3) Highlight some obstacles when working with census data

4) Help build a culture around civic data analysis

---
# So much data, so little time

Data analysis and communication of results take a lot of time. So I built tools that facilitate and greatly speed this up while increasing *transparency*, *reproducibility* and *adaptability*.

I want to showcase some of these tools dealing with

* Property Level Data
* Census Data
* CMHC Data
* ...

---
# Property Level Data

Property level data lets us explore civic questions at the individual property level. Great detail, but missing demographic variables. Coverage is also sketchy, and lots of important variables aren't publicly accessible.

Research institutions have access to more detailed and comprehensive data, but it's cumbersome to work with. Because of barriers to access we are mostly stuck doing simple descriptive analysis and visualizations.

Examples:

* [Assessment (and related) Data](https://mountainmath.ca/map/assessment)
* [Teardown Index data story](https://mountainmath.ca/teardowns)
* [Tax Density](https://mountainmath.ca/assessment_gl/map)
* [Houses and Dirt Explorer](https://mountainmath.ca/assessment/split_map)

???

Sadly, most useful data is not publicly available. It can be accessed for research through a cumbersome process, and results can't be shared unless detail is dropped and they are aggregated to a high level.

---
background-image: url("phrn_files/figure-html/houses_and_dirt.png")
background-position: 50% 50%
background-size: 100%
class: center, bottom, inverse

# <a href="https://mountainmath.ca/assessment/split_map" target="_blank">Vancouver Houses And Dirt Demo</a>

???

A visualization that separates land and improvement values and allows stepping through time and exploring the effect of missing middle housing on prices.

---
# CensusMapper

CensusMapper is my answer to the inaccessibility of census data to non-experts.

It allows instant and flexible mapping of census data. Canada-wide. Maps can be narrated, saved and shared. By anyone.

---
background-image: url("https://doodles.mountainmath.ca/images/net_van.png")
background-position: 50% 50%
background-size: 100%
class: center, bottom, inverse

# <a href="https://censusmapper.ca/maps/731" target="_blank">CensusMapper Demo</a>

???

Lots of hidden features too that aren't accessible to the general public. I don't have the resources to make them more user-friendly and release them to the public for free.

---
# Maps aren't analysis

CensusMapper has APIs to facilitate deeper analysis. Open for all to use.

[`cancensus`](https://github.com/mountainMath/cancensus) is an R package that seamlessly integrates census data into data analysis in R.

Let's try to understand the effects of net migration patterns by age on the age distribution.

???

While we do need better data, we don't make good use of the data we already have. What's needed most is analysis.

---
# Age pyramids

How does net migration affect the age distribution in each municipality?

![](working_with_census_files/figure-html/canada_age-1.svg)<!-- -->

--

How to get the data to easily make these graphs?

???

Explain how net migration patterns lead to different age distributions.

---
background-image: url("images/api_tool.png")
background-position: 50% 50%
background-size: 100%
class: center, bottom, inverse

# <a href="https://censusmapper.ca/api" target="_blank">CensusMapper API Demo</a>

---
# Putting the "open" into StatCan open data

* CensusMapper made StatCan census data accessible to non-experts. For mapping and browsing.
* API extensions for non-mapping purposes make custom data extracts accessible to everyone.
* [`cancensus`](https://github.com/mountainMath/cancensus) is an R wrapper for these APIs that makes analysis accessible to everyone.

--

.center[**Well, maybe not everyone. But everyone in this room.**]

---
# Non-census data

CMHC provides great housing-related data. It's a pain to download, so I built a [pseudo-API in R](https://github.com/mountainMath/cmhc).

```r
library(ggplot2)

# Vacancy rate and rent change series for four CMAs
cmhc <- get_vacancy_rent_data(c("Vancouver","Toronto","Calgary","Winnipeg"),"CMA")

# vanancy_plot_options is defined outside this chunk (name kept as in the source)
ggplot(cmhc, aes(x = Year, y = Rate, color = Series)) +
  vanancy_plot_options +
  geom_line() +
  geom_point() +
  facet_wrap("city", ncol=2)
```

![](working_with_census_files/figure-html/unnamed-chunk-3-1.svg)<!-- -->

???

CMHC has recently made finer data available. Sadly no APIs, but we can hack their data portal to speed up analysis. So we built a pseudo-API to consume it.

This graph shows the primary market vacancy rate and the fixed-sample rent change on the same axis. We note the clear inverse relationship between the two, with sometimes strong responses in non-rent-controlled Calgary. And yes, rents do drop when the vacancy rate is high.

---
# Keeping the Census fresh

The 2016 census data is still quite up to date. But the clock is ticking. How can we keep it fresh?

![](working_with_census_files/figure-html/unnamed-chunk-4-1.svg)<!-- -->

CANSIM data includes census undercounts. We can use relative changes in CANSIM data to estimate changes in Census data.

???

A retroactive look.

---
# Where in Vancouver did people move to?

CMHC building data can tell us where people go, and we can use migration data from past censuses to make educated guesses about demolition rates and the demographics of the new units.

![](working_with_census_files/figure-html/unnamed-chunk-5-1.svg)<!-- -->

???

Mixing in concurrent data sources like CMHC and CANSIM can extend the useful life of census data. Data APIs designed to be easily integrated facilitate this. And APIs make it simple to update analysis when new data becomes available.

---
# Reality Check

Census and CMHC timelines of completions don't always line up. CMHC does not track demolitions well.

![](working_with_census_files/figure-html/unnamed-chunk-6-1.svg)<!-- -->

???

Allows us to estimate where people moved to, and who these people are.

---
# Rental Listings Data

Another important data source to inform how our city is changing is rental data.

![](working_with_census_files/figure-html/price_map-1.svg)<!-- -->

???

Only showing data for areas with at least 10 listings.

---
# Challenges

One of the biggest challenges I face on a daily basis is the need for robust and easily adaptable ecological inference models.

An [example](https://www.washingtonpost.com/news/monkey-cage/wp/2016/12/02/donald-trump-did-not-win-34-of-latino-vote-in-texas-he-won-much-less/?utm_term=.6d12061de8c4)

Problem: Given the number of people that voted for Clinton and Trump in each precinct, as well as the number of Hispanic and White eligible voters and overall voter turnout in each precinct, estimate the Latino turnout and the share of Latinos that voted for Trump.

We know the relationship between these quantities at the aggregate (ecological) level, but want to conclude something about the relationship at the individual voter level.

---
# A Simple Example

Consider an example where we know the answer. Take the number of households spending more than 30% of income on shelter in each census tract in Metro Vancouver, as well as the share of Owner and Tenant households.

We want to know what share of Owner and Renter households each spend more than 30% of income on shelter.

![](working_with_census_files/figure-html/unnamed-chunk-8-1.svg)<!-- -->

---
# Ecological Inference

Ecological inference builds a distribution over the space of our quantities of interest, the share of owners `\(\beta^w\)` and the share of renters `\(\beta^b\)` spending more than 30% of income on shelter.

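
A sketch of the usual setup, with two symbols introduced here for illustration: `\(X_i\)` is the share of owner households in tract `\(i\)`, and `\(T_i\)` is the share of all households in that tract spending more than 30% of income on shelter. Within each tract the two are tied together by an accounting identity:

`$$T_i = \beta_i^w X_i + \beta_i^b (1 - X_i)$$`

Only `\(X_i\)` and `\(T_i\)` are observed, so each tract restricts `\((\beta_i^b, \beta_i^w)\)` to a line segment in the unit square and does not pin down a point. Goodman regression makes the strong assumption that both shares are the same in every tract, which turns the identity into a linear regression:

`$$T_i = \beta^b + (\beta^w - \beta^b) X_i + \varepsilon_i$$`

The intercept then estimates `\(\beta^b\)`, and the intercept plus the slope estimates `\(\beta^w\)`.
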
![](working_with_census_files/figure-html/unnamed-chunk-10-1.svg)<!-- -->

---

![](working_with_census_files/figure-html/unnamed-chunk-11-1.svg)<!-- -->

Tenants spending > 30% of income on shelter: 41%, Goodman Reg: 51.5%, Actual: 43.5%

Owners spending > 30% of income on shelter: 26.8%, Goodman Reg: 20.4%, Actual: 25.4%

---
# Mapping the Residuals

A geographic check of the residuals reveals where we went wrong. In regular examples we don't have this information and have to rely on other tests to understand the presence of biases in our model and refine it.

![](working_with_census_files/figure-html/unnamed-chunk-14-1.svg)<!-- -->

---
# Reproducibility, Transparency, Adaptability

We need to adopt a more collaborative approach to understanding civic issues.

.pull-left[
### Notebooks
A data Notebook is a document that integrates explanatory text and data analysis. In its crudest form this could be an Excel spreadsheet with embedded comments. At the other end of the spectrum are R or Python Notebooks. In fact, this presentation is an R notebook and [lives on GitHub](https://github.com/mountainMath/presentations/blob/master/working_with_census.Rmd). It contains all the code to reproduce the graphs in the presentation.
]
.pull-right[
### APIs
In order to be reproducible, any analysis should ship with code and the data. But that's not very adaptable. To be adaptable, the data should come through APIs. That way one can easily make changes that require slightly different data, e.g. use related census variables, other time frames or geographic regions.
]

--

.center[**This should become the standard basis for public policy.**]

--

I will leave you with some quiz questions.

???

This is key to building an ecosystem of people and groups that collaborate to advance understanding of civic issues. Opening up your analysis for everyone to see and pick apart might be uncomfortable at first, but it's essential to take the discussion to the next level.
It increases transparency and trust, and allows others to build on your work.

---
# Question 1

Has affordability in the City of Vancouver gotten better or worse?

--

![](working_with_census_files/figure-html/unnamed-chunk-17-1.svg)<!-- -->

Diverging narratives that need to be reconciled: At the ecological level, it looks like things got worse. At the individual level, it looks like they got better.

---
# Question 2

Which city has higher incomes, Toronto or Vancouver?

--

![](working_with_census_files/figure-html/income-1.svg)<!-- -->

---
# Question 3

What share of Toronto and Vancouver residential properties are owned by owner-occupiers, investors living in Canada and investors living abroad?

--

![](working_with_census_files/figure-html/investors-1.svg)<!-- -->

---
class: center, middle

Thanks for bearing with me. These slides are online at https://mountainmath.ca/working_with_census.html. The R notebook that generated them includes the code that pulls in the data and makes the graphs, and it [lives on GitHub](https://github.com/mountainMath/presentations/blob/master/working_with_census.Rmd).

???

Our discussions rarely move beyond presenting a simple quotient. We need to move beyond viewing the world through single census variables or simple percentages and dig deeper into the very complex issues we are facing.