-
Notifications
You must be signed in to change notification settings - Fork 12
/
WebinarSneakPeek.Rmd
344 lines (260 loc) · 11.9 KB
/
WebinarSneakPeek.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
---
title: "Webinar Sneak Peek"
author: "Joy Payton"
date: "3/18/2019"
output:
html_document:
toc: yes
toc_float: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE)
library(printr)
```
## Get An Interesting GeoJSON File
Geospatial data can be expressed in maps, which can offer amazing levels of data density.
You're probably familiar with JSON, which is frequently used to store and pass data between applications. GeoJSON applies JSON structure to geospatial data in a single JSON file. Let's get data about child blood lead levels from the City of Philadelphia:
```{r}
library(rgdal)
philly_lead_map <- readOGR('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=geojson&skipfields=cartodb_id')
```
## Sanity Checks
Let's take a peek at the tabular data associated with this geospatial object (not the actual map)
```{r}
head(philly_lead_map@data)
summary(philly_lead_map@data)
```
And let's take a quick peek at the map, to make sure there are no initial, obvious problems:
```{r}
library(leaflet)
library(leaflet.extras)
leaflet(philly_lead_map) %>%
setView(lng = mean(philly_lead_map@bbox['x',], na.rm=TRUE),
lat = mean(philly_lead_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons()
```
Well, the defaults are ugly, but worse, we see some "holes" in our map. This problem was also evident in the data, which had too few rows.
We're missing some tracts, because Philly has 384 tracts, and this data only has 380. Clearly some tracts were not surveyed. Let's get all of Philly's tracts, since we want to map the entire city, and provide imputation as needed. We can get complete GeoJSON file for all the Census tracts in Philly.
## Fill in any Geospatial Holes
```{r}
full_philadelphia_map <- readOGR('http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson')
```
What kind of data is in this map?
```{r}
head(full_philadelphia_map@data)
```
Oh, okay, it has Census Tract data. Is it complete? Let's check... does our data frame now have 384 rows?
```{r}
nrow(full_philadelphia_map@data)
```
Let's do a quick mapping sanity check. We'll change the default color and line width settings this time around:
```{r}
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = "white",
fillOpacity = 1,
label = full_philadelphia_map@data$NAMELSAD10
) %>%
suspendScroll() # Handy to stop those accidental "scrolling by" zooms
```
This map looks great, and the tabular data might prove useful, too, for things like a "prettier" display of the Census Tract name, instead of just a long number. Let's combine our lead data into this fuller map.
## Merging Tabular Data
**WARNING**
`merge` will reorder your rows, and that breaks the relationship between the rows of tabular data and the corresponding polygons. Not what you want! So we're going to keep track of the order thanks to the helpful "OBJECTID" variable, which is super helpful. If you're working with data that doesn't have a column like this, just add it, so you can keep track of what the original order was.
Merging in this case is pretty simple -- we just have to bring in the lead data and make sure our "hinge" (overlapping field) is set up properly:
```{r}
merged <- merge(x = philly_lead_map@data,
y = full_philadelphia_map@data,
by.x = "census_tract",
by.y = "GEOID10",
all = TRUE)
full_philadelphia_map@data <- merged[order(merged$OBJECTID),]
```
## Plot Color-Coded Data (Choropleth)
Let's see what our lead levels look like without any imputation of missing values:
```{r}
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#aaaaaa")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1,
label = full_philadelphia_map@data$NAMELSAD10
) %>%
suspendScroll()
```
## Impute Missing Data
Let's do some imputation of missing values. We'll use knn imputation in part. To do that, we first identify the neighbors for each tract:
```{r}
library(spdep)
coords <- coordinates(full_philadelphia_map)
tracts <- row.names(as(full_philadelphia_map, "data.frame"))
knn10 <- knn2nb(knearneigh(coords, k = 10), row.names = tracts)
knn10 <- include.self(knn10)
```
Impute data for missing rows.
We have two cases here:
* tracts that were included in the lead data set but have redacted numbers (1-5 positives)
* tracts that were lacking in the lead data set
Let's first think about intelligent imputation on the census tracts that were in the lead data dataset, but with redacted numbers. That's almost all our missing data.
### Redacted Data
We know that the actual number of children testing with high blood lead levels in the case of imputation is either 1, 2, 3, 4, or 5 children, so we know the actual possible percentages. So we can do knn imputation, but if the knn imputation comes in lower than the lowest possible percent, we'll just toss in the lowest possible, and do a similar check with the highest possible percent.
Note that knn averaging will often give us an NA, because there are large areas of Philadelphia with no nearby tracts that were tested. In that case, we can simply impute whatever the percentage is for 3 children testing positive (the median value of all possibilities).
### Completely Missing Data
In the case of completely missing data, let's do a second round of imputation. We'll let the first round go and fill in the gaps of redacted data. Then, we'll have good values for neighbors and can do knn imputation.
### Impute
So, first we'll handle redacted data:
```{r}
for (row in 1:nrow(full_philadelphia_map@data)) {
# is this row intentionally redacted?
if (!is.na(full_philadelphia_map@data$data_redacted[row]) &
full_philadelphia_map@data$data_redacted[row] == 1) {
# lowest possible pct
low_pct <- (1/full_philadelphia_map@data$num_screen[row])*100
# highest possible pct
high_pct <- (5/full_philadelphia_map@data$num_screen[row])*100
# the "expected value" or median (and mean) number of kiddos with high lead
median_pct <- (3/full_philadelphia_map@data$num_screen[row])*100
# take the mean of our neighbors
knn_pct <- mean(full_philadelphia_map@data$perc_5plus[unlist(knn10[row])], na.rm=TRUE)
if (is.na(knn_pct)) {
# no way to determine a good guess by neighbors? Choose median.
full_philadelphia_map@data$perc_5plus[row] <- median_pct
}
else if (!is.na(low_pct) & !is.na(knn_pct) & low_pct > knn_pct) {
# knn lower than lowest? Use lowest.
full_philadelphia_map@data$perc_5plus[row] <- low_pct
}
else if (!is.na(high_pct) & !is.na(knn_pct) & high_pct < knn_pct) {
# knn higher than highest? Use highest
full_philadelphia_map@data$perc_5plus[row] <- high_pct
}
else {
# Otherwise, use knn.
full_philadelphia_map@data$perc_5plus[row] <- knn_pct
}
}
}
```
Now let's handle the completely missing data:
```{r}
for (row in 1:nrow(full_philadelphia_map@data)) {
if (is.na(full_philadelphia_map@data$perc_5plus[row])) {
full_philadelphia_map@data$perc_5plus[row] <- mean(full_philadelphia_map@data$perc_5plus[unlist(knn10[row])], na.rm=TRUE)
}
}
```
## Re-map that Jawn
```
jawn
/jôn/
noun DIALECT US
(chiefly in eastern Pennsylvania) used to refer to a thing, place, person, or event that one need not or cannot give a specific name to.
"these jawns are very inexpensive"
```
```{r}
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1
) %>%
suspendScroll()
```
## Enrich!
And here we want to pull in our local file, which is a simplified version of data from the American Community Survey conducted by the Census Bureau. Let's see what is contains.
```{r}
economic_data <- read.csv("../Data/philly_census.csv")
head(economic_data)
```
This is selected economic characteristics of various census tracts. Let's combine the data here with our map, and use labels to allow people to understand the data better:
```{r}
merged_again <- merge(x=full_philadelphia_map@data,
y=economic_data,
by = "census_tract",
all = TRUE)
full_philadelphia_map@data <- merged_again[order(merged_again$OBJECTID),]
str(full_philadelphia_map@data)
```
Let's create some useful labels:
```{r}
labels <- sprintf(
"<strong>%s</strong><br/>
Families Below Poverty Line (%%): %g <br/>
Children With High Blood Lead Levels (%%): %g",
full_philadelphia_map@data$NAMELSAD10,
full_philadelphia_map@data$pct_families_below_poverty_line,
full_philadelphia_map@data$perc_5plus
) %>% lapply(htmltools::HTML)
```
```{r}
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")
) %>%
suspendScroll()
```
But this doesn't really show us the impact of poverty overall. Ideally we'd love to be able to have a dual-purpose choropleth. Let's do that now, using the keyword "group" and some layer controls.
```{r}
poverty_palette <- colorBin("Reds", domain = full_philadelphia_map@data$pct_families_below_poverty_line, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 0.5,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"),
group = "Lead Level"
) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~poverty_palette(full_philadelphia_map@data$pct_families_below_poverty_line),
fillOpacity = 1,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"),
group = "Poverty Level"
) %>%
addLayersControl(
baseGroups = c("Lead Level", "Poverty Level"),
options = layersControlOptions(collapsed = FALSE)
) %>%
suspendScroll()
```