---
title: "Joining Data"
subtitle: "Data Munging"
---
```{r}
#| echo: false
#| message: false
library(dplyr)
raw_data_path <- here::here("data-raw", "wood-survey-data-master")
individual_paths <- fs::dir_ls(fs::path(raw_data_path, "individual"))
load("data-raw/desktop_indiv_paths.RData")
individual <- readr::read_csv(
  file = fs::path(raw_data_path, "vst_individuals.csv"),
  show_col_types = FALSE
)
```
Let's say **our analysis requires that we can geolocate every individual in our analytic data**.
As we've discussed, the various tables we downloaded hold different information collected during the various survey events.
- Plot level metadata
- Individual level tagging metadata
- Individual level repeated measurement data (although we only have a single measurement event per individual in our data set).
**Currently, only the plot is geolocated**, with the coordinates **held in the `decimalLatitude` and `decimalLongitude` columns of `vst_perplotperyear.csv`.**
The **location of each individual stem is defined in `vst_mappingandtagging.csv`.**
A **number of variables are involved**, including:
- **`pointID`** which identifies **a point on a 10m cell grid centred around `decimalLatitude` and `decimalLongitude`**
- **`stemDistance` and `stemAzimuth`** which define the **location of a stem, relative to the location of `pointID`**. The full method used to locate individual stems is detailed in `data-raw/wood-survey-data-master/methods/NEON_vegStructure_userGuide_vA.pdf`.
::: {.callout-note collapse="true"}
## Browse `NEON_vegStructure_userGuide_vA.pdf`
```{r}
#| echo: false
knitr::include_url("data-raw/wood-survey-data-master/methods/NEON_vegStructure_userGuide_vA.pdf")
```
:::
```{r}
#| echo: false
knitr::include_graphics("assets/NEON_point_grid.png")
```
So, **to geolocate our individuals, we need to join information from `vst_perplotperyear.csv` and `vst_mappingandtagging.csv` into our `individual` tibble.**
We use the **family of `*_join` functions** in `dplyr` to merge columns from different tibbles according to values in shared columns.
## Join Basics
There are a **number of joins we can perform with `dplyr`.**
Let's have a look at a few of them with a simple example using some `dplyr` in-built data:
```{r}
band_members
band_instruments
```
The **only variable shared between the two tables is `name`, so this is the only variable we can perform joins over**. By default, any `*_join` function will try to merge on the values of any columns whose names match across the tables being merged.
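If we're unsure which columns a join will match on, a quick check (just a sketch using base R's `intersect()`) is to compare the two sets of column names:
```{r}
# columns shared between the two tables: these are what the *_join functions
# will try to match on by default
intersect(names(band_members), names(band_instruments))
```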
### Inner joins
```{r}
band_members %>% inner_join(band_instruments)
```
**`inner_join`** has merged all three unique columns across the two tables into a single tibble.
It has only kept the rows in which **`name` values had a match in both tables**. In this case only data about `John` and `Paul` was contained in both tables.
### Left joins
```{r}
band_members %>% left_join(band_instruments)
```
`left_join` joins on the `name` values in the left-hand table and **appends any rows from the right-hand table in which values in `name` match**. In this case, there is no data for `Keith` in `band_members` so he is ignored completely. There is also no match for `Mick` in `band_instruments` so `NA` is returned for `plays` instead.
### Right joins
```{r}
band_members %>% right_join(band_instruments)
```
`right_join`, on the other hand, **joins on the `name` values in the right-hand table**. In this case, `Mick` is dropped completely and `Keith` gets `NA` for `band`.
### Full joins
```{r}
band_members %>% full_join(band_instruments)
```
Finally, a `full_join` joins on all unique values of `name` found across the two tables, returning `NA` where there are no matches between the two tables.
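Rather than relying on this default guessing, we can always state the join column explicitly via the `by` argument; for example, the full join above could also be written as:
```{r}
# being explicit about the join column also silences the "Joining with ..." message
band_members %>% full_join(band_instruments, by = "name")
```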
## Joining our tables with `dplyr`
## Join `vst_mappingandtagging.csv` data
Let's start by merging in the data from `vst_mappingandtagging.csv`. First, let's read the data in.
```{r, message=FALSE}
maptag <- readr::read_csv(
  fs::path(raw_data_path, "vst_mappingandtagging.csv")
)
```
This data set contains **taxonomic and within-plot location metadata on individuals** collected during mapping and tagging. There is **one row per individual** in the data set.
```{r}
names(maptag)
```
Let's see which column names the two datasets have in common.
##### Challenge: Finding column name matches in two tables
::: {.callout-note appearance="minimal" collapse="true"}
## Given the two tables we are trying to join, can you write some code that checks which column names in `individual` have matches in `maptag` column names?
```{r}
names(individual)[names(individual) %in% names(maptag)]
```
:::
#### Default `left_join`
Because **we want to match the rest of the tables to our individual data, we use `left_join()`** and supply **`individual` as the first argument** and **`maptag` as the second**.
```{r}
individual %>%
  dplyr::left_join(maptag)
```
Great, we have a merge!
Looks successful, right? **How do we really know nothing has gone wrong, though?**
Remember, **for rows to be linked successfully, the values in every column being joined on must match across both tables; otherwise `NA`s are returned**. So, although our code ran successfully, it may well not have found any matching rows in `maptag` to merge into `individual`.
To check whether things have worked, **we can start by inspecting the output for the columns of interest**, in this case the `maptag` columns we are trying to join into `individual`.
::: callout-tip
When working interactively and testing out pipes, you can pipe objects into `View()` for quick inspection. If you provide a character string as an argument, it is used as the name of the data view tab it launches.
:::
```{r}
#| eval: false
individual %>%
dplyr::left_join(maptag) %>%
View("default")
```
```{r, echo=FALSE}
knitr::include_graphics("assets/view_default_join.png")
```
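We can also confirm this programmatically. A quick check (a sketch, not part of our final pipeline) is to count the `NA`s in the `maptag` columns we expect to have been filled in:
```{r}
# count missing values in the maptag columns after the default join;
# counts equal to the number of rows mean no rows were matched at all
individual %>%
  dplyr::left_join(maptag) %>%
  dplyr::summarise(
    dplyr::across(
      c(stemDistance, stemAzimuth, pointID),
      ~ sum(is.na(.x))
    )
  )
```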
Clearly this has not worked! We need to start digging into why, but **we don't want to have to keep manually checking whether it worked** or not. Enter **DEFENSIVE PROGRAMMING**.
## Defensive programming with data
As I mentioned in the Data Management Basics slides, **[`assertr`](https://github.com/ropensci/assertr) is a useful package for including validation checks in our data pipelines.**
In our case, **we can use `assertr` function `assert` to check that certain columns of interest (`stemDistance`, `stemAzimuth`, `pointID`) are joined successfully (i.e. they contain no `NA` values)**.
::: callout-caution
Note that *this only works because I know for a fact that there is data available for all individuals*. There **may be situations in which `NA`s are valid missing data, in which case this would not be an appropriate test.**
:::
Let's introduce this check into our pipeline:
```{r}
#| error: true
individual %>%
  dplyr::left_join(maptag) %>%
  assertr::assert(assertr::not_na, stemDistance, stemAzimuth, pointID)
```
The `assertr::assert` function applies the predicate `assertr::not_na` to the columns `stemDistance`, `stemAzimuth` and `pointID`, checking that none of them contains `NA` values.
By appending it to our pipeline, the function will throw an error if the assertion returns `FALSE` for any of the columns `stemDistance`, `stemAzimuth` or `pointID`. If the assertion passes, the pipeline continues as expected.
**By including this check, I don't have to guess or manually check whether the merge has been successful. The code will just error if it hasn't** :raised_hands:.
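As an aside, when `NA`s *are* valid missing data and `not_na` is too strict a test, a complementary check (a sketch using `assertr::verify()`) is to confirm that a left join has not duplicated or dropped any rows:
```{r}
# a left join should return exactly one row per row of `individual`;
# verify() throws an error if the expression evaluates to FALSE
individual %>%
  dplyr::left_join(maptag) %>%
  assertr::verify(nrow(.) == nrow(individual))
```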
## Debugging hidden mismatches
I've shown **the most minimal implementation, in which `dplyr` does a lot of the guessing for us and tries to join on all matched columns**. But, as we've just seen, that can generate mismatches between tables.
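One way to see why the default join found no matches (a sketch, assuming `uid` is a record-level identifier unique to each table) is to check whether values in the shared columns actually correspond across the two tables:
```{r}
# uid identifies a record within its own table, so its values should never
# match across tables; joining on it (among other columns) links no rows
sum(individual$uid %in% maptag$uid)
```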
**We know that the only column we are interested in matching on is `individualID`**. We want to get the mapping associated with each individual, regardless of when the mapping was collected.
**Let's be specific about which variables we want to join on through argument `by`.**
```{r}
individual %>%
  dplyr::left_join(maptag,
    by = "individualID"
  ) %>%
  assertr::assert(assertr::not_na, stemDistance, stemAzimuth, pointID)
```
**Excellent! Our code runs and our resulting merged tibble contains data for all the variables we are interested in!**
However, on closer inspection, **we've ended up with some odd new columns: `uid.x`, `uid.y`, `eventID.x` and `eventID.y`!**
That's because **those columns are present in both our tables but we are not joining on them**. They are retained and each suffixed with `.x` and `.y` by default, to keep their names unique.
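A quick way to list which columns picked up a suffix (a sketch) is to select on the suffixes themselves:
```{r}
# list every column that was suffixed with .x or .y during the join
individual %>%
  dplyr::left_join(maptag, by = "individualID") %>%
  dplyr::select(dplyr::ends_with(".x"), dplyr::ends_with(".y")) %>%
  names()
```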
So, what about these duplicated columns? Do we need them?
### Dropping unnecessary columns
With respect to `eventID`, **we're not really interested in the mapping `eventID`s so we can just drop that column from `maptag`.**
```{r}
maptag <- select(maptag, -eventID)
individual %>%
  dplyr::left_join(maptag,
    by = "individualID"
  ) %>%
  assertr::assert(assertr::not_na, stemDistance, stemAzimuth, pointID)
```
### Retaining useful metadata
On the other hand, **`uid` contains unique identifiers for each observation in their respective tables and could be useful metadata to store**, enabling us to trace the provenance of individual values back to the original data. So rather than remove them, let's retain both `uid` columns, one from each table.
**We can give them more informative suffixes using argument `suffix`**. In our case, I want the `individual` column to stay as `uid` and the `maptag` column to get the suffix `_map`.
```{r}
individual %>%
  dplyr::left_join(maptag,
    by = "individualID",
    suffix = c("", "_map")
  ) %>%
  assertr::assert(assertr::not_na, stemDistance, stemAzimuth, pointID)
```
Nice!
## Join `vst_perplotperyear.csv`
Now let's carry on and **join the perplot data**.
First let's read it in and have a look at the data.
```{r}
#| message: false
perplot <- readr::read_csv(
  fs::path(raw_data_path, "vst_perplotperyear.csv"),
  show_col_types = FALSE
)
perplot
```
Similarly to `maptag`, **we want to exclude `eventID` and suffix the `uid` column**. This time, however, we will be joining by `plotID`.
Let's also **move our validation test to the end of the pipeline and add the `perplot` columns we want to check**, i.e. `decimalLatitude`, `decimalLongitude` and `plotID`.
```{r}
perplot <- perplot %>% select(-eventID)
individual %>%
  dplyr::left_join(maptag,
    by = "individualID",
    suffix = c("", "_map")
  ) %>%
  dplyr::left_join(perplot,
    by = c("plotID"),
    suffix = c("", "_ppl")
  ) %>%
  assertr::assert(
    assertr::not_na, decimalLatitude,
    decimalLongitude, plotID, stemDistance, stemAzimuth, pointID
  )
```
Awesome!! It's worked! :tada:
## Using the assignment pipe
Now that we are happy with our data, we can use a new operator, the **assignment pipe (`%<>%`)** from the **magrittr** package.
**This allows us to both pipe an object forward into an expression and also update it with the resulting value.**
::: callout-caution
Note that this operator can have unexpected results if you run the same code multiple times in an interactive session. Use it in scripts that will be run from top to bottom only once.
:::
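To see what the operator does, here's a minimal illustration (assuming the **magrittr** package, which provides `%<>%`, is loaded):
```{r}
library(magrittr)
x <- c(1, 4, 9)
x %<>% sqrt() # equivalent to: x <- x %>% sqrt()
x
```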
```{r}
individual %<>%
  dplyr::left_join(maptag,
    by = "individualID",
    suffix = c("", "_map")
  ) %>%
  dplyr::left_join(perplot,
    by = c("plotID"),
    suffix = c("", "_ppl")
  ) %>%
  assertr::assert(
    assertr::not_na, decimalLatitude,
    decimalLongitude, plotID, stemDistance, stemAzimuth, pointID
  )
```
```{r}
#| echo: false
saveRDS(individual, here::here("data-raw", "individual_joined.rds"))
```
## Update `individual.R`
Let's **update our `individual.R` script again with the additional code we've just written for reading in and merging the `maptag` and `perplot` data**.
Add the following code and comments to `individual.R`:
```{r}
#| eval: false
# Combine NEON data tables ----
# read in additional tables
maptag <- readr::read_csv(
  fs::path(
    raw_data_path,
    "vst_mappingandtagging.csv"
  ),
  show_col_types = FALSE
) %>%
  select(-eventID)

perplot <- readr::read_csv(
  fs::path(
    raw_data_path,
    "vst_perplotperyear.csv"
  ),
  show_col_types = FALSE
) %>%
  select(-eventID)

# Left join tables to individual
individual %<>%
  left_join(maptag,
    by = "individualID",
    suffix = c("", "_map")
  ) %>%
  left_join(perplot,
    by = "plotID",
    suffix = c("", "_ppl")
  ) %>%
  assertr::assert(
    assertr::not_na, stemDistance, stemAzimuth, pointID,
    decimalLongitude, decimalLatitude, plotID
  )
```
***
We can now move on to geolocate our individuals!
```{r, echo=FALSE}
knitr::include_url("assets/cheatsheets/data-transformation.pdf")
```