# A first analysis using data in a database {#sec-databases_and_r}
*Artwork by [\@allison_horst](https://x.com/allison_horst)*
Before we start thinking about working with healthcare data spread across a database using the OMOP common data model, let's first run a quick data analysis with R on a simpler dataset held in a database, so as to understand the general approach. For this we'll use data from the [palmerpenguins package](https://allisonhorst.github.io/palmerpenguins/), which contains data on penguins collected at [Palmer Station](https://en.wikipedia.org/wiki/Palmer_Station) in Antarctica.
## Getting set up
Assuming you already have R and RStudio set up, we first need to install a few packages not included in base R (if we don't have them already).
```{r, eval=FALSE}
install.packages("dplyr")
install.packages("dbplyr")
install.packages("ggplot2")
install.packages("DBI")
install.packages("duckdb")
install.packages("palmerpenguins")
```
Once installed, we can load them like so.
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(dbplyr)
library(ggplot2)
library(DBI)
library(duckdb)
library(palmerpenguins)
```
## Taking a peek at the data
The package `palmerpenguins` contains two datasets, one of them called `penguins`, which we will use in this chapter. We can get an overview of the data using the `glimpse()` command.
```{r}
glimpse(penguins)
```
Or we could take a look at the first rows of the data using `head()`:
```{r}
head(penguins, 5)
```
## Inserting data into a database
Let's put our penguins data into a [duckdb database](https://duckdb.org/). We need to first create the database and then add the penguins data to it.
```{r}
db <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(db, "penguins", penguins)
```
We can see that our database now has one table:
```{r}
DBI::dbListTables(db)
```
And now that the data is in a database we could use SQL to get the first rows that we saw before.
```{r}
dbGetQuery(db, "SELECT * FROM penguins LIMIT 5")
```
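Any SQL that duckdb supports can be run this way. As a small sketch (assuming the `db` connection and `penguins` table created above), we could count the penguins of each species directly in SQL:

```{r}
# Count rows per species, running the aggregation inside the database
species_counts <- dbGetQuery(
  db,
  "SELECT species, COUNT(*) AS n
   FROM penguins
   GROUP BY species
   ORDER BY n DESC"
)
species_counts
```

Later in the chapter we will get this kind of summary without writing any SQL ourselves.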
::: {.callout-tip collapse="true"}
## Connecting to databases from R
Database connections from R can be made using the [DBI package](https://dbi.r-dbi.org/). The back-end for `DBI` is provided by database-specific driver packages. In the code snippets above we created a new, empty, in-process [duckdb](https://duckdb.org/) database and then added our dataset to it. But we could instead have connected to an existing duckdb database. This could, for example, look like:
```{r, eval = FALSE}
# here::here() (from the here package) builds a path relative to the project root
db <- dbConnect(duckdb::duckdb(),
                dbdir = here::here("my_duckdb_database.duckdb"))
```
In this book, for simplicity, we will mostly work with in-process duckdb databases containing synthetic data. When analysing real patient data, however, we will more often be working with client-server databases, connecting from our computer to a central server, or with data held in the cloud. The approaches shown throughout this book work in the same way for these other database management systems, but the way we connect to the database will differ (although still via `DBI`). In general, creating connections is supported by the associated back-end packages. For example, a connection to a Postgres database would use the RPostgres R package and look something like:
```{r, eval=FALSE}
db <- DBI::dbConnect(RPostgres::Postgres(),
dbname = Sys.getenv("CDM5_POSTGRESQL_DBNAME"),
host = Sys.getenv("CDM5_POSTGRESQL_HOST"),
user = Sys.getenv("CDM5_POSTGRESQL_USER"),
password = Sys.getenv("CDM5_POSTGRESQL_PASSWORD"))
```
:::
## Translation from R to SQL
Instead of using SQL to query our database, we might prefer to write the same R code as before. However, rather than working with the local dataset, our code now needs to query the data held in the database. To do this, we first create a reference to the table in the database:
```{r}
penguins_db <- tbl(db, "penguins")
penguins_db
```
Once we have this reference, we can then use it with familiar looking R code.
```{r}
head(penguins_db, 5)
```
The magic here is provided by the `dbplyr` package, which takes the R code and converts it into SQL. In this case the query looks like the SQL we wrote directly before.
```{r}
head(penguins_db, 5) |>
show_query()
```
## Example analysis
More complicated SQL can also be generated by using familiar `dplyr` code. For example, we could get a summary of bill length by species like so:
```{r, warning=FALSE}
penguins_db |>
group_by(species) |>
summarise(
n = n(),
min_bill_length_mm = min(bill_length_mm),
mean_bill_length_mm = mean(bill_length_mm),
max_bill_length_mm = max(bill_length_mm)
) |>
mutate(min_max_bill_length_mm = paste0(
min_bill_length_mm,
" to ",
max_bill_length_mm
)) |>
select(
"species",
"mean_bill_length_mm",
"min_max_bill_length_mm"
)
```
The benefit of using `dbplyr` now becomes quite clear if we take a look at the corresponding SQL that is generated for us:
```{r, warning=FALSE}
penguins_db |>
group_by(species) |>
summarise(
n = n(),
min_bill_length_mm = min(bill_length_mm),
mean_bill_length_mm = mean(bill_length_mm),
max_bill_length_mm = max(bill_length_mm)
) |>
  mutate(min_max_bill_length_mm = paste0(
    min_bill_length_mm, " to ", max_bill_length_mm
  )) |>
select(
"species",
"mean_bill_length_mm",
"min_max_bill_length_mm"
) |>
show_query()
```
Instead of having to write this somewhat complex, `duckdb`-specific SQL ourselves, we can use the friendlier `dplyr` syntax, which will likely be more familiar to those coming from an R programming background.
Not having to worry about the SQL translation behind our queries allows us to interrogate the database in a simple way even for more complex questions. For instance, suppose now that we are particularly interested in the body mass variable. We can first notice that there are a couple of missing records for this.
```{r}
penguins_db |>
mutate(missing_body_mass_g = if_else(
is.na(body_mass_g), 1, 0
)) |>
group_by(species, missing_body_mass_g) |>
tally()
```
We can get the mean for each of the species (dropping those two missing records).
```{r, warning=FALSE, message=FALSE}
penguins_db |>
group_by(species) |>
summarise(mean_body_mass_g = round(mean(body_mass_g, na.rm = TRUE)))
```
We could also make a histogram of values for each of the species. Here we would collect our data back into R before creating our plot.
```{r, warning=FALSE, message=FALSE}
penguins_db |>
select("species", "body_mass_g") |>
collect() |>
ggplot(aes(group = species, fill = species)) +
facet_grid(species ~ .) +
geom_histogram(aes(body_mass_g), colour = "black", binwidth = 100) +
xlab("Body mass (g)") +
theme_bw() +
theme(legend.position = "none")
```
Now let's look at the relationship between body mass and bill depth.
```{r, warning=FALSE, message=FALSE}
penguins_db |>
select("species", "body_mass_g", "bill_depth_mm") |>
collect() |>
ggplot(aes(x = bill_depth_mm, y = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Bill depth (mm)") +
ylab("Body mass (g)") +
theme_bw() +
theme(legend.position = "none")
```
Here we see a negative correlation between body mass and bill depth, which seems rather unexpected. But what if we stratify the analysis by species?
```{r, warning=FALSE, message=FALSE}
penguins_db |>
select("species", "body_mass_g", "bill_depth_mm") |>
collect() |>
ggplot(aes(x = bill_depth_mm, y = body_mass_g)) +
facet_grid(species ~ .) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Bill depth (mm)") +
ylab("Body mass (g)") +
theme_bw() +
theme(legend.position = "none")
```
As well as an example of working with data in a database from R, you now also have an example of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)!
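We can also see the paradox in numbers rather than just in the plots. As a quick sketch using the local `penguins` data (dropping missing values), the overall correlation between bill depth and body mass is negative, while the correlation within each species is positive:

```{r}
# Overall correlation across all penguins: negative
overall_r <- cor(penguins$bill_depth_mm, penguins$body_mass_g,
                 use = "complete.obs")

# Correlation computed separately within each species: positive
by_species_r <- penguins |>
  group_by(species) |>
  summarise(r = cor(bill_depth_mm, body_mass_g, use = "complete.obs"))

overall_r
by_species_r
```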
## Disconnecting from the database
Now that we've reached the end of this example, we can close our connection to the database using the `DBI` package.
```{r}
dbDisconnect(db)
```
## Further reading
- [R for Data Science: Databases](https://r4ds.hadley.nz/databases)
- [Writing SQL with dbplyr](https://dbplyr.tidyverse.org/articles/sql.html)
- [Data Carpentry: SQL databases and R](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)