Commit 5654a48

Merge pull request #21 from oxford-pharmacoepi/revision_ch1
Revision chapter 1
2 parents 2bbf5b4 + ddf50b9 commit 5654a48

1 file changed

+20
-18
lines changed

working_with_databases_from_r.qmd

+20-18
@@ -2,16 +2,17 @@
22

33
![](images/lter_penguins.png){width="250"}
44

5-
*Artwork by \@allison_horst*
5+
*Artwork by [\@allison_horst](https://x.com/allison_horst)*
66

7-
Before we start thinking about working with health care data spread across a database using the OMOP common data model, let's first do a quick data analysis from R using a simpler dataset held in a database to quickly understand the general approach. For this we'll use data from [palmerpenguins package](https://allisonhorst.github.io/palmerpenguins/), which contains data on penguins collected from the [Palmer Station](https://en.wikipedia.org/wiki/Palmer_Station) in Antarctica.
7+
Before we start thinking about working with healthcare data spread across a database using the OMOP common data model, let's first do a quick data analysis with R using a simpler dataset held in a database to understand the general approach. For this we'll use data from the [palmerpenguins package](https://allisonhorst.github.io/palmerpenguins/), which contains data on penguins collected from the [Palmer Station](https://en.wikipedia.org/wiki/Palmer_Station) in Antarctica.
88

99
## Getting set up
1010

11-
Assuming that you have R and RStudio already set up, first we need to install a few packages not included in base R if we don´t already have them.
11+
Assuming that you have R and RStudio already set up, first we need to install a few packages not included in base R if we don't already have them.
1212

1313
```{r, eval=FALSE}
1414
install.packages("dplyr")
15+
install.packages("dbplyr")
1516
install.packages("ggplot2")
1617
install.packages("DBI")
1718
install.packages("duckdb")
@@ -22,6 +23,7 @@ Once installed, we can load them like so.
2223

2324
```{r, message=FALSE, warning=FALSE}
2425
library(dplyr)
26+
library(dbplyr)
2527
library(ggplot2)
2628
library(DBI)
2729
library(duckdb)
@@ -30,34 +32,34 @@ library(palmerpenguins)
3032

3133
## Taking a peek at the data
3234

33-
We can get an overview of the data using the `glimpse()` command.
35+
The package `palmerpenguins` contains two datasets, one of them called `penguins`, which we will use in this chapter. We can get an overview of the data using the `glimpse()` command.
3436

3537
```{r}
3638
glimpse(penguins)
3739
```
3840

39-
Or we could take a look at the first rows of the data using `head()`
41+
Or we could take a look at the first rows of the data using `head()`:
4042

4143
```{r}
4244
head(penguins, 5)
4345
```
4446

4547
## Inserting data into a database
4648

47-
Let's put our penguins data into a [duckdb database](https://duckdb.org/). We create the database, add the penguins data, and then create a reference to the table containing the data.
49+
Let's put our penguins data into a [duckdb database](https://duckdb.org/). We need to first create the database and then add the penguins data to it.
4850

4951
```{r}
5052
db <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
5153
dbWriteTable(db, "penguins", penguins)
5254
```
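As a quick sanity check on the write step, the sketch below confirms that the table exists and counts its rows. It is only an illustration: the connection name `con` is hypothetical (chosen so it does not clash with `db`), and `overwrite = TRUE` is an optional argument not used in the chapter.

```{r}
library(DBI)
library(palmerpenguins)

# Hypothetical connection name `con`, separate from the chapter's `db`
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# overwrite = TRUE replaces the table if it already exists,
# so the chunk can be re-run without an error
dbWriteTable(con, "penguins", penguins, overwrite = TRUE)

# Check that the table is there and count its rows with plain SQL
table_exists <- dbExistsTable(con, "penguins")
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM penguins")$n

# shutdown = TRUE also releases the in-process database
dbDisconnect(con, shutdown = TRUE)
```

The row count in the database should match `nrow(penguins)` for the local data frame.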
5355

54-
We can see that our database now has one table
56+
We can see that our database now has one table:
5557

5658
```{r}
5759
DBI::dbListTables(db)
5860
```
5961

60-
And now that the data is in a database we could use SQL to get the first rows that we saw before
62+
And now that the data is in a database we could use SQL to get the first rows that we saw before.
6163

6264
```{r}
6365
dbGetQuery(db, "SELECT * FROM penguins LIMIT 5")
@@ -66,14 +68,14 @@ dbGetQuery(db, "SELECT * FROM penguins LIMIT 5")
6668
::: {.callout-tip collapse="true"}
6769
## Connecting to databases from R
6870

69-
Database connections from R can be made using the [DBI package](https://dbi.r-dbi.org/). The back-end for `DBI` is facilitated by database specific driver packages. Above we we created a new, empty, in-process [duckdb](https://duckdb.org/) database which we then added database. But we could have instead connected to an existing duckdb database. This could, for example, look like
71+
Database connections from R can be made using the [DBI package](https://dbi.r-dbi.org/). The back-end for `DBI` is facilitated by database-specific driver packages. In the code snippets above we created a new, empty, in-process [duckdb](https://duckdb.org/) database to which we then added our dataset. But we could have instead connected to an existing duckdb database. This could, for example, look like:
7072

7173
```{r, eval = FALSE}
7274
db <- dbConnect(duckdb::duckdb(),
7375
dbdir = here("my_duckdb_database.duckdb"))
7476
```
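To make the idea of a file-backed database concrete, here is a minimal sketch that creates a duckdb file in a temporary directory, closes it, and then reconnects to it. The path `db_path` and the connection names are hypothetical, and opening the second connection with `read_only = TRUE` is an optional choice that is not part of the chapter's own example.

```{r}
library(DBI)
library(palmerpenguins)

# Hypothetical file location in a temporary directory
db_path <- file.path(tempdir(), "example_penguins.duckdb")

# The first connection creates the file and writes a table into it
con <- dbConnect(duckdb::duckdb(), dbdir = db_path)
dbWriteTable(con, "penguins", penguins, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)

# Reconnect to the existing file; read_only = TRUE guards against
# accidental writes while we only want to query
con_ro <- dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = TRUE)
tables <- dbListTables(con_ro)
dbDisconnect(con_ro, shutdown = TRUE)
```

Unlike the `:memory:` database, the table written here persists in the file between connections.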
7577

76-
In this book for simplicity we will mostly be working with in-process duckdb databases with synthetic data. However, when analysing real patient data we will be more often working with client-server databases, where we are connecting from our computer to a central server with the database or working with data held in the cloud. The approaches shown throughout this book will work in the same way for these other types of database management systems, but the way to connect to the database will be different (although still using DBI). In general, creating connections are supported by associated back-end packages. For example a connection to a Postgres database would use the RPostgres R package and look something like:
78+
In this book, for simplicity, we will mostly be working with in-process duckdb databases with synthetic data. However, when analysing real patient data we will more often be working with client-server databases, where we are connecting from our computer to a central server with the database, or working with data held in the cloud. The approaches shown throughout this book will work in the same way for these other types of database management systems, but the way to connect to the database will be different (although still using DBI). In general, creating connections is supported by associated back-end packages. For example, a connection to a Postgres database would use the RPostgres R package and look something like:
7779

7880
```{r, eval=FALSE}
7981
db <- DBI::dbConnect(RPostgres::Postgres(),
@@ -86,7 +88,7 @@ db <- DBI::dbConnect(RPostgres::Postgres(),
8688

8789
## Translation from R to SQL
8890

89-
Instead of using SQL, we could instead use the same R code as before. Now it will query the data held in a database. To do this, first we create a reference to the table in the database.
91+
Instead of using SQL to query our database, we might want to use the same R code as before. However, rather than working with the local dataset, we now need it to query the data held in the database. To do this, we first create a reference to the table in the database like so:
9092

9193
```{r}
9294
penguins_db <- tbl(db, "penguins")
@@ -99,7 +101,7 @@ Once we have this reference, we can then use it with familiar looking R code.
99101
head(penguins_db, 5)
100102
```
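One point worth illustrating is that a table reference is lazy: `dplyr` verbs applied to it build up a query, and rows are only brought into R when we ask for them. A minimal sketch, assuming a fresh in-memory connection with the hypothetical names `con` and `penguins_ref`, and using `collect()` to retrieve the result:

```{r}
library(dplyr)
library(dbplyr)
library(DBI)
library(palmerpenguins)

# Hypothetical names `con` and `penguins_ref`, separate from the chapter's objects
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con, "penguins", penguins)
penguins_ref <- tbl(con, "penguins")

# This only describes the query; no rows are pulled into R yet
adelie_query <- penguins_ref |>
  filter(species == "Adelie") |>
  select(species, island, body_mass_g)

# collect() runs the query in the database and returns a local tibble
adelie_local <- collect(adelie_query)

dbDisconnect(con, shutdown = TRUE)
```

Keeping the work in the database until `collect()` matters little for a dataset of this size, but it is what lets the same code scale to tables that are too large to load into memory.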
101103

102-
The magic here is provided by the `dbplyr` which takes the R code and converts it into SQL, which in this case looks like the SQL we could have written instead.
104+
The magic here is provided by the `dbplyr` package, which takes the R code and converts it into SQL. In this case the query looks like the SQL we wrote directly before.
103105

104106
```{r}
105107
head(penguins_db, 5) |>
@@ -108,7 +110,7 @@ head(penguins_db, 5) |>
108110

109111
## Example analysis
110112

111-
More complicated SQL can also be generated by using familiar `dplyr` code. For example, we could get a summary of bill length by species like so
113+
More complicated SQL can also be generated by using familiar `dplyr` code. For example, we could get a summary of bill length by species like so:
112114

113115
```{r, warning=FALSE}
114116
penguins_db |>
@@ -131,7 +133,7 @@ penguins_db |>
131133
)
132134
```
133135

134-
The benefit of using dbplyr now becomes quite clear if we take a look at the corresponding SQL that is generated for us.
136+
The benefit of using `dbplyr` now becomes quite clear if we take a look at the corresponding SQL that is generated for us:
135137

136138
```{r, warning=FALSE}
137139
penguins_db |>
@@ -151,9 +153,9 @@ penguins_db |>
151153
show_query()
152154
```
153155

154-
Instead of having to write this somewhat complex SQL specific to duckdb we can use the friendlier dplyr syntax that may well be more familiar if coming from an R programming background.
156+
Instead of having to write this somewhat complex SQL specific to `duckdb` we can use the friendlier `dplyr` syntax that may well be more familiar if coming from an R programming background.
155157

156-
Now suppose we are particularly interested in the body mass variable. We can first notice that there are a couple of missing records for this.
158+
Not having to worry about the SQL translation behind our queries allows us to interrogate the database in a simple way even for more complex questions. For instance, suppose now that we are particularly interested in the body mass variable. We can first notice that there are a couple of missing records for this.
157159

158160
```{r}
159161
penguins_db |>
@@ -217,11 +219,11 @@ penguins |>
217219
theme(legend.position = "none")
218220
```
219221

220-
As well as having an example of working with data in database from R, you also have an example of [Simpson´s paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)!
222+
As well as having an example of working with data in a database from R, you also have an example of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)!
221223

222224
## Disconnecting from the database
223225

224-
And now we've reached the end of this example, we can close our connection to the database.
226+
Now that we've reached the end of this example, we can close our connection to the database using the `DBI` package.
225227

226228
```{r}
227229
dbDisconnect(db)
