-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcdm_reference.qmd
199 lines (144 loc) · 6.66 KB
/
cdm_reference.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# Creating a cdm reference {#sec-cdm_reference}
## The OMOP CDM layout
The OMOP CDM standardises the structure of health care data. Data is stored across a system of tables with established relationships between them. In other words, the OMOP CDM provides a relational database structure, with version 5.4 of the OMOP CDM shown below. {.lightbox}
## Creating a reference to the OMOP common data model
As we saw in @sec-dbplyr_packages, creating a data model in R to represent the OMOP CDM can provide a basis for analytic pipleines using the data. Luckily for us we won't have to create functions and methods for this ourselves. Instead we will use the `omopgenerics` package which defines a data model for OMOP CDM data and the `CDMConnector` package which provides functions for connecting to a OMOP CDM data held in a database.
To see how this works we will use the `omock` to create example data in the format of the OMOP CDM, which we then copy to a duckdb database.
```{r, message=FALSE, warning=FALSE}
library(DBI)
library(here)
library(dplyr)
library(omock)
library(omopgenerics)
library(CDMConnector)
cdm_local <- omock::mockCdmReference() |>
omock::mockPerson(nPerson = 100) |>
omock::mockObservationPeriod() |>
omock::mockConditionOccurrence() |>
omock::mockDrugExposure() |>
omock::mockObservation() |>
omock::mockMeasurement() |>
omock::mockVisitOccurrence() |>
omock::mockProcedureOccurrence()
db <- DBI::dbConnect(duckdb::duckdb())
CDMConnector::copyCdmTo(con = db,
cdm = cdm_local,
schema ="main",
overwrite = TRUE)
```
Now we have OMOP CDM data in a database we can use `cdmFromCon()` to create our cdm reference. Note that as well as specifying the schema containing our OMOP CDM tables, we will also specify a write schema where any database tables we create during our analysis will be stored (often our OMOP CDM tables will be in a schema that we only have read-access to and we'll have another schema where we can have write-access where intermediate tables can be created for a given a study).
```{r, message=FALSE, warning=FALSE, echo = TRUE}
cdm <- cdmFromCon(db,
cdmSchema = "main",
writeSchema = "main",
cdmName = "example_data")
```
```{r}
cdm
```
::: {.callout-tip collapse="true"}
## Setting a write prefix
We can also specify a write prefix and this will be used whenever permanent tables are created in the write schema. This can be useful when we're sharing our write schema with others and want to avoid table name conflicts and easily drop tables created as part of a particular study.
```{r, message=FALSE, warning=FALSE, echo = TRUE, eval = FALSE}
cdm <- cdmFromCon(con = db,
cdmSchema = "main",
writeSchema = "main",
writePrefix = "my_study_",
cdmName = "example_data")
```
:::
We can see that we now have an object that contains references to all the OMOP CDM tables. We can reference specific tables using the "\$" or "\[\[ ... \]\]" operators.
```{r, echo = TRUE}
cdm$person
cdm[["observation_period"]]
```
## CDM attributes
### CDM name
Our cdm reference will be associated with a name. By default this name will be taken from the cdm source name field from the cdm source table.
```{r, message=FALSE, warning=FALSE, echo = TRUE}
cdm <- cdmFromCon(db,
cdmSchema = "main",
writeSchema = "main")
cdm$cdm_source
cdmName(cdm)
cdmName(cdm$person)
```
However, we can instead set this name when creating our cdm reference.
```{r, message=FALSE, warning=FALSE, echo = TRUE}
cdm <- cdmFromCon(db,
cdmSchema = "main",
writeSchema = "main",
cdmName = "my_cdm")
cdmName(cdm)
cdmName(cdm$person)
```
::: {.callout-tip collapse="true"}
## Behind the scenes
The cdm reference we have has a class of cdm_reference, while each of the tables have
```{r, message=FALSE, warning=FALSE}
class(cdm)
class(cdm$person)
```
We can see that cdmName() is a generic function, which works for both the cdm reference as a whole and individual tables.
```{r, message=FALSE, warning=FALSE}
library(sloop)
s3_dispatch(cdmName(cdm))
s3_dispatch(cdmName(cdm$person))
```
:::
### CDM version
We can also easily check the OMOP CDM version that is being used
```{r, echo = TRUE}
cdmVersion(cdm)
```
### CDM Source
Although typically we won't need to use them for writing study code, we can also access lower-level information on the source, such as the database connection.
```{r, echo = TRUE}
attr(cdmSource(cdm), "dbcon")
```
## Mutability of the cdm reference
An important characteristic of our cdm reference is that we can alter the tables in R, but the OMOP CDM data will not be affected.
For example, let's say we want to perform a study with only people born in 1970. For this we could filter our person table to only people born in this year.
```{r, echo = TRUE}
cdm$person <- cdm$person |>
filter(year_of_birth == 1970)
cdm$person
```
From now on, when we work with our cdm reference this restriction will continue to have been applied.
```{r, echo = TRUE}
cdm$person |>
tally()
```
The original OMOP CDM data itself however will remain unaffected. And we can see if we create our reference again that the underlying data is unchanged.
```{r, message=FALSE, warning=FALSE, echo = TRUE}
cdm <- cdmFromCon(con = db,
cdmSchema = "main",
writeSchema = "main",
cdmName = "Synthea Covid-19 data")
cdm$person |>
tally()
```
The mutability of our cdm reference is a useful feature for studies as it means we can easily tweak our OMOP CDM data if needed. Meanwhile, leaving the underlying data unchanged is essential so that other study code can run against the data unaffected by any of our changes.
One thing we can't do though is alter the structure of OMOP CDM tables. For example this code would cause an error as the person table must have the column person_id.
```{r, error=TRUE}
cdm$person <- cdm$person |>
rename("new_id" = "person_id")
```
In such a case we would have to call the table something else.
```{r, eval=FALSE}
cdm$person_new <- cdm$person |>
rename("new_id" = "person_id") |>
compute()
```
Now we would have this new table as an additional table in our cdm reference, knowing it was not in the format of one of the core OMOP CDM tables.
```{r, echo = TRUE}
cdm
```
# Disconnecting
Once we have finished our analysis we can close our connection to the database behind our cdm reference like so.
```{r}
cdmDisconnect(cdm)
```
# Further reading
- [omopgenerics package](https://darwin-eu.github.io/omopgenerics)
- [CDMConnector package](https://darwin-eu.github.io/CDMConnector)