-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathexploring_the_cdm.qmd
210 lines (173 loc) · 6.82 KB
/
exploring_the_cdm.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
## Exploring the OMOP CDM
For this chapter, we'll use a synthetic Covid-19 dataset.
```{r, warning=FALSE, message=FALSE}
library(DBI)
library(dbplyr)
library(dplyr)
library(here)
library(CDMConnector)
library(ggplot2)
library(clock)
```
```{r, warning=FALSE, message=FALSE}
db<-dbConnect(duckdb::duckdb(),
dbdir = eunomiaDir(datasetName = "synthea-covid19-10k"))
cdm <- cdmFromCon(db, cdmSchema = "main", writeSchema = "main")
```
```{r, echo = TRUE}
cdm
```
## Counting people
The OMOP CDM is person-centric, with the person table containing records to uniquely identify each person in the database. As each row refers to a unique person, we can quickly get a count of the number of individuals in the database like so
```{r, echo = TRUE}
cdm$person |>
count()
```
The person table also contains some demographic information, including a gender concept for each person. We can get a count grouped by this variable, but as this uses a concept we'll also need to join to the concept table to get the corresponding concept name for each concept id.
```{r, echo = TRUE}
cdm$person |>
group_by(gender_concept_id) |>
count() |>
left_join(cdm$concept,
by=c("gender_concept_id" = "concept_id")) |>
select("gender_concept_id", "concept_name", "n") |>
collect()
```
::: {.callout-tip collapse="true"}
## Vocabulary tables
Above we've got counts by specific concept IDs recorded in the condition occurrence table. What these IDs represent is described in the concept table. Here we have the name associated with the concept, along with other information such as it's domain and vocabulary id.
```{r, echo = TRUE}
cdm$concept |>
glimpse()
```
Other vocabulary tables capture other information about concepts, such as the direct relationships between concepts (the concept relationship table) and hierarchical relationships between (the concept ancestor table).
```{r, echo = TRUE}
cdm$concept_relationship |>
glimpse()
cdm$concept_ancestor |>
glimpse()
```
More information on the vocabulary tables (as well as other tables in the OMOP CDM version 5.3) can be found at <https://ohdsi.github.io/CommonDataModel/cdm53.html#Vocabulary_Tables>.
:::
## Summarising observation periods
The observation period table contains records indicating spans of time over which clinical events can be reliably observed for the people in the person table. Someone can potentially have multiple observation periods. So say we wanted a count of people grouped by the year during which their first observation period started. We could do this like so:
```{r, echo = TRUE}
first_observation_period <- cdm$observation_period |>
group_by(person_id) |>
filter(row_number() == 1) |>
compute()
cdm$person |>
left_join(first_observation_period,
by = "person_id") |>
mutate(observation_period_start_year = get_year(observation_period_start_date)) |>
group_by(observation_period_start_year) |>
count() |>
collect() |>
ggplot() +
geom_col(aes(observation_period_start_year, n)) +
theme_bw()
```
## Summarising clinical records
What's the number of condition occurrence records per person in the database? We can find this out like so
```{r, echo = TRUE}
cdm$person |>
left_join(cdm$condition_occurrence |>
group_by(person_id) |>
count(name = "condition_occurrence_records"),
by="person_id") |>
mutate(condition_occurrence_records = if_else(
is.na(condition_occurrence_records), 0,
condition_occurrence_records)) |>
group_by(condition_occurrence_records) |>
count() |>
collect() |>
ggplot() +
geom_col(aes(condition_occurrence_records, n)) +
theme_bw()
```
How about we were interested in getting record counts for some specific concepts related to Covid-19 symptoms?
```{r, echo = TRUE}
cdm$condition_occurrence |>
filter(condition_concept_id %in% c(437663,437390,31967,
4289517,4223659, 312437,
434490,254761,77074)) |>
group_by(condition_concept_id) |>
count() |>
left_join(cdm$concept,
by=c("condition_concept_id" = "concept_id")) |>
collect() |>
ggplot() +
geom_col(aes(concept_name, n)) +
theme_bw()+
xlab("")
```
We can also use summarise for various other calculations
```{r, echo = TRUE}
cdm$person |>
summarise(min_year_of_birth = min(year_of_birth, na.rm=TRUE),
q05_year_of_birth = quantile(year_of_birth, 0.05, na.rm=TRUE),
mean_year_of_birth = round(mean(year_of_birth, na.rm=TRUE),0),
median_year_of_birth = median(year_of_birth, na.rm=TRUE),
q95_year_of_birth = quantile(year_of_birth, 0.95, na.rm=TRUE),
max_year_of_birth = max(year_of_birth, na.rm=TRUE)) |>
glimpse()
```
As we've seen before, we can also quickly get results for various groupings or restrictions
```{r, echo = TRUE}
grouped_summary <- cdm$person |>
group_by(gender_concept_id) |>
summarise(min_year_of_birth = min(year_of_birth, na.rm=TRUE),
q25_year_of_birth = quantile(year_of_birth, 0.25, na.rm=TRUE),
median_year_of_birth = median(year_of_birth, na.rm=TRUE),
q75_year_of_birth = quantile(year_of_birth, 0.75, na.rm=TRUE),
max_year_of_birth = max(year_of_birth, na.rm=TRUE)) |>
left_join(cdm$concept,
by=c("gender_concept_id" = "concept_id")) |>
collect()
grouped_summary |>
ggplot(aes(x = concept_name, group = concept_name,
fill = concept_name)) +
geom_boxplot(aes(
lower = q25_year_of_birth,
upper = q75_year_of_birth,
middle = median_year_of_birth,
ymin = min_year_of_birth,
ymax = max_year_of_birth),
stat = "identity", width = 0.5) +
theme_bw()+
theme(legend.position = "none") +
xlab("")
```
What's the number of condition occurrence records per person in the database? We can find this out like so
```{r, echo = TRUE}
cdm$person |>
left_join(cdm$condition_occurrence |>
group_by(person_id) |>
count(name = "condition_occurrence_records"),
by="person_id") |>
mutate(condition_occurrence_records = if_else(
is.na(condition_occurrence_records), 0,
condition_occurrence_records)) |>
group_by(condition_occurrence_records) |>
count() |>
collect() |>
ggplot() +
geom_col(aes(condition_occurrence_records, n)) +
theme_bw()
```
How about we were interested in getting record counts for some specific concepts related to Covid-19 symptoms?
```{r, echo = TRUE}
cdm$condition_occurrence |>
filter(condition_concept_id %in% c(437663,437390,31967,
4289517,4223659, 312437,
434490,254761,77074)) |>
group_by(condition_concept_id) |>
count() |>
left_join(cdm$concept,
by=c("condition_concept_id" = "concept_id")) |>
collect() |>
ggplot() +
geom_col(aes(concept_name, n)) +
theme_bw()+
xlab("")
```