---
title: "Doubly Robust Estimator: simulation activity in R"
---
# Computational Simulation {#sim}
- *Goal*: estimate a causal effect when we do not have data from a randomized experiment
- *Strategy 1*: re-weight each observation by its probability of receiving the treatment so that the data approximate a randomized experiment
- *Strategy 2*: model the outcome directly with a linear regression
- *Combining Strategies 1 and 2*: form the doubly robust estimator
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, tidy = TRUE, cache = FALSE)
```
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE,
        fig.align = "center")
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(cowplot)
library(AIPW)
library(SuperLearner)
```
# Setting up data
We generate $X_1$ and $X_2$ as confounders that affect both the treatment assignment and the outcome, and define a true ACE of 10.
```{r setting-up}
simu_observational_data <- function(simu_id = 1, n_obs) {
  X_1 <- rnorm(n_obs)
  X_2 <- rnorm(n_obs)
  XB <- 0.2*X_1 + 0.2*X_2
  # Treatment probability is the inverse logit of XB, so X_1 and X_2 confound A and Y
  prob_A <- exp(XB) / (1 + exp(XB))
  A <- rbinom(n_obs, 1, prob_A)
  # The true causal effect of treatment A is 10
  Y <- 100 + X_1 + 2*X_2 + 10*A + rnorm(n_obs)
  # Summarize the variables into a data frame
  data.frame(simu_id = simu_id, n_obs = n_obs,
             var_1 = X_1, var_2 = X_2,
             treatment = A, outcome = Y)
}
```
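As a quick sanity check (a sketch of ours, not part of the original pipeline; the dataset `check` is hypothetical), we can simulate one large dataset and compare the naive difference in means to the true ACE. Because $X_1$ and $X_2$ increase both the probability of treatment and the outcome, the naive estimate should land noticeably above 10.
```{r}
# Illustrative sketch: the naive difference in means is confounded upward
check <- simu_observational_data(simu_id = 0, n_obs = 10000)
mean(check$outcome[check$treatment == 1]) -
  mean(check$outcome[check$treatment == 0])  # should sit above the true ACE of 10
```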
```{r}
n_simu <- 150
n_obs <- 200
set.seed(1)
data <-
  purrr::map2_dfr(1:n_simu, rep(n_obs, n_simu), simu_observational_data) %>%
  group_by(simu_id)
# SuperLearner algorithms used later by the AIPW package
sl.lib <- c("SL.mean","SL.glm")
```
## IPW model
We specify the correct propensity score model for IPW, using both variables $X_1$ and $X_2$.
```{r}
prop_score_model <- function(data) {
  glm(treatment ~ var_1 + var_2, data = data, family = 'binomial')
}
```
```{r}
# The IPW estimator, paired with the correct propensity model
ipw_estimator <- function(data, model) {
  data %>%
    mutate(
      prob = predict(model, newdata = data, type = 'response')
    ) %>%
    summarise(
      # Expected outcomes under control and treatment via the IPW equation
      EY0_ipw = mean(outcome*(1 - treatment) / (1 - prob)),
      EY1_ipw = mean(outcome*treatment / prob)
    ) %>%
    # ACE estimate: E[Y^1] - E[Y^0]
    mutate(ipw = EY1_ipw - EY0_ipw)
}
```
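As a usage sketch (ours, for illustration; `one_data` is a hypothetical one-off dataset), the estimator can be applied to a single simulated dataset:
```{r}
# Illustrative sketch: one IPW estimate on a single simulated dataset
one_data <- simu_observational_data(simu_id = 0, n_obs = 200)
ipw_estimator(one_data, prop_score_model(one_data))
```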
Next we omit $X_2$ from the propensity score model, making it misspecified.
```{r}
prop_score_model_w <- function(data) {
  # OOPS! I forgot var_2
  glm(treatment ~ var_1, data = data, family = 'binomial')
}
```
```{r}
# The IPW estimator, paired with the misspecified propensity model
ipw_estimator_w <- function(data, model) {
  data %>%
    mutate(
      prob = predict(model, newdata = data, type = 'response')
    ) %>%
    summarise(
      # Expected outcomes under control and treatment via the IPW equation
      EY0_ipw_w = mean(outcome*(1 - treatment) / (1 - prob)),
      EY1_ipw_w = mean(outcome*treatment / prob)
    ) %>%
    # ACE estimate: E[Y^1] - E[Y^0]
    mutate(ipw_w = EY1_ipw_w - EY0_ipw_w)
}
```
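The same one-dataset sketch (again ours, illustrative; `one_data_w` is hypothetical) with the misspecified model hints at the problem: with a large sample, the estimate should settle above the true ACE of 10.
```{r}
# Illustrative sketch: the misspecified IPW estimate on one large dataset
one_data_w <- simu_observational_data(simu_id = 0, n_obs = 10000)
ipw_estimator_w(one_data_w, prop_score_model_w(one_data_w))
```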
## REG model
We specify the correct outcome regression model with both variables, $X_1$ and $X_2$.
```{r}
mean_outcome_model <- function(data) {
  glm(outcome ~ treatment + var_1 + var_2, data = data)
}
```
```{r}
# The REG estimator, paired with the correct outcome model
outcome_model_estimator <- function(data) {
  mean_model <- mean_outcome_model(data)  # Fit the outcome model
  summary(mean_model)$coefficients['treatment', ][1]  # Coefficient on treatment = ACE estimate
}
```
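Again as a one-line usage sketch (ours, illustrative):
```{r}
# Illustrative sketch: one REG estimate on a single simulated dataset
outcome_model_estimator(simu_observational_data(simu_id = 0, n_obs = 200))
```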
We omit $X_2$ from this model, making the REG model misspecified.
```{r}
mean_outcome_model_w <- function(data) {
  # MISSING var_2??!?!
  glm(outcome ~ var_1 + treatment, data = data)
}
```
```{r}
# The REG estimator, paired with the misspecified outcome model
outcome_model_estimator_w <- function(data) {
  mean_model <- mean_outcome_model_w(data)  # Fit the misspecified outcome model
  summary(mean_model)$coefficients['treatment', ][1]  # Coefficient on treatment
}
```
## Data Simulation
We run the full simulation, fitting all four model specifications to each of the 150 simulated datasets.
```{r}
n_simu <- 150
n_obs <- 200
set.seed(1)
nested_df <-
  purrr::map2_dfr(1:n_simu, rep(n_obs, n_simu), simu_observational_data) %>%
  group_by(simu_id) %>%
  nest() %>%
  mutate(
    # correct ipw model
    prop_model = map(data, prop_score_model),
    # incorrect ipw model
    prop_model_w = map(data, prop_score_model_w),
    # correct reg model
    mean_model = map(data, mean_outcome_model),
    # incorrect reg model
    mean_model_w = map(data, mean_outcome_model_w),
    # correct reg estimate
    model_estimate = map(data, outcome_model_estimator),
    # incorrect reg estimate
    model_estimate_w = map(data, outcome_model_estimator_w),
    # correct ipw estimate
    ipw_estimate = map2(data, prop_model, ipw_estimator),
    # incorrect ipw estimate
    ipw_estimate_w = map2(data, prop_model_w, ipw_estimator_w)
  ) %>%
  ungroup() %>%
  unnest(c(model_estimate, model_estimate_w, ipw_estimate, ipw_estimate_w))
```
We now have four fitted models: IPW (correct and incorrect) and REG (correct and incorrect); the doubly robust estimator (DRE) that combines them is introduced below.
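As a quick preview (a sketch of ours, using the columns created in the chunk above), the mean estimate across simulations already shows the bias pattern: the two correctly specified estimators should average close to 10, the misspecified ones above it.
```{r}
# Mean ACE estimate across the 150 simulations for each estimator
nested_df %>%
  summarise(
    reg_correct   = mean(model_estimate),
    reg_incorrect = mean(model_estimate_w),
    ipw_correct   = mean(ipw),
    ipw_incorrect = mean(ipw_w)
  )
```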
---
# Strategy 1: Inverse Probability Weighting (IPW)
## Intuition Behind it
- In an experiment, the probability of receiving the treatment is the same for all units (e.g., 50%)
- In the current case, the probability of receiving the treatment depends on variables that also affect the outcome
- If we knew these probabilities, we could re-weight our sample so that the data better matched a randomized experiment, creating the "pseudopopulation" mentioned earlier
- Reminder: the propensity score is the probability of receiving the treatment given the confounders $Z$, here estimated with a logistic regression
In our example, we proposed the correct IPW model:
```{r, eval = FALSE}
prop_score_model <- function(data) {
  glm(treatment ~ var_1 + var_2, data = data, family = 'binomial')
}
```
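A common diagnostic (our sketch, not part of the original activity; `ps_check` is a hypothetical dataset) is to check overlap: the estimated propensity scores should stay away from 0 and 1, otherwise the inverse weights explode.
```{r}
# Illustrative sketch: overlap/positivity check on one simulated dataset
ps_check <- simu_observational_data(simu_id = 0, n_obs = 1000)
ps_check$ps <- predict(prop_score_model(ps_check), type = 'response')
summary(ps_check$ps)  # with XB = 0.2*X_1 + 0.2*X_2, scores cluster near 0.5
```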
## Introducing IPW estimator
- Similar to a difference of means, but weights each observation inversely proportional to its probability of receiving the treatment it actually received, based on the propensity score $\widehat{PS}_i$
$$\hat{\delta}_{IPW} = \frac{1}{n}\sum_{i=1}^{n}\bigg[\frac{Y_{i}A_{i}}{\color{red}{\widehat{PS}_i}} - \frac{Y_{i}(1-A_{i})}{\color{red}{1-\widehat{PS}_i}}\bigg]$$
The IPW estimator below implements this equation and computes the ACE:
```{r, eval = FALSE}
ipw_estimator <- function(data, model) {
  data %>%
    mutate(
      prob = predict(model, newdata = data, type = 'response')
    ) %>%
    summarise(
      EY0_ipw = mean(outcome*(1 - treatment) / (1 - prob)),
      EY1_ipw = mean(outcome*treatment / prob)
    ) %>%
    mutate(ipw = EY1_ipw - EY0_ipw)
}
```
## IPW Performance
Plot for the correctly specified IPW:
```{r, echo = FALSE}
mean_ipw_estimator <- mean(nested_df$ipw)
nested_df %>%
  ggplot(aes(x = simu_id, y = ipw)) +
  ggtitle(paste('The mean of the ipw estimate (correctly specified) is', round(mean_ipw_estimator, 2))) +
  geom_point(color = 'blue') +
  geom_hline(yintercept = 10) +
  ylab("ACE estimate") +
  theme_bw()
```
The plot shows that the estimates are spread evenly around 10, and the mean of the correctly specified IPW estimates is essentially equal to the true ACE we set.
## IPW could fail
- If the model for A is misspecified, the propensity scores are biased and IPW fails. Below we plot the biased `ipw_w` results from `nested_df`.
```{r, echo = FALSE}
mean_ipw_estimator_w <- mean(nested_df$ipw_w)
nested_df %>%
  ggplot(aes(x = simu_id, y = ipw_w)) +
  ggtitle(paste('The mean of the ipw estimate (incorrectly specified) is', round(mean_ipw_estimator_w, 2))) +
  geom_point(color = 'blue') +
  geom_hline(yintercept = 10) +
  ylab("ACE estimate") +
  theme_bw()
```
The plot shows a clear upward shift: the estimates from the misspecified IPW model sit above the true ACE of 10 that we set at the beginning.
```{r, echo=F}
a <- ggplot(nested_df, aes(x = simu_id, y = ipw, group = 1)) +
  geom_boxplot() +
  geom_hline(yintercept = 10, color = 'red') +
  ylab("ACE estimate") +
  ylim(7, 13.5)
b <- ggplot(nested_df, aes(x = simu_id, y = ipw_w, group = 1)) +
  geom_boxplot() +
  geom_hline(yintercept = 10, color = 'red') +
  ylab("ACE estimate") +
  ylim(7, 13.5)
plot_grid(a, b, labels = c("Correct IPW", "Incorrect IPW"), ncol = 2, nrow = 1)
```
We overestimate the true ACE of 10 when the propensity score model is misspecified.
---
# Strategy 2: REG estimator (model the outcome)
The REG estimator models the outcome directly: since the outcome model is linear in the treatment, the fitted coefficient on `treatment` is the ACE estimate.
```{r, eval = FALSE}
outcome_model_estimator <- function(data) {
  mean_model <- mean_outcome_model(data)
  summary(mean_model)$coefficients['treatment', ][1]
}
```
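In the notation of the IPW section (our formulation, matching the code above), the REG estimator is simply the fitted treatment coefficient from the linear outcome model:
$$\hat{\delta}_{REG} = \hat{\beta}_A, \qquad \text{where } E[Y \mid A, Z] = \beta_0 + \beta_A A + \beta_1 Z_1 + \beta_2 Z_2$$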
## REG Performance
```{r, echo = FALSE}
mean_model_estimator <- mean(nested_df$model_estimate)
nested_df %>%
  ggplot(aes(x = simu_id, y = model_estimate)) +
  ggtitle(paste('The mean of the REG estimate (correctly specified) is', round(mean_model_estimator, 2))) +
  geom_point(color = 'blue') +
  geom_hline(yintercept = 10) +
  ylab("ACE estimate") +
  theme_bw()
```
When the outcome model is correctly specified, the REG estimator is unbiased for the ACE.
## REG could also fail
- If the model for Y is incorrect (recall that we left *var_2* out), the REG outcome model is biased and the estimator fails.
```{r, echo=F}
mean_model_estimator_w <- mean(nested_df$model_estimate_w)
nested_df %>%
  ggplot(aes(x = simu_id, y = model_estimate_w)) +
  ggtitle(paste('The mean of the REG estimate (incorrectly specified) is', round(mean_model_estimator_w, 2))) +
  geom_point(color = 'blue') +
  geom_hline(yintercept = 10) +
  ylab("ACE estimate") +
  theme_bw()
```
```{r, echo=F}
c <- ggplot(nested_df, aes(x = simu_id, y = model_estimate, group = 1)) +
  geom_boxplot() +
  geom_hline(yintercept = 10, color = 'red') +
  ylab("ACE estimate") +
  ylim(9, 11)
d <- ggplot(nested_df, aes(x = simu_id, y = model_estimate_w, group = 1)) +
  geom_boxplot() +
  geom_hline(yintercept = 10, color = 'red') +
  ylab("ACE estimate") +
  ylim(9, 11)
plot_grid(c, d, labels = c("Correct REG", "Incorrect REG"), ncol = 2, nrow = 1)
```
We also overestimate the true ACE of 10 when the REG model is misspecified.
---
# Combining Strategy 1 and 2
- If the propensity score model is incorrect, Strategy 1 will not work
- If the outcome model is incorrect, Strategy 2 will not work
- If we combine both approaches, we only need one of the two models to be correct, not both
- This is the "doubly robust" property!
## Doubly Robust Estimator
The DRE's estimate of the average causal effect can be summarized in this equation (it is the same as the one in the previous section, with slightly different notation):
$$\hat{\delta}_{DRE} = \frac{1}{n}\sum_{i=1}^{n}\bigg[\frac{Y_{i}A_{i} -\color{red}{(A_i-\pi(Z_{i}))\,\mu(Z_i, 1)}}{\pi(Z_{i})} - \frac{Y_{i}(1-A_{i}) +\color{red}{(A_i-\pi(Z_{i}))\,\mu(Z_i, 0)}}{1-\pi(Z_{i})}\bigg]$$
where $\pi(Z_i)=\widehat{PS}_i$ is the propensity score from IPW, and $\mu(Z_i, a)=E[Y|A=a,Z_i]$ is the prediction from the outcome regression model. The terms in red are said to *augment* the IPW estimator.
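Before turning to the `AIPW` package, here is a hand-rolled sketch of the equation above using the `glm`-based models we already defined. The helper `dre_estimator` and the dataset `dre_data` are ours, for illustration only; on one simulated dataset the estimate should land near the true ACE of 10.
```{r}
# Hand-rolled DRE sketch: augment the IPW terms with outcome-model predictions.
# `dre_estimator` is our own illustrative helper, not from the AIPW package.
dre_estimator <- function(data, ps_model, out_model) {
  pi_hat <- predict(ps_model, newdata = data, type = 'response')     # propensity scores
  mu1 <- predict(out_model, newdata = mutate(data, treatment = 1))   # predicted outcome under A = 1
  mu0 <- predict(out_model, newdata = mutate(data, treatment = 0))   # predicted outcome under A = 0
  A <- data$treatment
  Y <- data$outcome
  EY1 <- mean((Y*A - (A - pi_hat)*mu1) / pi_hat)
  EY0 <- mean((Y*(1 - A) + (A - pi_hat)*mu0) / (1 - pi_hat))
  EY1 - EY0
}
dre_data <- simu_observational_data(simu_id = 0, n_obs = 1000)
dre_estimator(dre_data, prop_score_model(dre_data), mean_outcome_model(dre_data))
```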
### When IPW and REG both correct
```{r}
AIPW_00 <- AIPW$new(Y = data$outcome,
                    A = data$treatment,
                    W = subset(data, select = c(var_1, var_2)), # Covariates for both the IPW and REG models
                    Q.SL.library = sl.lib, # Algorithms used for the outcome model (Q)
                    g.SL.library = sl.lib, # Algorithms used for the exposure model (g)
                    k_split = 10,          # Number of folds for splitting
                    verbose = FALSE)
AIPW_00$fit()
# Estimate the average causal effects
AIPW_00$summary(g.bound = 0.25)
dre_00 <- AIPW_00$estimates$risk_A1[1] - AIPW_00$estimates$risk_A0[1] # Calculate ACE
dre_00
```
This looks fine: the estimate is essentially unbiased.
---
Let's analyze the DRE's performance under the following settings:
### When IPW is Incorrect, REG is correct
Let's explore what happens to DRE when something goes wrong!
```{r}
AIPW_01 <- AIPW$new(Y = data$outcome,
                    A = data$treatment,
                    W.Q = subset(data, select = c(var_1, var_2)), # Covariates for the REG model
                    W.g = subset(data, select = var_1),           # Covariates for the IPW model (var_2 omitted)
                    Q.SL.library = sl.lib,
                    g.SL.library = sl.lib,
                    k_split = 10,
                    verbose = FALSE)
AIPW_01$fit()
AIPW_01$summary(g.bound = 0.25)
dre_01 <- AIPW_01$estimates$risk_A1[1] - AIPW_01$estimates$risk_A0[1]
dre_01
```
The estimate is around 10, indicating that the DRE remains unbiased when the IPW model is misspecified, as long as the REG model is correct.
---
### When REG is incorrect, IPW is correct
```{r}
AIPW_10 <- AIPW$new(Y = data$outcome,
                    A = data$treatment,
                    W.Q = subset(data, select = var_1),           # Covariates for the REG model (var_2 omitted)
                    W.g = subset(data, select = c(var_1, var_2)), # Covariates for the IPW model
                    Q.SL.library = sl.lib,
                    g.SL.library = sl.lib,
                    k_split = 10,
                    verbose = FALSE)
AIPW_10$fit()
AIPW_10$summary(g.bound = 0.25)
dre_10 <- AIPW_10$estimates$risk_A1[1] - AIPW_10$estimates$risk_A0[1]
dre_10
```
The estimate is again around 10, indicating that the DRE remains unbiased when the REG model is misspecified, as long as the IPW model is correct.
### When both REG and IPW are incorrect - DRE fails!!
Let's look at what happens to the DRE when both models are misspecified. We saw above how robust the DRE is: its unbiasedness holds even if one of the models is incorrect. That is not the case here!
```{r}
AIPW_11 <- AIPW$new(Y = data$outcome,
                    A = data$treatment,
                    W.Q = subset(data, select = var_1), # var_2 omitted from the REG model
                    W.g = subset(data, select = var_1), # var_2 omitted from the IPW model
                    Q.SL.library = sl.lib,
                    g.SL.library = sl.lib,
                    k_split = 10,
                    verbose = FALSE)
AIPW_11$fit()
AIPW_11$summary(g.bound = 0.25)
dre_11 <- AIPW_11$estimates$risk_A1[1] - AIPW_11$estimates$risk_A0[1]
dre_11
```
The estimate is above 10 (about 10.46), showing that the DRE is biased when both the REG and IPW models are misspecified.
**Looking at them together!**
```{r,echo=FALSE}
df <- data.frame(c(dre_00, dre_01, dre_10, dre_11))
rownames(df) <- c("Both Correct", "IPW Incorrect", "REG Incorrect", "Both Incorrect")
colnames(df) <- "ACE estimates"
df
```
---
# Conclusion
This simulation activity shows how effective the DRE is: its validity requires only that one of its two components, the outcome regression model (REG) or the inverse probability weighting model (IPW), is correctly specified. Its robustness is maintained when just one of the models is incorrect.
In earlier sections we also saw why the DRE is unbiased when either model is incorrect. When we only have non-randomized (observational) data, the methods introduced here can readily be applied!