---
title: "Statistical Model Building Strategies for Cardiologic Applications - New results and future challenges"
editor:
markdown:
wrap: 80
---
*Statistical Model Building Strategies for Cardiologic Applications* (SAMBA) is
an ongoing German-Austrian joint research project funded by the Austrian
Science Fund (FWF) ([grant number
I4739-B](https://www.fwf.ac.at/en/research-radar/10.55776/I4739) to Daniela
Dunkler) and the Deutsche Forschungsgemeinschaft (DFG) (grant number
RA-2347/8-1 to Geraldine Rauch and Heiko Becher). The Austrian part of the
project was completed in August 2024.
## Abstract
**Wider research context**
Statistical models that adequately describe disease progression and treatment
response are essential for the development, improvement and evaluation of
therapies in all medical fields. The aims of such models are of an explanatory,
descriptive or predictive nature, and the methodological needs and challenges
differ between these aims. In medical applications, researchers often do not
focus on a single aim; rather, the interest lies in a combination of them.
Generally, the development of a valid model relies on the identification of a
meaningfully sized set of explanatory variables and the specification of
adequate functional forms. Intensive statistical research on both aspects has
been performed for decades. However, the results of this research are only
poorly incorporated into clinical research.
**Objectives**
This interdisciplinary project intends to build a bridge between statistical
research on model building strategies and the implementation of this
methodology in actual medical research. The project aims at

1. identifying deficiencies in multivariable models that were developed for
   cardiologic applications with respect to statistical model building
   (selection of variables and functional forms),
2. building advanced statistical models for four typical cardiologic
   research questions by applying state-of-the-art methodology,
3. developing and evaluating new methods to correct for overestimation bias
   arising in data-driven model building, and
4. providing guidance on model building strategies that is understandable
   for applied researchers.

In this project, we focus on descriptive models, potentially combined with the
aim of making the best possible prediction under possible sample size
limitations. Thus, the project concentrates on models whose regression
coefficients are interpretable from a medical point of view.
**Approach**
From a statistical point of view, the aim is to identify, discuss and improve
the current standards applied in clinical research with respect to model
building and variable identification. We will particularly address the impact
of sample size and highlight options and limitations as they occur in
real-life situations. From a medical point of view, the aim is to gain new
medical insights from statistical models that are built with better
methodology. We will use several original data sources from cardiovascular
studies and combine them with results from the corresponding medical
literature.
**Innovation**
As a comprehensive result, we will be able to deduce methodologically improved
and valid model building strategies for each of the four exemplary applications.
## Current and former SAMBA participants
- Akbari, Nilufar [(Charité Universitätsmedizin
Berlin)](https://biometrie.charite.de/metas/person/person/address_detail/msc_nilufar_akbari/)
- Bach, Paul
- Becher, Heiko [(Universitätsklinikum
  Heidelberg)](https://www.klinikum.uni-heidelberg.de/heidelberger-institut-fuer-global-health/groups-projects/working-groups/epidemiology-and-biostatistics/staffmem/members/becherh)
- Dunkler, Daniela [(Medical University of
Vienna)](https://www.meduniwien.ac.at/researcher/daniela_dunkler)
- Gregorich, Mariella [(Medical University of
Vienna)](https://www.meduniwien.ac.at/researcher/mariella_gregorich)
- Hafermann, Lorena [(Charité Universitätsmedizin
Berlin)](https://biometrie.charite.de/metas/person/person/address_detail/lorena_hafermann/)
- Heinze, Georg [(Medical University of
Vienna)](https://www.meduniwien.ac.at/researcher/georg_heinze)
- Herrmann, Carolin [(Charité Universitätsmedizin
Berlin)](https://biometrie.charite.de/metas/person/person/address_detail/phd_carolin_herrmann/)
- Kammer, Michael [(Medical University of
Vienna)](https://www.meduniwien.ac.at/researcher/kammer_michael)
- Madern, Moritz (Medical University of Vienna)
- Pamminger, Moritz (Medical University of Vienna)
- Rauch, Geraldine [(Technische Universität
  Berlin)](https://www.tu.berlin/ueber-die-tu-berlin/organisation/universitaetsleitung/praesidentin)
- Schilhart-Wallisch, Christine (Agentur für Gesundheit und
Ernährungssicherheit, Wien)
- Ullmann, Theresa [(Medical University of
Vienna)](https://www.meduniwien.ac.at/researcher/theresa_ullmann)
## SAMBA Workshop 2023 in Vienna
In November 2023, we organized a two-day [SAMBA
workshop](workshop.qmd#sec.workshop) at the Medical University of Vienna. Two
international speakers, Maarten van Smeden and Tobias Pischon, were invited and
gave the keynote speeches. All SAMBA participants presented results from SAMBA
projects, and a panel discussion on the topic *Complex Longitudinal Studies:
Needs and Challenges in the Interaction of Biostatistics and Epidemiology*
completed the workshop. The public part of the workshop could also be attended
virtually to allow international participation. In the end, 56 participants
attended the workshop on-site and 82 followed the event online.

## Selected output of the SAMBA project
### Peer-reviewed publications
1. [Bach, P, Wallisch C, Klein N, Hafermann L, Sauerbrei W, Steyerberg E W,
Heinze G, Rauch G, and Stratos initiative for topic group 2: Systematic
review of education and practical guidance on regression modeling for
medical researchers who lack a strong statistical background: Study
protocol. PLoS ONE 15, No. 12 (2020):
e0241427](https://doi.org/10.1371/journal.pone.0241427).\
*Abstract:*\
In the last decades, statistical methodology has developed rapidly, in
particular in the field of regression modeling. Multivariable regression
models are applied in almost all medical research projects. Therefore, the
potential impact of statistical misconceptions within this field can be
enormous. Indeed, the current theoretical statistical knowledge is not always
adequately transferred to the current practice in medical statistics. Some
medical journals have identified this problem and published isolated
statistical articles and even whole series thereof. In this systematic
review, we aim to assess the current level of education on regression
modeling that is provided to medical researchers via series of statistical
articles published in medical journals. The present manuscript is a protocol
for a systematic review that aims to assess which aspects of regression
modeling are covered by statistical series published in medical journals
that intend to train and guide applied medical researchers with limited
statistical knowledge. Statistical paper series cannot easily be summarized
and identified by common keywords in an electronic search engine like
Scopus. We therefore identified series by a systematic request to
statistical experts who are part of or related to the STRATOS Initiative
(STRengthening Analytical Thinking for Observational Studies). Within each
identified article, two raters will independently check the content of the
articles with respect to a predefined list of key aspects related to
regression modeling. The content analysis of the topic-relevant articles
will be performed using a predefined report form to assess the content as
objectively as possible. Any disputes will be resolved by a third reviewer.
Summary analyses will identify potential methodological gaps and
misconceptions that may have an important impact on the quality of analyses
in medical research. This review will thus provide a basis for future
guidance papers and tutorials in the field of regression modeling which will
enable medical researchers 1) to interpret publications in a correct way, 2)
to perform basic statistical analyses in a correct way and 3) to identify
situations when the help of a statistical expert is required.
2. [Wallisch C, Dunkler D, Rauch G, de Bin R, Heinze G: Selection of variables
for multivariable models: Opportunities and limitations in quantifying model
stability by resampling. Statistics in Medicine (2021)
40:369-381](https://doi.org/10.1002/sim.8779).\
*Abstract:* \
Statistical models are often fitted to obtain a concise description of the
association of an outcome variable with some covariates. Even if
background knowledge is available to guide preselection of covariates,
stepwise variable selection is commonly applied to remove irrelevant ones.
This practice may introduce additional variability and selection is rarely
certain. However, these issues are often ignored and model stability is not
questioned. Several resampling-based measures were proposed to describe
model stability, including variable inclusion frequencies (VIFs), model
selection frequencies, relative conditional bias (RCB), and root mean
squared difference ratio (RMSDR). The latter two were recently proposed to
assess bias and variance inflation induced by variable selection. Here, we
study the consistency and accuracy of resampling estimates of these measures
and the optimal choice of the resampling technique. In particular, we
compare subsampling and bootstrapping for assessing stability of linear,
logistic, and Cox models obtained by backward elimination in a simulation
study. Moreover, we exemplify the estimation and interpretation of all
suggested measures in a study on cardiovascular risk. The VIF and the model
selection frequency are only consistently estimated in the subsampling
approach. By contrast, the bootstrap is advantageous in terms of bias and
precision for estimating the RCB as well as the RMSDR. However, unbiased
estimation of the latter quantity requires independence of covariates, which
is rarely encountered in practice. Our study stresses the importance of
addressing model stability after variable selection and shows how to cope
with it.
3. [Gregorich M, Strohmaier S, Dunkler D, Heinze G: Regression with Highly
Correlated Predictors: Variable Omission Is Not the Solution. International
Journal of Environmental Research and Public Health (2021)
18(8)](https://doi.org/10.3390/ijerph18084259).\
*Abstract:* \
Regression models have been in use for decades to explore and quantify the
association between a dependent response and several independent variables
in environmental sciences, epidemiology and public health. However,
researchers often encounter situations in which some independent variables
exhibit high bivariate correlation, or may even be collinear. Improper
statistical handling of this situation will most certainly generate models
of little or no practical use and misleading interpretations. By means of
two example studies, we demonstrate how diagnostic tools for collinearity or
near-collinearity may fail in guiding the analyst. Instead, the most
appropriate way of handling collinearity should be driven by the research
question at hand and, in particular, by the distinction between predictive
or explanatory aims.
4. [Hafermann L, Becher H, Herrmann C, Klein N, Heinze G, Rauch G: Statistical
Model Building: Background "Knowledge" Based on Inappropriate Preselection
Causes Misspecification. BMC Med Res Methodol (2021)
21(1):196](https://doi.org/10.1186/s12874-021-01373-z).\
*Abstract:*\
Background: Statistical model building requires selection of variables for a
model depending on the model's aim. In descriptive and explanatory models, a
common recommendation often met in the literature is to include all
variables in the model which are assumed or known to be associated with the
outcome independent of their identification with data driven selection
procedures. An open question is how reliable this assumed "background
knowledge" truly is. In fact, "known" predictors might be findings from
preceding studies which may also have employed inappropriate model building
strategies.\
Methods: We conducted a simulation study assessing the influence of treating
variables as "known predictors" in model building when in fact this
knowledge resulting from preceding studies might be insufficient. Within
randomly generated preceding study data sets, model building with variable
selection was conducted. A variable was subsequently considered as a "known"
predictor if a predefined number of preceding studies identified it as
relevant.\
Results: Even if several preceding studies identified a variable as a "true"
predictor, this classification is often false positive. Moreover, variables
not identified might still be truly predictive. This especially holds true
if the preceding studies employed inappropriate selection methods such as
univariable selection.\
Conclusions: The source of "background knowledge" should be evaluated with
care. Knowledge generated in preceding studies can cause misspecification.
5. [Wallisch C, Agibetov A, Dunkler D, Haller M, Samwald M, Dorffner G, Heinze
G: The Roles of Predictors in Cardiovascular Risk Models - a Question of
Modeling Culture? BMC Medical Research Methodology (2021)
21(1):284](https://doi.org/10.1186/s12874-021-01487-4).\
*Abstract:*\
Background: While machine learning (ML) algorithms may predict
cardiovascular outcomes more accurately than statistical models, their
result is usually not representable by a transparent formula. Hence, it is
often unclear how specific values of predictors lead to the predictions. We
aimed to demonstrate with graphical tools how predictor-risk relations in
cardiovascular risk prediction models fitted by ML algorithms and by
statistical approaches may differ, and how sample size affects the stability
of the estimated relations.\
Methods: We reanalyzed data from a large registry of 1.5 million
participants in a national health screening program. Three data analysts
developed analytical strategies to predict cardiovascular events within 1
year from health screening. This was done for the full data set and with
gradually reduced sample sizes, and each data analyst followed their
favorite modeling approach. Predictor-risk relations were visualized by
partial dependence and individual conditional expectation plots.\
Results: When comparing the modeling algorithms, we found some similarities
between these visualizations but also occasional divergence. The smaller the
sample size, the more the predictor-risk relation depended on the modeling
algorithm used, and sampling variability also played an increased role.
Predictive performance was similar if the models were derived on the full
data set, whereas smaller sample sizes favored simpler models.\
Conclusion: Predictor-risk relations from ML models may differ from those
obtained by statistical models, even with large sample sizes. Hence,
predictors may assume different roles in risk prediction models. As long as
sample size is sufficient, predictive accuracy is not largely affected by
the choice of algorithm.
6. [Hafermann L, Klein N, Rauch G, Kammer M, Heinze G: Using Background
Knowledge from Preceding Studies for Building a Random Forest Prediction
Model: A Plasmode Simulation Study. Entropy (Basel) (2022)
24(6)](https://doi.org/10.3390/e24060847).\
*Abstract:* \
There is an increasing interest in machine learning (ML) algorithms for
predicting patient outcomes, as these methods are designed to automatically
discover complex data patterns. For example, the random forest (RF)
algorithm is designed to identify relevant predictor variables out of a
large set of candidates. In addition, researchers may also use external
information for variable selection to improve model interpretability and
variable selection accuracy, and thereby prediction quality. However, it is
unclear to which extent, if at all, RF and ML methods may benefit from
external information. In this paper, we examine the usefulness of external
information from prior variable selection studies that used traditional
statistical modeling approaches such as the Lasso, or suboptimal methods
such as univariate selection. We conducted a plasmode simulation study based
on subsampling a data set from a pharmacoepidemiologic study with nearly
200,000 individuals, two binary outcomes and 1152 candidate predictor
(mainly sparse binary) variables. When the scope of candidate predictors was
reduced based on external knowledge, RF models achieved better calibration,
that is, better agreement of predictions and observed outcome rates.
However, prediction quality measured by cross-entropy, AUROC or the Brier
score did not improve. We recommend appraising the methodological quality of
studies that serve as an external information source for future prediction
model development.
7. [Kammer M, Dunkler D, Michiels S, Heinze G: Evaluating Methods for Lasso
Selective Inference in Biomedical Research: A Comparative Simulation Study.
BMC Medical Research Methodology (2022)
22(1):206](https://doi.org/10.1186/s12874-022-01681-y).\
*Abstract:*\
Background: Variable selection for regression models plays a key role in the
analysis of biomedical data. However, inference after selection is not
covered by classical statistical frequentist theory, which assumes a fixed
set of covariates in the model. This leads to over-optimistic selection and
replicability issues.\
Methods: We compared proposals for selective inference targeting the
submodel parameters of the Lasso and its extension, the adaptive Lasso:
sample splitting, selective inference conditional on the Lasso selection
(SI), and universally valid post-selection inference (PoSI). We studied the
properties of the proposed selective confidence intervals available via R
software packages using a neutral simulation study inspired by real data
commonly seen in biomedical studies. Furthermore, we present an exemplary
application of these methods to a publicly available dataset to discuss
their practical usability.\
Results: Frequentist properties of selective confidence intervals by the SI
method were generally acceptable, but the claimed selective coverage levels
were not attained in all scenarios, in particular with the adaptive Lasso.
The actual coverage of the extremely conservative PoSI method exceeded the
nominal levels, and this method also required the greatest computational
effort. Sample splitting achieved acceptable actual selective coverage
levels, but the method is inefficient and leads to less accurate point
estimates. The choice of inference method had a large impact on the
resulting interval estimates, thereby necessitating that the user is acutely
aware of the goal of inference in order to interpret and communicate the
results.\
Conclusions: Despite violating nominal coverage levels in some scenarios,
selective inference conditional on the Lasso selection is our recommended
approach for most cases. If simplicity is strongly favoured over efficiency,
then sample splitting is an alternative. If only few predictors undergo
variable selection (i.e. up to 5) or the avoidance of false positive claims
of significance is a concern, then the conservative approach of PoSI may be
useful. For the adaptive Lasso, SI should be avoided and only PoSI and
sample splitting are recommended. In summary, we find selective inference
useful to assess the uncertainties in the importance of individual selected
predictors for future applications.
8. [Wallisch C, Bach P, Hafermann L, Klein N, Sauerbrei W, Steyerberg E W,
Heinze G, Rauch G and Stratos initiative topic group 2. Review of guidance
papers on regression modeling in statistical series of medical journals.
PLoS ONE 17, No. 1 (2022):
e0262918.](https://doi.org/10.1371/journal.pone.0262918).\
*Abstract:* \
Although regression models play a central role in the analysis of medical
research projects, there still exist many misconceptions on various aspects
of modeling leading to faulty analyses. Indeed, the rapidly developing
statistical methodology and its recent advances in regression modeling do
not seem to be adequately reflected in many medical publications. This
problem of knowledge transfer from statistical research to application was
identified by some medical journals, which have published series of
statistical tutorials and (shorter) papers mainly addressing medical
researchers. The aim of this review was to assess the current level of
knowledge with regard to regression modeling contained in such statistical
papers. We searched for target series by a request to international
statistical experts. We identified 23 series including 57 topic-relevant
articles. Within each article, two independent raters analyzed the content
by investigating 44 predefined aspects on regression modeling. We assessed
to what extent the aspects were explained and if examples, software advice,
and recommendations for or against specific methods were given. Most series
(21/23) included at least one article on multivariable regression. Logistic
regression was the most frequently described regression type (19/23),
followed by linear regression (18/23), Cox regression and survival models
(12/23) and Poisson regression (3/23). Most general aspects on regression
modeling, e.g. model assumptions, reporting and interpretation of regression
results, were covered. We did not find many misconceptions or misleading
recommendations, but we identified relevant gaps, in particular with respect
to addressing nonlinear effects of continuous predictors, model
specification and variable selection. Specific recommendations on software
were rarely given. Statistical guidance should be developed for nonlinear
effects, model specification and variable selection to better support
medical researchers who perform or interpret regression analyses.
9. [Akbari N, Heinze G, Rauch G, Sander B, Becher H, Dunkler D: Causal Model
Building in the Context of Cardiac Rehabilitation: A Systematic Review.
International Journal of Environmental Research and Public Health (2023)
20(4):3182](https://doi.org/10.3390/ijerph20043182).\
*Abstract:* \
Randomization is an effective design option to prevent bias from confounding
in the evaluation of the causal effect of interventions on outcomes.
However, in some cases, randomization is not possible, making subsequent
adjustment for confounders essential to obtain valid results. Several
methods exist to adjust for confounding, with multivariable modeling being
among the most widely used. The main challenge is to determine which
variables should be included in the causal model and to specify appropriate
functional relations for continuous variables in the model. While the
statistical literature gives a variety of recommendations on how to build
multivariable regression models in practice, this guidance is often unknown
to applied researchers. We set out to investigate the current practice of
explanatory regression modeling to control confounding in the field of
cardiac rehabilitation, for which mainly non-randomized observational
studies are available. In particular, we conducted a systematic methods
review to identify and compare statistical methodology with respect to
statistical model building in the context of the existing recent systematic
review CROS-II, which evaluated the prognostic effect of cardiac
rehabilitation. CROS-II identified 28 observational studies, which were
published between 2004 and 2018. Our methods review revealed that 24 (86%)
of the included studies used methods to adjust for confounding. Of these, 11
(46%) mentioned how the variables were selected and two studies (8%)
considered functional forms for continuous variables. The use of background
knowledge for variable selection was barely reported and data-driven
variable selection methods were applied frequently. We conclude that in the
majority of studies, the methods used to develop models to investigate the
effect of cardiac rehabilitation on outcomes do not meet common criteria for
appropriate statistical model building and that reporting often lacks
precision.
10. [Ullmann T, Heinze G, Hafermann L, Schilhart-Wallisch C, Dunkler D, for TG2
of the STRATOS initiative. Evaluating Variable Selection Methods for
Multivariable Regression Models: A Simulation Study Protocol. PLoS ONE 19,
No. 8 (2024): e0308543.](https://doi.org/10.1371/journal.pone.0308543).\
*Abstract:* \
Researchers often perform data-driven variable selection when modeling the
associations between an outcome and multiple independent variables in
regression analysis. Variable selection may improve the interpretability,
parsimony and/or predictive accuracy of a model. Yet variable selection can
also have negative consequences, such as false exclusion of important
variables or inclusion of noise variables, biased estimation of regression
coefficients, underestimated standard errors and invalid confidence
intervals, as well as model instability. While the potential advantages and
disadvantages of variable selection have been discussed in the literature
for decades, few large-scale simulation studies have neutrally compared
data-driven variable selection methods with respect to their consequences
for the resulting models. We present the protocol for a simulation study
that will evaluate different variable selection methods: forward selection,
stepwise forward selection, backward elimination, augmented backward
elimination, univariable selection, univariable selection followed by
backward elimination, and penalized likelihood approaches (Lasso, relaxed
Lasso, adaptive Lasso). These methods will be compared with respect to false
inclusion and/or exclusion of variables, consequences on bias and variance
of the estimated regression coefficients, the validity of the confidence
intervals for the coefficients, the accuracy of the estimated variable
importance ranking, and the predictive performance of the selected models.
We consider both linear and logistic regression in a low-dimensional setting
(20 independent variables with 10 true predictors and 10 noise variables).
The simulation will be based on real-world data from the National Health and
Nutrition Examination Survey (NHANES). Publishing this study protocol ahead
of performing the simulation increases transparency and allows integrating
the perspective of other experts into the study design.
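Several of the publications above, in particular numbers 2 and 10, study the stability of data-driven variable selection by resampling. As a minimal illustrative sketch of the underlying idea (an invented toy example in Python, not the papers' actual code; all names and parameter choices are assumptions), one can repeat backward elimination on bootstrap resamples and count how often each candidate variable is retained:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Invented toy data: 5 candidate predictors, only the first two truly matter.
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def backward_elimination(X, y, alpha=0.157):
    """Drop the least significant predictor until all remaining p-values < alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (len(y) - Xk.shape[1])
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
        t = np.abs(beta[1:] / se[1:])                    # skip the intercept
        # two-sided p-values from the normal approximation
        pvals = np.array([2 * (1 - 0.5 * (1 + erf(ti / sqrt(2)))) for ti in t])
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break
        keep.pop(worst)
    return keep

# Bootstrap inclusion frequencies: repeat the selection on resampled data sets.
B = 100
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap: draw n rows with replacement
    for j in backward_elimination(X[idx], y[idx]):
        counts[j] += 1

incl_freq = counts / B
print(np.round(incl_freq, 2))   # true predictors should show frequencies near 1
```

Noise variables typically show inclusion frequencies well below those of the true predictors, which is the kind of stability information the papers formalize.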
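Publication 3 discusses diagnostics for collinearity such as the variance inflation factor (VIF). The following small sketch (an invented example, not taken from the paper) computes VIFs from first principles for a setting with two nearly collinear predictors:

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented example: x1 and x2 are nearly collinear measures of the same quantity.
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j regresses column j on all others."""
    vifs = []
    for j in range(X.shape[1]):
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

vifs = variance_inflation_factors(X)
print(np.round(vifs, 1))   # x1 and x2 show very large VIFs, x3 stays near 1
```

As the paper argues, a large VIF flags the problem but does not by itself tell the analyst whether omitting a variable is appropriate; that depends on the research question.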
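Publication 7 compares selective-inference approaches for the Lasso, among them sample splitting. The sketch below illustrates only the sample-splitting idea, using a minimal hand-rolled Lasso (iterative soft-thresholding) so the example stays self-contained; it is an invented toy, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented toy data: 8 candidate predictors, two with real effects.
n, p = 400, 8
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal Lasso via iterative soft-thresholding (for illustration only)."""
    beta = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2   # squared spectral norm = Lipschitz constant
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

# Step 1: variable selection with the Lasso on the first half of the data.
half = n // 2
selected = np.flatnonzero(np.abs(lasso_ista(X[:half], y[:half], lam=40.0)) > 1e-8)

# Step 2: classical OLS inference on the second half. Because this half played
# no role in the selection, the usual confidence intervals remain valid.
Xh = np.column_stack([np.ones(n - half), X[half:][:, selected]])
yh = y[half:]
beta, *_ = np.linalg.lstsq(Xh, yh, rcond=None)
resid = yh - Xh @ beta
sigma2 = resid @ resid / (len(yh) - Xh.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xh.T @ Xh)))
ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])
print("selected columns:", selected)
print("95% CIs (first row is the intercept):")
print(np.round(ci, 2))
```

The price of this validity, as the paper notes, is efficiency: only half of the data is used at each stage, which widens the intervals relative to methods that condition on the selection event.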
### Educational shiny app
An educational Shiny app for exploring the use of fractional polynomials,
B-splines and natural splines to estimate non-linear associations of an outcome
of interest with a continuous explanatory variable is available at
<https://clinicalbiometrics.shinyapps.io/Bendyourspline>.
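The app covers, among other things, first-degree fractional polynomials (FP1). As a rough sketch of how such a transformation can be selected (an invented Python example, unrelated to the app's own code), one can try each power from the conventional FP set and keep the best-fitting one:

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented example: the outcome depends on the log of a positive exposure.
n = 300
x = rng.uniform(0.5, 5.0, size=n)
y = np.log(x) + 0.3 * rng.normal(size=n)

# Conventional set of first-degree fractional-polynomial powers;
# by FP convention, power 0 denotes the log transformation.
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_transform(x, power):
    return np.log(x) if power == 0 else x ** power

def best_fp1(x, y):
    """Pick the FP1 power whose linear fit has the smallest residual sum of squares."""
    best_power, best_rss = None, np.inf
    for power in POWERS:
        A = np.column_stack([np.ones(len(x)), fp_transform(x, power)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = ((y - A @ beta) ** 2).sum()
        if rss < best_rss:
            best_power, best_rss = power, rss
    return best_power, best_rss

power, rss = best_fp1(x, y)
print("selected power:", power)   # log (power 0) or a nearby power is expected here
```

Higher-degree fractional polynomials and splines extend this idea with more flexible bases, which is exactly what the app lets users compare interactively.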
## Funding
FWF Austrian Science Fund: <https://www.fwf.ac.at/en/>\
DFG Deutsche Forschungsgemeinschaft: <https://www.dfg.de/en/>