Add covariate builder that uses cohorts to build (binary) features #96

schuemie · 2020-06-03T08:15:37Z

The default covariate builder is mostly based on the occurrence of concepts (or their ancestors). We have a covariate builder based on cohort attributes, but not one based on cohorts.

The builder could create binary features based on the occurrence of a user-defined set of cohorts in some user-specified time window.

gowthamrao · 2020-06-03T10:58:07Z

@schuemie in this case, we will have four dates to combine:

two cohort_start_date values - one for the cohort of interest (c), and one for the feature cohort (f).
two cohort_end_date values - one for the cohort of interest (c), and one for the feature cohort (f).

Features may then be constructed based on

f.cohort_start_date is within
-- +/- x days of c.cohort_start_date
-- +/- x days of c.cohort_end_date
-- +/- x days of c.cohort_start_date & +/- x days of c.cohort_end_date
f.cohort_end_date is within
-- +/- x days of c.cohort_start_date
-- +/- x days of c.cohort_end_date
-- +/- x days of c.cohort_start_date & +/- x days of c.cohort_end_date
both f.cohort_start_date and f.cohort_end_date are within
-- +/- x days of c.cohort_start_date
-- +/- x days of c.cohort_end_date
-- +/- x days of c.cohort_start_date & +/- x days of c.cohort_end_date

Plus - we may want to support observation_period in relation to f.cohort dates, or c.cohort dates.

Do you anticipate the covariate builder to support all these scenarios?

schuemie · 2020-06-03T11:52:33Z

No, I was thinking of adhering to the current pattern used in FeatureExtraction, so allowing the user to specify 1 start and 1 end date relative to the index date (= start of the cohort of interest), and maybe an option to choose if the feature cohort start should be in the lookback window, or whether there needs to be overlap with the lookback window.
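To make the two options in the comment above concrete, here is a minimal base R sketch. The function `featureInWindow()` and its `mode` argument are hypothetical names made up for illustration, not part of FeatureExtraction; the sketch only assumes that the window is defined by a start day and an end day relative to the index date.

```r
# Hypothetical helper (not a FeatureExtraction function): decides whether one
# feature-cohort entry qualifies for one target-cohort entry.
featureInWindow <- function(indexDate, featureStart, featureEnd,
                            startDay = -365, endDay = 0,
                            mode = c("startInWindow", "overlap")) {
  mode <- match.arg(mode)
  windowStart <- indexDate + startDay
  windowEnd <- indexDate + endDay
  if (mode == "startInWindow") {
    # The feature cohort must start inside the lookback window
    featureStart >= windowStart & featureStart <= windowEnd
  } else {
    # The feature cohort only needs to overlap the lookback window
    featureStart <= windowEnd & featureEnd >= windowStart
  }
}

indexDate <- as.Date("2020-06-01")
# Made-up feature cohort entry: starts before the window, ends inside it
featureStart <- as.Date("2018-06-01")
featureEnd <- as.Date("2020-02-22")

startOnly <- featureInWindow(indexDate, featureStart, featureEnd, mode = "startInWindow")
overlap <- featureInWindow(indexDate, featureStart, featureEnd, mode = "overlap")
```

In this example the entry is missed when the start must fall in the window but picked up when overlap is enough, which is the difference the proposed option would control.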

jreps · 2020-06-04T13:25:53Z

I have code in the PLP skeletons; we do this for the simple models: https://github.com/OHDSI/StudyProtocolSandbox/blob/master/SkeletonPredictionStudy/R/CohortCovariateCode.R (it even enables counts rather than binary). In a recent study I also included an age interaction: https://github.com/ohdsi-studies/Covid19PredictionStudies/blob/master/CovidVulnerabilityIndex/R/CohortCovariateCode.R

This is an example of running it: https://github.com/OHDSI/PredictionComparison/blob/andromeda/R/atriaModel.R

javier-gracia-tabuenca-tuni · 2021-01-15T08:23:11Z

I think I solved this by duplicating DomainConcept.sql and replacing domain_table with cohort_table.
Then I created a temporary table with the list of cohort_ids and cohort_names, which is used to calculate the overlap and name the covariates.

The modified DomainConcept.sql, renamed to CohortOverlap.sql:
CohortOverlap.sql.txt

It can be used like this:

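library(dplyr)              # tibble(), filter(), %>%
library(FeatureExtraction)  # createAnalysisDetails(), createDetailedTemporalCovariateSettings()

# Assumes `connection` and `oracleTempSchema` already exist in the session.

# cohortSetReference as the table used in CohortDiagnostics
# cohorts defined in atlas-demo.ohdsi.org
cohortSetReference <- tibble(
  atlasId = c(1776012, 1776013, 1776018),
  atlasName = c("Asthma", "Last condition", "First condition"),
  cohortId = c(1776012, 1776013, 1776018),
  name = c("Asthma", "Last condition", "First condition")
)

# create the lookup table on the server (tempTable = FALSE, so it persists as a regular table)
overlapCohortsTable_name <- "tmp_table_cohorts_overlap"
overlapCohortsTable <- cohortSetReference %>% filter(cohortId %in% c(1776013, 1776018))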

DatabaseConnector::insertTable(
  connection,
  tableName = overlapCohortsTable_name,
  data = as.data.frame(overlapCohortsTable),
  dropTableIfExists = TRUE,
  createTable = TRUE,
  tempTable = FALSE,
  oracleTempSchema = oracleTempSchema)


cohortOverlap <- createAnalysisDetails(
  analysisId = 51,
  sqlFileName = "CohortOverlap.sql",
  parameters = list(
    analysisId = 51,
    analysisName = "Cohort overlap", 
    domainId = "Cohort overlap", 
    #domainTable = cohortTable, 
    #domainConceptId = "cohort_definition_id",
    domain_start_date = "cohort_start_date",
    domain_end_date = "cohort_end_date", 
    #
    cohort_ids = "1776013, 1776018", 
    overlap_cohorts_table = overlapCohortsTable_name
  ),
  includedCovariateConceptIds = c(),
  addDescendantsToInclude = FALSE,
  excludedCovariateConceptIds = c(),
  addDescendantsToExclude = FALSE,
  includedCovariateIds = c()
)
  


detTempCovSet <- createDetailedTemporalCovariateSettings(
  analyses = list(
    cohortOverlap
  ), 
  temporalStartDays = c(-365,   0,  365*5,   365*10,  365*20),
  temporalEndDays = c(  -1, 365*5, 365*10,   365*20,   365*50)
)

I can write a more reproducible example if you need it.

schuemie · 2021-04-27T08:50:50Z

@jreps's code has a very nice implementation that allows you to create different kinds of features of the same cohort, like binary, counts, etc. We should reuse that here.

schuemie · 2022-05-25T13:57:57Z

I'll start working on this, as I need it for a project.

anthonysena · 2022-05-25T14:45:38Z

OK, thanks for the heads-up here @schuemie. Do you plan to use the implementation from PLP as you suggested in the earlier comment? I am just curious about the design at a high level.

schuemie · 2022-05-26T09:47:58Z

Here's my initial version: fddbdd6

Some of the thinking so far:

Two types of covariates: binary and cohort count. These should cover a lot of use cases. I did not implement @jreps's interactions with age. I'm not sure interactions should be implemented for each covariate separately. Instead, we could have some post-processor that can create interaction terms from existing covariates.
Support for all the different flavors of covariates: non-aggregated or aggregated, temporal or non-temporal (not yet temporal-sequence, may leave that for later). I think we need all these flavors across HADES. (e.g. temporal aggregated is used by CohortDiagnostics).
Although in the covariate settings the user can specify the table and database schema for the cohorts used to derive the covariates, that is a bit awkward as it mixes analysis specifications with execution specifications. The default is therefore to leave these empty. The covariate builder automatically uses the same cohort table as used for the main cohorts. I can modify CohortMethod to set these to the same table where the exposure cohorts live before the settings reach the covariate builder. PLP could do the same.
I'm currently using the 'detailed covariate settings' functions to avoid having to modify getDbDefaultCovariateData().
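As a rough illustration of the two covariate types mentioned above (binary and cohort count), the following base R sketch computes both from toy data. It is not the code from the commit: `cohortCovariate()` and its arguments are hypothetical, the data are made up, and the only assumption taken from the thread is that a feature cohort entry counts when its start date falls in a window relative to the start of the cohort of interest.

```r
# Toy data; column names follow the OMOP cohort table, values are invented.
targetCohort <- data.frame(
  subject_id = c(1, 2, 3),
  cohort_start_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01"))
)
featureCohort <- data.frame(
  subject_id = c(1, 1, 2),
  cohort_start_date = as.Date(c("2019-03-01", "2019-11-15", "2017-05-01"))
)

# Hypothetical helper, not a FeatureExtraction function.
cohortCovariate <- function(target, feature, startDay = -365, endDay = 0,
                            valueType = c("binary", "count")) {
  valueType <- match.arg(valueType)
  # Rename the feature start date so it does not clash after the join
  names(feature)[names(feature) == "cohort_start_date"] <- "feature_start_date"
  joined <- merge(target, feature, by = "subject_id")
  # Days between feature cohort start and index date (negative = before index)
  daysFromIndex <- as.numeric(joined$feature_start_date - joined$cohort_start_date)
  inWindow <- joined[daysFromIndex >= startDay & daysFromIndex <= endDay, ]
  # Count qualifying feature entries per subject, keeping subjects with none
  value <- as.numeric(table(factor(inWindow$subject_id, levels = target$subject_id)))
  if (valueType == "binary") {
    value <- as.numeric(value > 0)
  }
  data.frame(subject_id = target$subject_id, covariate_value = value)
}

binaryCovariate <- cohortCovariate(targetCohort, featureCohort, valueType = "binary")
countCovariate <- cohortCovariate(targetCohort, featureCohort, valueType = "count")
```

The sketch returns a row for every subject, including zeros, purely for readability; a sparse representation that omits zero-valued rows would carry the same information.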

schuemie added enhancement help wanted labels Jun 3, 2020

anthonysena self-assigned this Oct 27, 2020

schuemie self-assigned this May 25, 2022

schuemie mentioned this issue Jun 3, 2022

Cohort covariates #167

Merged

anthonysena added this to the v3.3.0 milestone Jun 20, 2022

anthonysena mentioned this issue Jun 20, 2022

Grouping Variables #28

Closed

anthonysena linked a pull request Jan 10, 2023 that will close this issue

Cohort covariates #167

Merged

ginberg self-assigned this May 9, 2023

anthonysena closed this as completed Jun 28, 2023
