Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cohort covariates #167

Merged
merged 15 commits into from
Jun 28, 2023
Merged

Cohort covariates #167

merged 15 commits into from
Jun 28, 2023

Conversation

schuemie
Copy link
Member

@schuemie schuemie commented Jun 3, 2022

Addresses this issue.

Adds covariates based on other cohorts. These can be binary or 'count', meaning the count of cohort occurrences within the time window. Both types are available per-person and aggregated. It is also possible to construct temporal covariates, which are always binary. The temporal covariates can also be fetched as aggregated statistics.

Included are unit tests and a vignette.

I believe cohort-based covariates are an important step in moving HADES away from custom coding per study (specifically as part of Strategus).

@schuemie schuemie requested a review from anthonysena June 3, 2022 12:52
@codecov
Copy link

codecov bot commented Jun 3, 2022

Codecov Report

Merging #167 (087a745) into develop (1e46b67) will increase coverage by 0.10%.
The diff coverage is 94.24%.

❗ Current head 087a745 differs from pull request most recent head 6aa0a4c. Consider uploading reports for the commit 6aa0a4c to get more accurate results

@@             Coverage Diff             @@
##           develop     #167      +/-   ##
===========================================
+ Coverage    93.30%   93.40%   +0.10%     
===========================================
  Files           15       16       +1     
  Lines         1165     1304     +139     
===========================================
+ Hits          1087     1218     +131     
- Misses          78       86       +8     
Impacted Files Coverage Δ
R/FeatureExtraction.R 60.00% <ø> (ø)
R/HelperFunctions.R 88.88% <40.00%> (-6.12%) ⬇️
R/GetCovariatesFromOtherCohorts.R 96.26% <96.26%> (ø)

... and 10 files with indirect coverage changes

Copy link
Collaborator

@anthonysena anthonysena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing these changes, I just wanted to confirm that the desired behavior here. From my review, my understanding is that we aim to detect if a target cohort table entry has a matching cohort_covariate for the subject_id and for the #covariate_cohort_ref.cohort_definition_id specified. To do this, the cohort.cohort_start_date is adjusted to see if covariate_cohort.cohort_start_date the that falls in a specified time window. For temporal, this time window will use the #time_ref to adjust the cohort_start_date of the target cohort to see if the cohort_covariate.cohort_start_date falls in that range and if it is "non-temporal" it will use the @start_day and @end_day respectively.

Just wanted to confirm this behavior and add a few more questions/comments before approving.

temporalEndDays = -365:-1,
includedCovariateIds = c()) {
errorMessages <- checkmate::makeAssertCollection()
checkmate::assertInt(analysisId, lower = 1, upper = 999, add = errorMessages)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also consider ensuring that the analysisId specified does not collide with pre-specified analysis IDs? This would apply for creating the temporal & non-temporal settings objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice, but I'm not sure at what point we can do that. Right now, multiple covariate settings are simply combined in a generic list, like here. That means the first place we can look for collisions is in the getDbCovariateData() function (for example here. It would definitely be a good idea to add a check there, because throwing an error is better than generating different covariates with the same covariate ID, which would lead to strange but perhaps undetectable behavior. Ideally we'd be able to warn the user earlier, but I don't see when.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR so the createSettings functions throw a warning when the analysis ID is also used for one of the pre-specified analyses.

In addition, I recommend somewhere in the future we make it so getDbCovariateData() throws an error if there are overlapping covariate IDs between covariate builders. I've created an issue for this.

@schuemie
Copy link
Member Author

schuemie commented Jun 7, 2022

Yes, I believe your summary is correct.

@@ -76,20 +75,20 @@ public class FeatureExtraction {
private static String ADD_DESCENDANTS_SQL = "SELECT descendant_concept_id AS id\nINTO @target_temp\nFROM @cdm_database_schema.concept_ancestor\nINNER JOIN @source_temp\n\tON ancestor_concept_id = id;\n\n";

public static void main(String[] args) {
//init("C:/Users/mschuemi/git/FeatureExtraction/inst");
init("C:/Users/mschuemi/git/FeatureExtraction/inst");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that look right @schuemie

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a function that is used by the Java developer for testing. It is not used by the package. So yes, seems right.

@gowthamrao
Copy link
Member

@anthonysena is this mature for use?

@anthonysena
Copy link
Collaborator

You can use it at your own risk - I reviewed this some time ago but I need to find some time to do a deeper dive and merge it into develop.

@gowthamrao
Copy link
Member

OK - i am going to give it a try to generate some content for these two forum posts

https://forums.ohdsi.org/t/phenotype-submission-transverse-myelitis/17769/6
https://forums.ohdsi.org/t/phenotype-submission-symptoms-and-signs/17895

@gowthamrao
Copy link
Member

I am curious to know, why valueType = "binary", parameter is supported for createCohortBasedCovariateSettings but not createCohortBasedTemporalCovariateSettings

@schuemie
Copy link
Member Author

schuemie commented Dec 8, 2022

No strong reason. I think at the time I imagined most people would use 1-day windows for temporal covariates, and you can only be in a cohort once per day. But of course you could have bigger windows.

Allowing count type temporal covariates would require some additions to the SQL.

@schuemie
Copy link
Member Author

schuemie commented Dec 8, 2022

Now added: 4aa04e4

@gowthamrao
Copy link
Member

gowthamrao commented Dec 9, 2022

@schuemie i ran the script and this Pull Request I think is going to be very valuable.

Example: I ran the branch on the cohorts in OHDSI Phenotype library. I selected cohortId = 19 which is 'Typical Pneumonia' and a set of Cohorts that maybe symptoms

cohortDefinitionSet <- PhenotypeLibrary::getPlCohortDefinitionSet(cohortIds = c(19,
                                                                                3, 5, 6, 10, 11, 13, 14, 20, 23))

cohortId cohortName

1 3 [P] Cough or Sputum (3Ps, 30Era)
2 5 [P] Dyspnea (7Pe, 90Era)
3 6 [P] Fever (3Pe, 30Era)
4 10 [P] Nausea or Vomiting (3Pe, 30Era)
5 19 [P] Acute Typical Pneumonia (14Pe, 30Era)
6 20 [P] Bronchitis (14Pe, 30Era)

I then created a composite cohort using CohortAlgebra

CohortAlgebra::unionCohorts(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = cdmSource$cohortDatabaseSchemaFinal,
  cohortTable = paste0(
    stringr::str_squish("pl_"),
    stringr::str_squish(cdmSource$sourceKey)
  ),
  oldToNewCohortId = dplyr::tibble(
    oldCohortId = c(3, 5, 6, 10, 20),
    newCohortId = 9999999
  ), 
  purgeConflicts = TRUE
)

and then used all the symptom cohorts and the composite cohort as a covariate

covariateSettings <-
FeatureExtraction::createCohortBasedTemporalCovariateSettings(
analysisId = 999,
covariateCohorts = cohortDefinitionSet %>%
dplyr::filter(cohortId != 19),
temporalStartDays = c(-365,-30, 0, 1, 31,-9999,-7),
temporalEndDays = c(-31,-1, 0, 30, 365, 9999,-1)
)

covariateData <-
FeatureExtraction::getDbCovariateData(
connection = DatabaseConnector::connect(connectionDetails),
cdmDatabaseSchema = cdmDatabaseSchema,
cohortDatabaseSchema = cohortDatabaseSchema,
cohortTable = cohortTable,
cohortId = 19,
rowIdField = "subject_id",
covariateSettings = covariateSettings,
aggregated = TRUE
)

I executed this on a US claims based data source

and got these results

image

This passes my "sniff test". i.e. the cohorts of dyspnea, cough, bronchitis are ranked higher in the typical pneumonia cohort. The composite of all these relevant symptoms and signs was found in 58% of the cohort.

@gowthamrao
Copy link
Member

Now, i want to understand how we can represent the relationship between the two cohorts (target cohort and covariate cohort) in terms of Allen's interval algebra.

https://en.wikipedia.org/wiki/Allen%27s_interval_algebra

Based on my review of this SQL

{@temporal} ? {
INNER JOIN #time_period time_period
ON covariate_cohort.cohort_start_date <= DATEADD(DAY, time_period.end_day, cohort.cohort_start_date)
WHERE CASE WHEN covariate_cohort.cohort_end_date IS NULL THEN covariate_cohort.cohort_start_date ELSE covariate_cohort.cohort_end_date END >= DATEADD(DAY, time_period.start_day, cohort.cohort_start_date)
} : {
WHERE covariate_cohort.cohort_start_date <= DATEADD(DAY, {@temporal_sequence} ? {@sequence_end_day} : {@end_day}, cohort.cohort_start_date)
{@start_day != 'anyTimePrior'} ? {
AND CASE WHEN covariate_cohort.cohort_end_date IS NULL THEN covariate_cohort.cohort_start_date ELSE covariate_cohort.cohort_end_date END >= DATEADD(DAY, {@temporal_sequence} ? {@sequence_start_day} : {@start_day}, cohort.cohort_start_date)
}

is logically equivalent to (i got this using SqlRender)

INNER JOIN #time_period time_period
ON covariate_cohort.cohort_start_date <= DATEADD(DAY, time_period.end_day, cohort.cohort_start_date)
WHERE covariate_cohort.cohort_end_date >= DATEADD(DAY, time_period.start_day, cohort.cohort_start_date)

which becomes
cov_cohort_start_date <= target_cohort_start_date AND cov_cohort_end_date >= target_cohort_start_date

@gowthamrao
Copy link
Member

gowthamrao commented Dec 9, 2022

Which in Allen's interval algebra form would be ANY of the four relationships between x and y

x meets y (or inverse - y is met by x)
x overlaps with y (or inverse - y is overlapped by x)
x starts y (or inverse - y is started by x)
x equals y (no inverse)

is that interpretation correct?

image

@schuemie
Copy link
Member Author

schuemie commented Dec 9, 2022

@gowthamrao: you're leaving out the window part (time_period.start_day and time_period.end_day). That SQL can be interpreted as:

covariate_cohort_start <= window_end AND covariate_cohort_end >= window_start

(remember that both the window start and end are defined relative to the target cohort start)

This means that my current implementation sets the feature to value 1 for a window if there is at least 1 day in the window when someone was in the covariate cohort.

@gowthamrao
Copy link
Member

gowthamrao commented Dec 9, 2022

you're leaving out the window part (time_period.start_day and time_period.end_day).

So may be lets go thru scenarios: Lets start with time_period.start_day = 0 and time_period.end_day = 0, given this

is this accuraate

  1. temporal relationship is

cov_cohort_start_date <= target_cohort_start_date AND cov_cohort_end_date >= target_cohort_start_date

  1. which in Allen's interval algebra ontology becomes

would be ANY of the four relationships between x and y

x meets y (or inverse - y is met by x)
x overlaps with y (or inverse - y is overlapped by x)
x starts y (or inverse - y is started by x)
x equals y (no inverse)

@schuemie
Copy link
Member Author

schuemie commented Dec 9, 2022

I think that is correct, except 'x meets y' seems to imply there's no overlap?

@gowthamrao
Copy link
Member

I think that is correct, except 'x meets y' seems to imply there's no overlap?

image

here cov_cohort_end_date = target_cohort_start_date i think

that would satisfy cov_cohort_start_date <= target_cohort_start_date AND cov_cohort_end_date >= target_cohort_start_date ?

@gowthamrao
Copy link
Member

@schuemie in my testing, i wanted to compute features using both cohort covariates and feature covariates for the same set of cohorts. These are my temporal covariate settings. My question is - can i combine these temporalCovariateSettings into one - so that I get one result set with results from both featureCovariates and cohortCovariates, or should i execute them serially one after the other, and then try to combine the results outside FeatureExtraction? I think i would prefer to be able to combine them - so that i get one result set.

# this is from CohortDiagnostics - its a temporalCovariateSetting
cohortDiagnosticsCovariateSettings <-
  CohortDiagnostics::getDefaultCovariateSettings()

this is new, using createCohortBasedTemporalCovariateSettings

cohortBasedCovariateSettings <-
  FeatureExtraction::cohortBasedCovariateSettings <-
  FeatureExtraction::createCohortBasedTemporalCovariateSettings(
    analysisId = 150,
    covariateCohortDatabaseSchema = covariateCohortDatabaseSchema,
    covariateCohortTable = covariateCohortTable,
    covariateCohorts = PhenotypeLibrary::getPhenotypeLog(cohortIds = c(1:10)) |>
      dplyr::select(cohortId, cohortName),
    valueType = "binary",
    temporalStartDays = cohortDiagnosticsCovariateSettings$temporalStartDays,
    temporalEndDays = cohortDiagnosticsCovariateSettings$temporalEndDays
  )(
    analysisId = 150,
    covariateCohortDatabaseSchema = covariateCohortDatabaseSchema,
    covariateCohortTable = covariateCohortTable,
    covariateCohorts = PhenotypeLibrary::getPhenotypeLog(cohortIds = c(1:10)) |>
      dplyr::select(cohortId, cohortName),
    valueType = "binary",
    temporalStartDays = cohortDiagnosticsCovariateSettings$temporalStartDays,
    temporalEndDays = cohortDiagnosticsCovariateSettings$temporalEndDays
  )

@schuemie
Copy link
Member Author

schuemie commented Mar 6, 2023

@gowthamrao , as always, you can combine cohort builders by putting their settings in a list. You just have to be careful they don't have overlapping analysis IDs, or else you could get colliding covariate IDs. So in your example, you'd do:

# this is from CohortDiagnostics - its a temporalCovariateSetting
covariateSettings1 <- CohortDiagnostics::getDefaultCovariateSettings()

covariateSettings2 <- FeatureExtraction::createCohortBasedTemporalCovariateSettings(
    analysisId = 150,
    covariateCohortDatabaseSchema = covariateCohortDatabaseSchema,
    covariateCohortTable = covariateCohortTable,
    covariateCohorts = PhenotypeLibrary::getPhenotypeLog(cohortIds = c(1:10)) |>
      dplyr::select(cohortId, cohortName),
    valueType = "binary",
    temporalStartDays = covariateSettings1$temporalStartDays,
    temporalEndDays = covariateSettings1$temporalEndDays
  )
covariateSettings <- list(covariateSettings1, covariateSettings2)

@gowthamrao
Copy link
Member

gowthamrao commented Mar 21, 2023

Documentation of function FeatureExtraction::getDbCohortBasedCovariatesData and assertCheck here

checkmate::assertInt(cohortId, add = errorMessages)
seems to imply that cohortId should only support 1 cohort per execution. However, the -1 suggests the function is able to support more than one cohortId at a time

cohortId For which cohort ID should covariates be constructed? If set to -1, covariates will be constructed for all cohorts in the specified cohort table.

This behavior is different from FeatureExtraction::getDbCovariateData where there is no assertCheck and is able to accept more than one cohortId

cohortId = -1,

Maybe the checkmate::assertInt(cohortId, add = errorMessages) should be changed to support an array of cohortIds?

See reproducible code

This seems to work fine


remotes::install_github("OHDSI/FeatureExtraction", ref = "cohortCovariates")
remotes::install_github("OHDSI/SkeletonCohortDiagnostics")

cohortDefinitionSet <-
  CohortGenerator::getCohortDefinitionSet(
    settingsFileName = "settings/CohortsToCreate.csv",
    jsonFolder = "cohorts",
    sqlFolder = "sql/sql_server",
    packageName = "SkeletonCohortDiagnosticsStudy",
    cohortFileNameValue = "cohortId"
  ) %>%  dplyr::tibble()

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
connection <-
  DatabaseConnector::connect(connectionDetails = connectionDetails)


cohortTableNames = CohortGenerator::getCohortTableNames(cohortTable = 'cohort')

# Execution ----
## Create cohort tables on remote ----
CohortGenerator::createCohortTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = 'main',
  cohortTableNames = cohortTableNames,
  incremental = FALSE
)
## Generate cohort on remote ----
cohortGenerated <- CohortGenerator::generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = 'main',
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitionSet,
  cohortDatabaseSchema = 'main',
  incremental = FALSE
)

FeatureExtractionSettings <- FeatureExtraction::createDefaultTemporalCovariateSettings()
FeatureExtractionSettingsCohortBasedCovariateSettings <-
  FeatureExtraction::createCohortBasedTemporalCovariateSettings(
    analysisId = 150,
    covariateCohortDatabaseSchema = 'main',
    covariateCohortTable = cohortTableNames$cohortTable,
    covariateCohorts = cohortDefinitionSet |>
      dplyr::filter(cohortId %in% c(14907)) |>
      dplyr::select(cohortId,
                    cohortName),
    valueType = "binary",
    temporalStartDays = FeatureExtractionSettings$temporalStartDays,
    temporalEndDays = FeatureExtractionSettings$temporalEndDays
  )
FeatureExtractionCovariateSettings <-
  list(
    FeatureExtractionSettingsCohortBasedCovariateSettings,
    FeatureExtraction::createDefaultTemporalCovariateSettings()
  )


featureExtractionOutput <-
  FeatureExtraction::getDbCovariateData(
    connection = connection,
    cdmDatabaseSchema = 'main',
    cohortDatabaseSchema = 'main',
    cohortTable = cohortTableNames$cohortTable,
    covariateSettings = FeatureExtractionCovariateSettings,
    aggregated = TRUE
  )

however, if in the last step, i subset to only selected cohortIds as

featureExtractionOutput <-
  FeatureExtraction::getDbCovariateData(
    connection = connection,
    cdmDatabaseSchema = 'main',
    cohortDatabaseSchema = 'main',
    cohortTable = cohortTableNames$cohortTable,
    cohortId = cohortDefinitionSet[1:2, ]$cohortId,
    covariateSettings = FeatureExtractionCovariateSettings,
    aggregated = TRUE
  )

then i get this error

Error in FeatureExtraction::getDbCovariateData(connection = connection, :
1 assertions failed:

  • Variable 'cohortId': Must have length 1.
    2: checkmate::assertIntegerish(temporalStartDays, add = errorMessages)
    1: FeatureExtraction::createCohortBasedTemporalCovariateSettings(analysisId = 150,
    covariateCohortDatabaseSchema = "main", covariateCohortTable = cohortTableNames$cohortTable,
    covariateCohorts = dplyr::select(dplyr::filter(cohortDefinitionSet,
    cohortId %in% c(14907)), cohortId, cohortName), valueType = "binary",
    temporalStartDays = FeatureExtractionSettingsCohortDiagnostics$temporalStartDays,
    temporalEndDays = FeatureExtractionSettingsCohortDiagnostics$temporalEndDays)

The difference being

cohortId = cohortDefinitionSet[1:2, ]$cohortId,

@schuemie
Copy link
Member Author

Yes @gowthamrao , I think you're right. The cohortId field in any covariate builder (including getDbCovariateData()) allows a vector of cohort IDs, but is still called cohortId (singular) for historical reasons. I just forgot when creating the input checks for getDbCohortBasedCovariatesData(). I'll modify the check, and add a unit test for multiple cohort IDs.

@anthonysena : perhaps it is time to introduce a cohortIds argument, and deprecate cohortId?

@ginberg
Copy link
Collaborator

ginberg commented May 11, 2023

@schuemie @anthonysena this is looking good to me. I am just wondering 2 things:

  • can we make the CohortBasedCovariatesVignetteDataFetch such that it uses a public database? We recently updated vignetteDataFetch so it uses Eunomia and the data can be generated easily by anyone
  • should we update the feature extraction jar version to 3.3.0? Since that is the package version in the develop branch

@anthonysena anthonysena merged commit cf4a169 into develop Jun 28, 2023
@anthonysena anthonysena deleted the cohortCovariates branch June 28, 2023 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add covariate builder that uses cohorts to build (binary) features
4 participants