-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cohort covariates #167
Cohort covariates #167
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #167 +/- ##
===========================================
+ Coverage 93.30% 93.40% +0.10%
===========================================
Files 15 16 +1
Lines 1165 1304 +139
===========================================
+ Hits 1087 1218 +131
- Misses 78 86 +8
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing these changes, I just wanted to confirm that the desired behavior here. From my review, my understanding is that we aim to detect if a target cohort
table entry has a matching cohort_covariate
for the subject_id
and for the #covariate_cohort_ref.cohort_definition_id
specified. To do this, the cohort.cohort_start_date
is adjusted to see if covariate_cohort.cohort_start_date
the that falls in a specified time window. For temporal, this time window will use the #time_ref
to adjust the cohort_start_date
of the target cohort to see if the cohort_covariate.cohort_start_date
falls in that range and if it is "non-temporal" it will use the @start_day
and @end_day
respectively.
Just wanted to confirm this behavior and add a few more questions/comments before approving.
temporalEndDays = -365:-1, | ||
includedCovariateIds = c()) { | ||
errorMessages <- checkmate::makeAssertCollection() | ||
checkmate::assertInt(analysisId, lower = 1, upper = 999, add = errorMessages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also consider ensuring that the analysisId
specified does not collide with pre-specified analysis IDs? This would apply for creating the temporal & non-temporal settings objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be nice, but I'm not sure at what point we can do that. Right now, multiple covariate settings are simply combined in a generic list, like here. That means the first place we can look for collisions is in the getDbCovariateData()
function (for example here. It would definitely be a good idea to add a check there, because throwing an error is better than generating different covariates with the same covariate ID, which would lead to strange but perhaps undetectable behavior. Ideally we'd be able to warn the user earlier, but I don't see when.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR so the createSettings functions throw a warning when the analysis ID is also used for one of the pre-specified analyses.
In addition, I recommend somewhere in the future we make it so getDbCovariateData()
throws an error if there are overlapping covariate IDs between covariate builders. I've created an issue for this.
Yes, I believe your summary is correct. |
@@ -76,20 +75,20 @@ public class FeatureExtraction { | |||
private static String ADD_DESCENDANTS_SQL = "SELECT descendant_concept_id AS id\nINTO @target_temp\nFROM @cdm_database_schema.concept_ancestor\nINNER JOIN @source_temp\n\tON ancestor_concept_id = id;\n\n"; | |||
|
|||
public static void main(String[] args) { | |||
//init("C:/Users/mschuemi/git/FeatureExtraction/inst"); | |||
init("C:/Users/mschuemi/git/FeatureExtraction/inst"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that look right @schuemie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a function that is used by the Java developer for testing. It is not used by the package. So yes, seems right.
@anthonysena is this mature for use? |
You can use it at your own risk - I reviewed this some time ago but I need to find some time to do a deeper dive and merge it into develop. |
OK - i am going to give it a try to generate some content for these two forum posts https://forums.ohdsi.org/t/phenotype-submission-transverse-myelitis/17769/6 |
I am curious to know, why
|
No strong reason. I think at the time I imagined most people would use 1-day windows for temporal covariates, and you can only be in a cohort once per day. But of course you could have bigger windows. Allowing count type temporal covariates would require some additions to the SQL. |
Now added: 4aa04e4 |
@schuemie i ran the script and this Pull Request I think is going to be very valuable. Example: I ran the branch on the cohorts in OHDSI Phenotype library. I selected cohortId = 19 which is 'Typical Pneumonia' and a set of Cohorts that maybe symptoms
I then created a composite cohort using CohortAlgebra
and then used all the symptom cohorts and the composite cohort as a covariate
I executed this on a US claims based data source and got these results This passes my "sniff test". i.e. the cohorts of dyspnea, cough, bronchitis are ranked higher in the typical pneumonia cohort. The composite of all these relevant symptoms and signs was found in 58% of the cohort. |
Now, i want to understand how we can represent the relationship between the two cohorts (target cohort and covariate cohort) in terms of Allen's interval algebra. https://en.wikipedia.org/wiki/Allen%27s_interval_algebra Based on my review of this SQL FeatureExtraction/inst/sql/sql_server/CohortBasedBinaryCovariates.sql Lines 35 to 43 in 4aa04e4
is logically equivalent to (i got this using SqlRender)
which becomes |
@gowthamrao: you're leaving out the window part (time_period.start_day and time_period.end_day). That SQL can be interpreted as: covariate_cohort_start <= window_end AND covariate_cohort_end >= window_start (remember that both the window start and end are defined relative to the target cohort start) This means that my current implementation sets the feature to value 1 for a window if there is at least 1 day in the window when someone was in the covariate cohort. |
So may be lets go thru scenarios: Lets start with time_period.start_day = 0 and time_period.end_day = 0, given this is this accuraate
|
I think that is correct, except 'x meets y' seems to imply there's no overlap? |
@schuemie in my testing, i wanted to compute features using both cohort covariates and feature covariates for the same set of cohorts. These are my temporal covariate settings. My question is - can i combine these temporalCovariateSettings into one - so that I get one result set with results from both featureCovariates and cohortCovariates, or should i execute them serially one after the other, and then try to combine the results outside FeatureExtraction? I think i would prefer to be able to combine them - so that i get one result set.
this is new, using
|
@gowthamrao , as always, you can combine cohort builders by putting their settings in a list. You just have to be careful they don't have overlapping analysis IDs, or else you could get colliding covariate IDs. So in your example, you'd do: # this is from CohortDiagnostics - its a temporalCovariateSetting
covariateSettings1 <- CohortDiagnostics::getDefaultCovariateSettings()
covariateSettings2 <- FeatureExtraction::createCohortBasedTemporalCovariateSettings(
analysisId = 150,
covariateCohortDatabaseSchema = covariateCohortDatabaseSchema,
covariateCohortTable = covariateCohortTable,
covariateCohorts = PhenotypeLibrary::getPhenotypeLog(cohortIds = c(1:10)) |>
dplyr::select(cohortId, cohortName),
valueType = "binary",
temporalStartDays = covariateSettings1$temporalStartDays,
temporalEndDays = covariateSettings1$temporalEndDays
)
covariateSettings <- list(covariateSettings1, covariateSettings2) |
Documentation of function
This behavior is different from FeatureExtraction/R/GetCovariates.R Line 72 in 4aa04e4
Maybe the See reproducible code This seems to work fine
however, if in the last step, i subset to only selected cohortIds as
then i get this error
The difference being
|
Yes @gowthamrao , I think you're right. The @anthonysena : perhaps it is time to introduce a |
@schuemie @anthonysena this is looking good to me. I am just wondering 2 things:
|
Addresses this issue.
Adds covariates based on other cohorts. These can be binary or 'count', meaning the count of cohort occurrences within the time window. Both types are available per-person and aggregated. It is also possible to construct temporal covariates, which are always binary. The temporal covariates can also be fetched as aggregated statistics.
Included are unit tests and a vignette.
I believe cohort-based covariates are an important step in moving HADES away from custom coding per study (specifically as part of Strategus).