Ontology-guided Augmentation and Synthesis for ICD-9 Coding.

Addressing concept sparsity -- either due to surface-form variety (synonymy) or concept absence in the training data -- in ICD-9 coding with the aid of medical ontologies. Designed to work with the discharge summaries in the MIMIC-III dataset.

Library Requirements

Essential

numpy pandas owlready2

Non-essential

tqdm

External Resources

ICD-9 graph as seen in the CoPHE repository MedCAT MIMIC-III or a similar corpus with ICD-9 labels SemEHR UMLS

Scripts

string_manipulation.py This script cotains the augment method for creating strings through replacing stated substrings on specified positions with alternatives.

synonym_setup.py Sets up the synonym conversion table for data augmentation given a dataset and its corresponding NER output.

adjacent_setup.py Sets up the conversion table for adjacent concepts for data synthesis (specifying the unspecified).

augmentation_and_sythesis.py Runs the actual augmentation and synthesis routines given a dataset and conversion tables.

Use

First prepare your data (e.g., MIMIC-III), your UMLS distirbution, and your NER engine (e.g., SemEHR, or MedCAT).

Run synonym_setup.py the using raw labelled data in CSV format (path goes to disch-csv_path) and the output of your NER engine. A sample list of entities in the accepted format is presented in sample_entities.csv (these were derived from a freely accessible discharge summary in MT Samples). This script produces a synonym dictionary (syns.csv) and a CSV that combines the NER output with the synonyms (ner_output_with_syns.csv). This should have you set for augmentation.

Run adjacent_setup.py with specitying the path to your dataset and the path to the ICD-9 graph. This will create a conversion table between the assign codes and their siblings/cousins.

Run augmentation_and_synthesis.py, this actually performs the augmentation and synthesis. Sections of code related to each of these is indicated via comments and printouts, if you wish to only do one of these, just remove/comment them out of the main function. Note that the augmentation and sythesis involve some randomness in picking replacements, hence to make your augmentation/synthesis reproducible please use a random seed (this is currently set to 50). Furthermore, there is some value in running augmentation/synthesis multiple times and then de-duplicating the result as multiple synonyms/related codes may be used for replacement in the same situation. This behaviour is already implemented in the case of run_synthesis_adj with the iters parameter.

Theoretical background

For a theoretical background behind this implementation, please refer to the Horses to Zebras paper. An informal summary can be found in Summary.md.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
augmentation_and_synthesis		augmentation_and_synthesis
setup		setup
README.md		README.md
Summary.md		Summary.md
demo_mentions.csv		demo_mentions.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ontology-guided Augmentation and Synthesis for ICD-9 Coding.

Library Requirements

Essential

Non-essential

External Resources

Scripts

Use

Theoretical background

About

Releases

Packages

Languages

modr00cka/Ontology-Guided-Augmentation-and-Synthesis

Folders and files

Latest commit

History

Repository files navigation

Ontology-guided Augmentation and Synthesis for ICD-9 Coding.

Library Requirements

Essential

Non-essential

External Resources

Scripts

Use

Theoretical background

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages