This repository annotates GEO datasets with DrugBank and Cellosaurus labels, then downloads and sex-labels all of the data.
The code directory is set up as follows:
- GEO metadata is downloaded using 00_get_list_all_gse.R
  - output (in data/01_sample_lists/):
    * <>_gse_gsm.csv
    * gse_<>.csv
    * gse_gsm_all_geo_dedup.csv
    * gse_all_geo_info.csv
- 01_cell_labeling/
  - 00_parse_cellosaurus.py
    - input: Cellosaurus XML, located in data/00_db_data/cellosaurus.xml
    - output: data/00_db_data/cellosaurus.json
  - 01_process_cell_df.R
    - output: data/00_db_data/cellosaurus_df.txt
  - 02_cell_line_labeling.R
    - uses data/01_sample_lists/gse_all_geo_info.csv
    - output: data/02_labeled_data/cell_line_mapped_gse.txt
- 02_drug_labeling/
  - 00_drugbank_synonyms.py
    - input: drugbank.xml
    - output: data/00_db_data/drugbank_info.json
  - 01_process_drugbank.R
    - output: data/00_db_data/drugbank_parsed.txt
  - 02_drug_gse_labeling.R
    - uses data/01_sample_lists/gse_all_geo_info.csv
    - output: data/02_labeled_data/drugbank_mapped_gse.txt
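Both labeling directories start with a Python script that turns a database XML dump into JSON. The sketch below illustrates that general XML-to-JSON step only; it is not the code in 00_parse_cellosaurus.py or 00_drugbank_synonyms.py. The element names (entries, entry, name, synonym), the IDs, and the function xml_to_records are hypothetical, and the real Cellosaurus and DrugBank schemas use different, richer structures.

```python
import json
import xml.etree.ElementTree as ET

# Toy XML with made-up element names (an assumption for illustration);
# the real Cellosaurus and DrugBank files use different tags.
SAMPLE_XML = """
<entries>
  <entry id="X0001">
    <name>HeLa</name>
    <synonym>Hela</synonym>
    <synonym>HELA</synonym>
  </entry>
  <entry id="X0002">
    <name>MCF-7</name>
    <synonym>MCF7</synonym>
  </entry>
</entries>
"""


def xml_to_records(xml_text):
    """Map each entry id to its primary name and list of synonyms."""
    root = ET.fromstring(xml_text)
    records = {}
    for entry in root.iter("entry"):
        records[entry.get("id")] = {
            "name": entry.findtext("name"),
            "synonyms": [s.text for s in entry.findall("synonym")],
        }
    return records


records = xml_to_records(SAMPLE_XML)

# Serialize to JSON text; a real script would write this to a file instead.
json_text = json.dumps(records, indent=2, sort_keys=True)
print(json_text)
```

Keeping the primary name and the synonyms under one ID gives the later R steps a single record per cell line or drug to flatten into a table.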
The download step uses the NCBI Aspera client and relies on wrenlab software packages, so setup is slightly complicated.
sbatch 00_download_gse_wrapper.sh ${ID} ${GSE_LIST}
This runs download_geo_chunk.sh, which runs downloadGEO.py for each individual GSE.
We do this for ID set to "mouse", "rat", and "human", using the files "data/01_sample_lists/gse_${ID}.csv".
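As a rough sketch, the three submissions can be generated rather than typed by hand. The snippet below only builds and prints the sbatch command lines; it does not submit anything. It assumes that GSE_LIST is the path to the per-organism gse_${ID}.csv file and that those files sit in the directory the metadata step writes to (data/01_sample_lists); list_dir is a parameter so it can be adjusted. The function name build_sbatch_commands is hypothetical.

```python
import shlex

ORGANISMS = ["mouse", "rat", "human"]


def build_sbatch_commands(organisms, list_dir="data/01_sample_lists"):
    """Return one sbatch argument list per organism ID."""
    commands = []
    for organism in organisms:
        # Assumption: GSE_LIST is the per-organism GSE list file.
        gse_list = f"{list_dir}/gse_{organism}.csv"
        commands.append(
            ["sbatch", "00_download_gse_wrapper.sh", organism, gse_list]
        )
    return commands


commands = build_sbatch_commands(ORGANISMS)
for cmd in commands:
    # Print a shell-safe command line for review before submitting.
    print(shlex.join(cmd))
```

To actually submit the jobs, each argument list could be passed to subprocess.run on a machine where sbatch is available.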
// TODO: download GPL scripts
This directory processes downloaded GSEs using exprsex based on a reference list of files.
00_convert_to_mat.sh
01_label_mat.sh
02_combine_mat.sh
03_train_test_divide.R
04_run_meta.sh
05_train_test_mat.sh
Required files:
- GEOMetadb (update the GEOMetadb path in utils to point to it)
- DrugBank and Cellosaurus XML files in the data/00_db_data directory
- jake_stopwords file