The scripts are contained in the two directories:
publication
: Experiments (mainly model training) with the LC-MS²Struct SSVM modelcomparison
: Experiments with the comparison methods (e.g. RT filtering, XLogP3, ...)
There is a subdirectory massbank
for each setting which contains the run-scripts.
All experiments with the LC-MS²Struct implementation can be performed with the script run_with_gridsearch.py
.
The input parameters are specified in load_experiment_parameter.sh
(for the manuscript experiment version 4 was used). For each conducted experiment the complete list of command line interface parameters can be found, for example,
here (ALLDATA setting) or here (ONLYSTEREO).
Such parameter files are outputted by each experiment can be found alongside the results in results_raw
folder
at the root of this repository (please note that for that you need to download the result files from the Zenodo:
ALLDATA and ONLYSTEREO).
For our manuscript, we ran the experiments on a cluster. If you want to reproduce the full setting, then this can be
done by adapting the run_massbank__with_deps.sh
to your cluster. This script trains eight (8) SSVM models for each
evaluation set (eval_set_id
). In total this are about 350 jobs for ALLDATA and about 100 jobs for ONLYSTEREO.
The script furthermore produces the files in results_raw
, which we uploaded to Zenodo:
- Predicted max-marginal scores for each MassBank-subset and sampled LC-MS² dataset
- Experimental parameters
- List of spectra ids used for evaluation
- ...
You can compare your output to files on Zenodo.
Here we will give a minimal example, which can serve as a starting point to re-produce our experiments. If you want to
apply a trained LC-MS²Struct model to new data, you can train a model as done in run_with_gridsearch.py
, but you need to implement your own candidate set wrapper.
To run our minimal example, please follow the following instructions. We assume that you are at the root of this repository. Please note that all code was developed and tested in a Linux environment. Other operating systems are not supported.
- Install the LC-MS²Struct described in the repository
- Make sure that the conda environment you installed LC-MS²Struct into is activated
- Download our massbank.sqlite.gz database file and unpack it:
gunzip /path/to/your/massbank.sqlite.gz
## Will produce the file: /path/to/your/massbank__with_pubchemlite.sqlite
- Change to directory containing the run-script:
cd run_scripts/publication/massbank
- Run the training and max-margin prediction for two SSVM model:
NUMBA_NUM_THREADS=2;OMP_NUM_THREADS=2;OPENBLAS_NUM_THREADS=2;
python run_with_gridsearch.py 0 0 \
--n_jobs=2 \
--n_threads_test_prediction=4 \
--debug=1 \
--n_samples_train=128 \
--max_n_train_candidates=25 \
--ms2scorer=cfmid4__norm \
--C_grid 64 \
--n_epochs=2 \
--db_fn=/path/to/your/massbank__with_pubchemlite.sqlite \
--training_dataset=massbank
and
NUMBA_NUM_THREADS=2;OMP_NUM_THREADS=2;OPENBLAS_NUM_THREADS=2;
python run_with_gridsearch.py 0 1 \
--n_jobs=2 \
--n_threads_test_prediction=4 \
--debug=1 \
--n_samples_train=128 \
--max_n_train_candidates=25 \
--ms2scorer=cfmid4__norm \
--C_grid 64 \
--n_epochs=2 \
--db_fn=/path/to/your/massbank__with_pubchemlite.sqlite \
--training_dataset=massbank
Running this scripts will take about 25 minutes on a modern 4 core machine. You might need to adapt the
*_NUM_THREADS
variables and n_jobs
setting with
respect to the actual number of your physical cores, i.e. *_NUM_THREADS
x n_jobs
= number of physical
cores available. Please also have a look on the cluster scripts (run_massbank__with_deps.sh
and _run_massbank.sh
)
to see we configured the thread settings given a specific amount of resources.
- The output of the above runs is stored in
debugging/massbank
ls debugging/massbank
## Expected output
# debug__ds=AC_003__lloss_mode=mol_feat_fps__mol_feat=FCFP__binary__all__3D__mol_id=cid__ms2scorer=cfmid4__norm__ssvm_flavor=default
- The max-marginals are stored in the respective SSVM models' subdirectories (
ssvm_model_idx
). The aggregated margins can be computed using:
python combine_margins_from_different_ssvm_models.py \
/path/to/this/repository/run_scripts/publication/massbank/debugging/massbank \
cfmid4__norm \
--candidate_aggregation_identifier=inchikey1 \
--write_out_averaged_margins \
--debug
Note that you need to give the absolute path to the location of the max-marginals' base-directory.
- After running the above command the directory structure will look like:
tree
## Expected output:
# debugging/
# └── massbank
# └── debug__ds=AC_003__lloss_mode=mol_feat_fps__mol_feat=FCFP__binary__all__3D__mol_id=cid__ms2scorer=cfmid4__norm__ssvm_flavor=default
# ├── combined__cand_agg_id=inchikey1__marg_agg_fun=average
# │ ├── marginals__spl=0.pkl.gz
# │ ├── top_k__max_models.tsv
# │ └── top_k.tsv
# ├── ssvm_model_idx=0
# │ ├── eval_spec_ids__spl=0.list
# │ ├── grid_search_results__spl=0.tsv
# │ ├── marginals__spl=0.pkl.gz
# │ ├── parameters__spl=0.list
# │ ├── top_k__spl=0.tsv
# │ └── train_stats__spl=0.tsv
# └── ssvm_model_idx=1
# ├── eval_spec_ids__spl=0.list
# ├── grid_search_results__spl=0.tsv
# ├── marginals__spl=0.pkl.gz
# ├── parameters__spl=0.list
# ├── top_k__spl=0.tsv
# └── train_stats__spl=0.tsv
Here, combined
refers to the aggregated max-margins (averaged across the SSVM models) and the candidate scores
being collapsed based on cand_agg_id
. The top-k accuracies (comparing Only MS² and LC-MS²Struct (MS + RT))
performance are stored as well (top_k.tsv
), whereby all evaluation subsets for the particular MassBank
subset are stored in a single file.
Scripts containing the implementations of the methods we compare with in our manuscript.