Here we describe the scripts used to generate / load the:
- Candidate sets
- MS² matching scores (CFM-ID, MetFrag and SIRIUS)
- Molecular fingerprints and descriptors
- ...
into our SQLite database (DB). We start with the initial SQLite DB: massbank__2020.11__v0.6.1.sqlite. This DB was generated as described in the Methods section "Pre-processing pipeline for raw MassBank records" of the manuscript, based on the MassBank release 2020.11 and using the "massbank2db" package (version 0.6.1). Please note that the latest package version is 0.9.0, but the code for parsing and grouping the MassBank records has remained unchanged.
To re-generate the database, the following Python packages need to be installed (preferably into a conda environment):
- "massbank2db": Contains the routines to convert the MassBank spectra to the input format of the in silico tools.
- "rosvm": Provides functionality to compute the molecule fingerprint feature representations.
- "ssvm": Provides functionality to convert counting fingerprints into binary representations for the efficient computation of MinMax kernels on integer vectors.
- "matchms": Provides routines to compute the similarity between MS² spectra needed for the CFM-ID score computation.
- "rdkit": Provides routines to compute the molecular descriptor features.
An R installation is required to compute the ClassyFire molecule classes. Furthermore, the following packages need to be installed:
- "classyfireR": An interface to the ClassyFire RESTful API
- "RSQLite": SQLite Interface for R
Note: The scripts modifying the DB always create a copy of the DB and add the new information (e.g. scores or features) to the copy, while preserving the original DB. This is purely a precaution, and you can modify this behavior.
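The copy-then-extend behavior can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code of the actual scripts. The function name `copy_and_extend` and the table `demo_scores` are hypothetical and only serve the example.

```python
import shutil
import sqlite3


def copy_and_extend(src_db: str, dst_db: str) -> None:
    """Copy 'src_db' to 'dst_db' and add new information to the copy only."""
    # The original DB file is only read here, never opened for writing.
    shutil.copyfile(src_db, dst_db)

    conn = sqlite3.connect(dst_db)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Hypothetical table and row, for illustration only.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS demo_scores (candidate TEXT, score REAL)"
            )
            conn.execute(
                "INSERT INTO demo_scores VALUES (?, ?)", ("FMGSKLZLMKYGDP", 0.5)
            )
    finally:
        conn.close()
```

Calling `copy_and_extend("original.sqlite", "extended.sqlite")` leaves `original.sqlite` untouched. If an exception occurs while the new information is inserted, the transaction is rolled back and only the copy is affected.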
The candidate sets were generated using the SIRIUS software by Dr. Kai Dührkop (developer of SIRIUS). SIRIUS uses PubChem as its molecular structure DB and returns candidate sets limited to molecules with the ground truth molecular formula of the particular MassBank spectrum. It is important to note that neither the GUI nor the CLI tool of SIRIUS was used for the candidate set and MS² score generation. Instead, the non-public internal SIRIUS library was used, which allows the score prediction in a structure disjoint fashion. That is, for each MassBank spectrum, a CSI:FingerID (prediction backend of SIRIUS) model was used that was not trained on the spectrum's ground truth structure. This setting was chosen to prevent overfitting.
The following script call was used to generate the "SIRIUS ready" ms-files:
python massbank__2020.11__v0.6.1.sqlite sirius
A directory tools/sirius will be created with a sub-directory for each MassBank group (see Methods "Pre-processing pipeline for raw MassBank records") containing the ms-files (*.ms) for each group of original MassBank accessions (see Methods "Pre-computing the MS² matching scores"). For example, "AU22543794" in "AU_001" relates to the original MassBank accessions "AU300907", "AU300908", "AU300909", "AU300910" and "AU300911". The file tools/sirius/AU_001/ can be directly loaded into the SIRIUS software.
By calling:
python db/massbank__2020.11__v0.6.1.sqlite tool_output/sirius sirius_scores.tar.gz \
a copy of our initial SQLite DB is generated (massbank__with_sirius.sqlite) and the following information is added to the database:
- all (spectrum, candidate)-pairs generated by SIRIUS
- all (spectrum, candidate, MS² scores) for SIRIUS
- enriched candidate sets (see Methods "Generating the molecular candidate sets")
- (optional,
) the binary fingerprints for each candidate as used by SIRIUS
Note: This step requires a local PubChem SQLite DB.
As SIRIUS does not return scores for stereoisomers, we need to add them manually to the candidate sets. For that, we perform an inner merge between the candidates provided by SIRIUS and a local copy of PubChem on the first part of each candidate's InChIKey (e.g. FMGSKLZLMKYGDP in FMGSKLZLMKYGDP-UHFFFAOYSA-N):
ROW | InChIKey | SIRIUS MS² score |
IDX | InChIKey | SIRIUS MS² score |
after the merge.
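The merge on the first InChIKey part can be illustrated with a small in-memory SQLite example. This is only a sketch: the table names `sirius_scores` and `pubchem`, the column names, and all InChIKeys and scores below are made-up placeholders, not the schema or data of our DB. The first InChIKey part is taken as the first 14 characters of the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sirius_scores (inchikey TEXT, score REAL);
    CREATE TABLE pubchem (inchikey TEXT);
    """
)

# Placeholder InChIKeys (not real molecules). SIRIUS scored one structure per
# first InChIKey part.
conn.executemany(
    "INSERT INTO sirius_scores VALUES (?, ?)",
    [
        ("AAAAAAAAAAAAAA-UHFFFAOYSA-N", 0.9),
        ("BBBBBBBBBBBBBB-UHFFFAOYSA-N", 0.4),
    ],
)

# The local PubChem copy contains several stereoisomers sharing the first part.
conn.executemany(
    "INSERT INTO pubchem VALUES (?)",
    [
        ("AAAAAAAAAAAAAA-UHFFFAOYSA-N",),
        ("AAAAAAAAAAAAAA-XXXXXXXXSA-N",),
        ("AAAAAAAAAAAAAA-YYYYYYYYSA-N",),
        ("CCCCCCCCCCCCCC-UHFFFAOYSA-N",),
    ],
)

# Inner join on the first InChIKey part: every stereoisomer in PubChem inherits
# the score of the SIRIUS candidate with the same first part. Rows without a
# partner in the other table are dropped.
merged = conn.execute(
    """
    SELECT p.inchikey, s.score
    FROM pubchem AS p
    INNER JOIN sirius_scores AS s
        ON substr(p.inchikey, 1, 14) = substr(s.inchikey, 1, 14)
    ORDER BY p.inchikey
    """
).fetchall()

for inchikey, score in merged:
    print(inchikey, score)
```

In this toy example the three "AAAAAAAAAAAAAA" entries all receive the same SIRIUS score, while the "BBBBBBBBBBBBBB" and "CCCCCCCCCCCCCC" entries do not appear in the result because they lack a partner in the other table.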
Removal of records associated with the #152 pull-request in MassBank
As described in the Methods "Pre-processing pipeline for raw MassBank records", we remove a couple of MassBank records related to the "LU" datasets which were reported to have issues. For that, we determine which original "LU*" accessions were removed from MassBank between release 2020.11 (our baseline) and release 2021.3. We list our internal accession IDs in the file grouped_accessions_to_be_removed.txt. Entries in this list will not be imported into the massbank__with_sirius.sqlite database and hence are not part of our experiments.
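Filtering by such a removal list can be sketched as below. We assume here, purely for illustration, that the file holds one accession ID per line. The helper names and the accession IDs in the usage note are hypothetical.

```python
from typing import Iterable, List, Set


def load_removal_list(path: str) -> Set[str]:
    """Read one accession ID per line (assumed format), skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_accessions(accessions: Iterable[str], to_remove: Set[str]) -> List[str]:
    """Keep only the accessions that are not on the removal list."""
    return [acc for acc in accessions if acc not in to_remove]
```

For example, with a removal list containing the (hypothetical) ID "LU00000002", `filter_accessions(["LU00000001", "LU00000002"], to_remove)` returns only "LU00000001", so the listed record is never imported.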
Besides SIRIUS, we also used MetFrag and CFM-ID to rank the candidates using the MS² information.
- We created the required MetFrag input files, which will be written to
, using the following command:
python massbank__with_sirius.sqlite metfrag
- We generate the candidate set files as required by the MetFrag software:
python massbank__with_sirius.sqlite tools/metfrag --gzip
- We use the MetFrag software (v 2.4.5) for the in silico MS² scoring of the molecular candidates. A makefile (
) can be used to run the MetFrag software in parallel on multiple cores.
- We import the
of MetFrag as candidate scores into the database.
python massbank__with_sirius.sqlite tools/metfrag
- A new database file has been created:
, which is a copy of massbank__with_sirius.sqlite plus the MetFrag scores.
- We created the required CFM-ID input files, which will be written to
, using the following command:
python massbank__with_metfrag.sqlite cfmid4
- We generate the candidate set files as required by the CFM-ID software:
python massbank__with_metfrag.sqlite tools/cfmid4 --gzip --store_candidates_separately
- We use CFM-ID (v 4.0.7) for the in silico MS² spectra prediction. The CFM-ID developers provide pre-trained cross-validation (CV) models. That means we can predict the in silico MS² spectra for the candidate sets in a structure disjoint fashion. We use the "Metlin 2019 MSML" models. The
directories contain a list of the CFM-ID training molecules and their respective left-out CV id, which we use for the structure disjoint prediction. The spectra simulation is computationally very expensive, and performing it on a cluster is highly recommended. We provide a couple of scripts (tools/cfmid4/
) illustrating how this can be done on a cluster using the SLURM workload manager.
- We load the predicted spectra and compute their similarity with the corresponding measured spectrum. This similarity is used as the CFM-ID MS² candidate score:
python massbank__with_metfrag.sqlite tools/cfmid4
- A new database file has been created:
, which is a copy of massbank__with_metfrag.sqlite plus the CFM-ID scores.
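The idea behind the structure disjoint prediction can be sketched as a simple lookup: for a given structure, we pick the CV model whose training data left that structure out. The snippet below is an illustration under assumptions, not our actual implementation. In particular, identifying molecules by their first InChIKey part, the fallback to a default model for molecules outside the training set, and all names and values are hypothetical.

```python
from typing import Dict


def first_inchikey_part(inchikey: str) -> str:
    """Return the part of the InChIKey before the first hyphen."""
    return inchikey.split("-", 1)[0]


def select_cv_model(
    inchikey: str, left_out_fold: Dict[str, int], default_fold: int = 0
) -> int:
    """Return the id of the CV model that was trained without this structure.

    'left_out_fold' maps a training molecule (first InChIKey part) to the CV id
    in which it was left out. Molecules that are not part of the training data
    were seen by no model, so any model is structure disjoint for them; we
    (arbitrarily) fall back to 'default_fold'.
    """
    return left_out_fold.get(first_inchikey_part(inchikey), default_fold)


# Placeholder training list: molecule -> left-out CV id (made-up values).
left_out_fold = {"AAAAAAAAAAAAAA": 3, "BBBBBBBBBBBBBB": 7}

print(select_cv_model("AAAAAAAAAAAAAA-UHFFFAOYSA-N", left_out_fold))
print(select_cv_model("CCCCCCCCCCCCCC-UHFFFAOYSA-N", left_out_fold))
```

Here the first structure is assigned to the model with CV id 3, because that model did not see it during training, while the second structure is unknown to all models and receives the default.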
For our experiments we use three (3) different molecular feature representations. In the following we describe how those can be computed and added to the database.
Our Structured Support Vector Machine (SSVM) model uses the FCFP fingerprints, computed from isomeric SMILES, to represent the molecular candidates.
- Compute the counting FCFP (with and without chirality encoding, "2D" and "3D") fingerprints:
python massbank__with_cfmid.sqlite
The fingerprints are inserted into a copy of the DB:
- Convert the counting fingerprints into binarized counting vectors. The main purpose of this step is to speed up the kernel computation required for the SSVM. The binary representation still encodes the counts, but allows using the Tanimoto kernel instead of the MinMax kernel while yielding the same similarity values. For details on the implementation, the reader is referred to the "ssvm" library and the publication by Ralaivola et al. (2005):
python massbank__with_fcfp.sqlite FCFP
- A new database file has been created:
, which is a copy of massbank__with_fcfp.sqlite plus the binarized fingerprints.
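Why the binarized vectors give the same similarity values can be checked with a small example. The sketch below uses a unary ("thermometer") encoding, where each count c becomes c ones followed by zeros. That this is exactly the encoding used by the "ssvm" library is an assumption. The function names are hypothetical, and the equality only holds as long as no count exceeds `max_count`.

```python
from typing import List, Sequence


def binarize_counts(counts: Sequence[int], max_count: int) -> List[int]:
    """Encode each count c as c ones followed by (max_count - c) zeros."""
    bits: List[int] = []
    for c in counts:
        c = min(c, max_count)  # larger counts are clipped (loses information)
        bits.extend([1] * c + [0] * (max_count - c))
    return bits


def minmax_kernel(x: Sequence[int], y: Sequence[int]) -> float:
    """MinMax kernel on non-negative integer vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # convention chosen here for two zero vectors


def tanimoto_kernel(u: Sequence[int], v: Sequence[int]) -> float:
    """Tanimoto kernel on binary vectors: |u AND v| / |u OR v|."""
    both = sum(a & b for a, b in zip(u, v))
    either = sum(a | b for a, b in zip(u, v))
    return both / either if either else 1.0


x = [3, 0, 1, 2]
y = [1, 2, 1, 0]
bx = binarize_counts(x, max_count=3)
by = binarize_counts(y, max_count=3)

print(minmax_kernel(x, y), tanimoto_kernel(bx, by))
```

For each fingerprint dimension, the number of positions where both binary blocks are one equals the minimum of the two counts, and the number of positions where at least one is one equals the maximum. Summing over all dimensions therefore reproduces the MinMax numerator and denominator, while the binary vectors allow the use of fast bit-wise or sparse matrix operations.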
Bouwmeester et al. (2019) defined a set of molecular descriptors, computed using RDKit, that they found useful for retention time (RT) prediction. We use those features for one of our comparison methods and therefore add them to our DB.
- Compute the descriptors and add them to the DB
python massbank__with_binary_fcfp.sqlite
- A new database file has been created:
, which is a copy of massbank__with_binary_fcfp.sqlite plus the molecular descriptors.
Bach et al. (2020) used substructure counting fingerprints to represent molecules in their RankSVM model. As we compare with their approach for MS² and RT score integration, we add their features to our DB. We pre-computed the fingerprints using the CDK package, and the fingerprint vectors are stored in substructure_fingerprints/candidates___SMILES_ISO.tsv.gz. Please note that the pre-computed fingerprints are limited to the molecules in our current candidate sets.
- Import the pre-computed substructure fingerprints:
python massbank__with_descriptors.sqlite substructure_fingerprints/candidates___SMILES_ISO.tsv.gz
- A new database file has been created:
, which is a copy of massbank__with_descriptors.sqlite plus the substructure fingerprints.
We evaluate all methods, including our SSVM model, using multiple sets (or sequences) of (MS², RT)-tuples sampled for each dataset / MassBank group in our data. We refer the reader to the Methods part of our paper, which explains the sampling procedure in detail. Each tuple set has about 50 MS features (i.e. (MS², RT)-tuples). Depending on the amount of available data in each dataset, the number of sampled sets differs. We distinguish two sampling scenarios: "default" (FULLDATA in the paper) and "with_stereo" (ONLYSTEREO in the paper). Again, the interested reader is encouraged to read the corresponding method descriptions in our manuscript.
- Generate the evaluation LC-MS² experiments for both scenarios:
python massbank__with_substructure_fps.sqlite
- A new database file has been created:
, which is a copy of massbank__with_substructure_fps.sqlite plus the LC-MS² experiments (indices of evaluation records).
In our experiments, we analyze the performance impact of our SSVM model on specific molecular classes, based on two (2) different classification systems. The first one is ClassyFire, which assigns classes to molecules based on their structure. The second classification system is taken from PubChemLite, which is based on literature information on the usage of certain molecules in certain contexts.
- Import the ClassyFire classes to the DB (no backup created):
Rscript insert_classyfire_classes.R massbank__with_substructure_fps.sqlite
- Download the PubChemLite DB (v0.3.0)
- Insert the PubChemLite classifications into the DB:
python massbank__with_substructure_fps.sqlite /path/to/pubchemlite.csv
- A new database file has been created: massbank__with_pubchemlite.sqlite
, which is a copy of
plus the PubChemLite classification.