The Antibody Domainbed dataset has been designed to
- incorporate state-of-the-art (SOTA) approaches in domain generalization to build robust predictors for the binding properties of biological sequences
- assess domain generalization algorithms in a real-world problem inspired by therapeutic antibody discovery.
The dataset (sequence + structure) and corresponding backbones have been seamlessly integrated within the largest out-of-distribution (OOD) benchmark - Domainbed.
- Natasa Tagasovska, Prescient/MLDD, Genentech (Owner / Contributor / Manager)
- Ji Won Park, Prescient/MLDD, Genentech (Owner / Contributor / Manager)
- Matthieu Kirchmayer, Prescient/MLDD, Genentech (Owner / Contributor / Manager)
Genentech
Corporate (Biotech, Pharma)
- Publishing POC: Natasa Tagasovska, Ji Won Park
- Affiliation: Prescient/MLDD, Genentech
- Contact: natasa.tagasovska@roche.com, jiwon.park@gene.com
Genentech, Roche
- Natasa Tagasovska, Principal ML Scientist, Prescient Design, Genentech, 2024
- Ji Won Park, Principal ML Scientist, Prescient Design, Genentech, 2024
- Matthieu Kirchmayer, Senior ML Scientist, Prescient Design, Genentech, 2024
- Nathan C. Frey, Principal ML Scientist, Prescient Design, Genentech, 2024
- Andrew Martin Watkins, Director of computational biology, Prescient Design, Genentech, 2024
- Aya Abdelsalam Ismail, Senior ML Scientist, Prescient Design, Genentech, 2024
- Arian Rokkum Jamasb, Senior ML Scientist, Prescient Design, Genentech, 2024
- Edith Lee, Data Engineer, Prescient Design, Genentech, 2024
- Tyler Bryson, Data Engineer, Prescient Design, Genentech, 2024
- Stephen Ra, Director of Frontier Research, Prescient Design, Genentech, 2024
- Kyunghyun Cho, Senior Director of Frontier Research, Prescient/MLDD, Genentech and NYU Center for Data Science, 2024
- Genentech
- Synthetically generated antibodies are represented in the corresponding sequence and structure format.
Category | Data |
---|---|
Size of Dataset | 12.9 MB |
Number of Instances | 10 669 |
Number of Fields | 14 |
Labeled Classes | 2 |
Number of Labels | 10 669 |
Number of environments | 5 |
Number of genrative models | 5 |
Number of wild types (seeds) | 179 |
Number of antigens | 3 |
A given molecule has good therapeutic properties if it can tightly bind to an Antigen (target) of interest. Since wet-lab binding measurements are expensive (both in terms of money and time), domain experts rely on computational properties such as Gibbs energy which has been shown to be indicative of binding. We use published, ML sota models for generating a library of in-vitro functional synthetic antibodies. We condition the generation with an initial wild type, i.e. seeds, which were selected from an existing database of experimentally obtained antibody-antigen complexes. Then, each record in our dataset corresponds to the difference in energy as computed by Rosetta scoring function ref2015, between the original Ab-Ag complex and the AbMutated-Ag complex. The difference in energy can be interpreted as 'decrease in energy' signaling better/stronger binding of the two molecules. For each new design in our database we record the source (generative model), antigen, heavy and light chain with both sequence and structure representation. We split the datapoints into environments emulating active drug design cycles. The environment labels are intended to help DG algorithms and they can also be modified by users of the datasets to mimic a setup corresponding to their needs.
- this is a synthetic sequence library, there is no sensitive data.
- No Known Risks
Actively Maintained - No new versions will be made available, however, this dataset will be actively maintained, including but not limited to updates to the data and extending with new data representations.
Current Version: 2.0
Last Updated: 05/2024
Release Date: 05/2024
Versioning: New versions will include adding new instances, new data transformations or new data representations.
Updates: Updates may include addiitonal meta data or documentation.
Errors: Any reported errors, if affecing the dataset, will result with and update and new version.
- Sequence
- pdb structures (adjusting the original SabDab complexes by applying InterfaceAnalyzerMover from Rosetta).
Field Name | Field Value | Description |
---|---|---|
seq id | a07434d97d68b31f55da2efa851e77c6 | a unique design identifier |
FV Heavy | QVRLAQYGGGVKRLGASMTLSCVASGYTFNDYYIHWVRQAPGQGFELLGYIDPANGRPDYAGALRERLSFYRDKSMETLYMDLRSLRYDDTAMYYCVRNVGTAGSLLHYDHWGSGSPVIVSS | variable framework region, heavy chain sequence |
FV Light | EIVLTQSPATLSASPGERVTLTCKASRSVGNNVAWYQHKGGQSPRLLIYDASTRAAGVPARFSGSASGTEFTLAISNLESEDFTVYFCLQYNNWWTFGQGTRVDIK | variable framework region light chain sequence |
FV Heavy aHo | QVRLAQY-GGGVKRLGASMTLSCVASG-YTFND-----YYIHWVRQAPGQGFELLGYIDPA---NGRPDYAGALRERLSFYRDKSMETLYMDLRSLRYDDTAMYYCVRNVGTAG-----------------SLLHYDHWGSGSPVIVSS | heavy chain sequence after aHo alignment |
FV Light aHo | EIVLTQSPATLSASPGERVTLTCKAS--RSVG------NNVAWYQHKGGQSPRLLIYD--------ASTRAAGVPARFSGSASG--TEFTLAISNLESEDFTVYFCLQYNN------------------------ | light chain sequence after aHo alignment |
pdb | 2adf | pdf identifier of initial Ab-Ag complex |
ddG | 2.49 | difference in energies of the initial complex and the complex after mutations have been applied |
is_more_stable | 0 | binary label stating if the designed complex is more stable (indicating better binding) than the initial complex or not |
source | WJS 0.5 | generative model for the design - walk jump sampler with noise level 0.5 |
antigen | HIV1 | antigen target of interest |
environment | env0 | categorical variable representing the environemnet (design cycle) |
edist seed | 4 | edit distance between initial antibody (seed) and proposed design |
Above: Example of an actual data point with descriptors
- Research
- ML for drug design
- DG algorithms in science applications
Out-of-distribution Generalization
, Distribution shift
,
Drug design
,proteins
, antibodies
.
- ML models are prone to leach on spurious correlations to maxmize performance, which leads to poor results when the same models are faced with OOD data. DG could help aleviate this issue.
- DG algorithms have been tested mostly on image datasets and in part, this might be due to the gap between DG algorithms and science datasets. This is the gap we aim to fill with Antibody Domainbed.
-
Evaluating existing and new DG algorithms.
-
Evaluating existing and new protein property predictors.
-Pretraining/Fine-tuning property predictors on Antibody Domainbed.
-
Custom split into new set of environments to evaluate environment discovery algorithms.
-
Extending the dataset by computing additional properties for the antibodies (hydrophobicity, charge, humanness etc).
-
None of the included molecules have been tested in wet-lab experiments, hence they are not advised to be used as therapeutic proteins without consulting domain experts.
-
It is not advised to use the included pdbs of the synthetic designs for computing the impact on energy by further mutations, due to error propagation. Please use the initial/seed pdb files if applying new mutations to the datapoints in Antibody domainbed.
- Can we learn invariant/causal features that drive the binding of molecular complexes?
- Do DG algorithms pick up physically meaningful features?
- Does heterogenous data help in developing more roust predictors?
- Can we trust ML models for predicting properties for new/unseen targtes?
Citation: Tagasovska, N., Park, J. W., Kirchmeyer, M., Frey, N., Watkins, A., Abdelsalam Ismail, A., Jamasb, A. R., Lee, E., Bryson, T., Ra, S., & Cho, K. (2024). Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design. https://doi.org/10.5281/zenodo.11446107
BiBTeX:
@article{abdb2024, title={Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design}, author={Tagasovska, N., Park, J. W., Kirchmeyer, M., Frey, N., Watkins, A., Abdelsalam Ismail, A., Jamasb, A. R., Lee, E., Bryson, T., Ra, S., Cho, K}, journal={in submission}, year={2024}, }
- External - Open Access Antibody Domainbed is made available under the CC BY 4.0 license. A copy of the license is provided with the dataset. The authors bear all responsibility in case of violation of rights.
- No prerequisits for using the dataset.
- Artificially Generated Dataset
The primary goal of this benchmark is to emulate a real-world active drug design setup. To create a realistic and publicly accessible dataset, we followed these steps: (1) Collect open data on antibody-antigen complex structures; (2) Train generative models to sample antibodies with varying statistics (e.g., edit distances from training data, targets of interest, different seed complexes); (3) Compute binding proxies using physics-based models for all designs from step (2); and (4) Split the labeled dataset into meaningful environments for the application of domain generalization (DG) algorithms. In the full manuscript, we show that our antibodies span reasonable ranges in various metrics, including biophysical properties (e.g., hydrophobicity, aromaticity) and distributional scores evaluating proximity to a database of natural antibodies (e.g., naturalness).
We select antibodies associated with the antigens HIV1, SARS‑CoV‑2, and HER2. For Environments (Env) 0-3, we post-process paired antibody-antigen structures from the latest version (released in 2022) of the Structural Antibody Database (SAbDab). Antibody sequences corresponding to 178 SAbDab structures for the three antigens serve as the starting seeds for our sequence-based generative models (see Step 2), and we modify the structures based on mutations made by the model to compute the labels (see Step 3). For Env 4, we use a recent derivative, Graphinity, which extends SAbDab to a synthetic dataset of a much larger scale (~1M) by introducing systematic point mutations in the CDR3 loops of the antibodies in the SAbDab complexes.
To emulate the active drug discovery pipeline, we need a suite of generative models for sampling new candidate designs for therapeutic antibodies. We run the Walk Jump Sampler, a method building on the neural empirical Bayes framework.
WJS separately trains score- and energy-based models to learn noisy data distributions and sample discrete data. The energy-based model is trained on noisy samples, which means that by training with different noise levels
As lab assays to experimentally measure binding affinity are prohibitively expensive, we use computational frameworks which, by modeling changes in binding free energies upon mutation (interface
To emulate the active drug design setup where distribution shifts may appear in each design cycle (e.g., due to new generative models, antigen targets, or experimental assays), we split the overall dataset into five environments. In Table 1 in the manuscript we summarize our environment definitions as well as key summary statistics of each environment. For Env 0-3, the dataset split mimics sub-population shifts due to the generative model, which produces antibody sequences with different edit distances from the seed sequences. The WJS model with
Is this source considered sensitive or high-risk? [No]
Dates of Collection: [07 2023 - 12 2023]
Primary modality of collection data:
- Therapeutic protein sequences and structures.
- Source: Structural Antibody Database, SAbDab published in 2022.
- Source: Grafinity - Antibody-Antigen ∆∆G Prediction 2023.
We used these two sources to curate the set of antibodies for training our generative models and use some hold out datasets to represet initial points, i.e. seeds.
-
The structures used in the computation of changes in binding energy were derived using computational tools rather than experimentally measured structures, which are very expensive to obtain. Hence, our dataset does include some biases inherent to the computational methods.
-
We don't include the structures for env4, however, they can be found in the original repository of the Grafinity dataset.
- Release date: 05/2024
- Link to dataset: Zenodo version 1
- Status: [Actively Maintained]
- Size of Dataset: 12.9 MB (sequence);
- Number of Instances: 10 669
- Release date: 05/2024
- Link to dataset: Zenodo version 2
- Status: [Actively Maintained]
- Size of Dataset: 12.9 MB (sequence); 1.9 GB (structure)
- Number of Instances: 10 669