This is the official implementation of the paper JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling.
Conformational ensembles of protein structures are immensely important both for understanding protein function and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient, or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN enables ensemble generation orders of magnitude faster than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN is able to predict the stable basins of small peptides that were not seen during training.
Clone the repository with HTTPS:
git clone https://github.com/prescient-design/jamun.git
or SSH:
git clone git@github.com:prescient-design/jamun.git
Navigate to the cloned repository:
cd jamun
We recommend creating a mamba or conda environment, because certain dependencies are tricky to install directly.
conda create --name jamun python=3.11 -y
conda activate jamun
conda install -c conda-forge ambertools=23 openmm pdbfixer pyemma -y
conda install pulchra -c bioconda -y
The remaining dependencies can be installed via pip or uv (recommended).
uv pip install -e ".[dev]"
The uncapped 2AA data from Timewarp can be obtained from Hugging Face.
cd /path/to/data/root/
git lfs install
git clone https://huggingface.co/datasets/microsoft/timewarp
where /path/to/data/root/ is the path where you want to store the datasets.
This should be your directory structure:
/path/to/data/root/
└── timewarp/
    ├── 2AA-1-big/
    │   └── ...
    ├── 2AA-1-large/
    │   └── ...
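The layout can be checked before training with a small script. The following is a minimal sketch, not part of JAMUN: the helper name missing_timewarp_dirs is hypothetical, and it only checks the two subdirectories shown in the tree above (the real dataset may contain more).

```python
from pathlib import Path

# Subdirectories shown in the tree above; the real dataset may contain more.
EXPECTED_SUBDIRS = ("2AA-1-big", "2AA-1-large")


def missing_timewarp_dirs(data_root):
    """Return the expected Timewarp subdirectories that are absent under data_root."""
    timewarp = Path(data_root) / "timewarp"
    return [name for name in EXPECTED_SUBDIRS if not (timewarp / name).is_dir()]
```

An empty list means the two directories are in place; any names returned are the ones still missing, which usually points to an incomplete git lfs clone or a wrong data root.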
Now, set the environment variable JAMUN_DATA_PATH:
export JAMUN_DATA_PATH=/path/to/data/root/
or create a .env file in the root of the repository and set JAMUN_DATA_PATH:
JAMUN_DATA_PATH=/path/to/data/root/
Set the environment variable JAMUN_ROOT_PATH (default: current directory) to specify where outputs from training and sampling are saved:
export JAMUN_ROOT_PATH=...
or in the .env file in the root of the repository:
JAMUN_ROOT_PATH=...
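To see which values would be picked up, the two variables can be resolved by hand. This is a minimal sketch under stated assumptions, not JAMUN's own loading code: the helper names read_env_file and resolve_jamun_paths are hypothetical, the .env file is assumed to contain plain KEY=VALUE lines, and exported variables are assumed to take precedence over the .env file.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_jamun_paths(env_file=".env", environ=None):
    """Return (data_path, root_path); exported variables win (an assumption)."""
    environ = os.environ if environ is None else environ
    file_values = read_env_file(env_file) if Path(env_file).is_file() else {}
    data_path = environ.get("JAMUN_DATA_PATH", file_values.get("JAMUN_DATA_PATH"))
    if data_path is None:
        raise RuntimeError("JAMUN_DATA_PATH is not set")
    # JAMUN_ROOT_PATH defaults to the current directory, as stated above.
    root_path = environ.get(
        "JAMUN_ROOT_PATH", file_values.get("JAMUN_ROOT_PATH", os.getcwd())
    )
    return data_path, root_path
```

Calling resolve_jamun_paths() from the repository root returns the data path and the output path, and raises if JAMUN_DATA_PATH is set in neither place.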
Once you have downloaded the data and set the appropriate variables correctly, you can start training on Timewarp.
We recommend first running our test config (on one GPU) to check that installation was successful:
CUDA_VISIBLE_DEVICES=0 jamun_train --config-dir=configs experiment=train_test.yaml
Then, you can train on the uncapped 2AA peptides dataset:
jamun_train --config-dir=configs experiment=train_uncapped_2AA.yaml
or the uncapped 4AA peptides dataset:
jamun_train --config-dir=configs experiment=train_uncapped_4AA.yaml
We also provide example SLURM launcher scripts for training and sampling on SLURM clusters:
sbatch scripts/slurm/train.sh
sbatch scripts/slurm/sample.sh
We provide trained models (for both sampling and restarting training) for Timewarp 2AA, Timewarp 4AA, MDGen 4AA, and other datasets at https://huggingface.co/ameya98/JAMUN:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ameya98/JAMUN
If you want to test out your own trained model, specify either the wandb_train_run_path (in the form entity/project/run_id, which can be obtained from the Overview tab in the Weights and Biases UI for your training run) or the checkpoint_dir of the trained model.
jamun_sample ... ++wandb_train_run_path=[WANDB_TRAIN_RUN_PATH]
jamun_sample ... ++checkpoint_dir=[CHECKPOINT_DIR]
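When launching sampling from a script, the choice between the two options can be validated up front. The sketch below is hypothetical and not part of JAMUN: the function name checkpoint_override is an illustration, it only builds the override string shown above, and it assumes that exactly one of the two options should be given.

```python
def checkpoint_override(wandb_train_run_path=None, checkpoint_dir=None):
    """Return the single ++key=value override that selects the trained model."""
    # Assumption: exactly one of the two options is supplied.
    if (wandb_train_run_path is None) == (checkpoint_dir is None):
        raise ValueError("Specify exactly one of wandb_train_run_path or checkpoint_dir")
    if checkpoint_dir is not None:
        return f"++checkpoint_dir={checkpoint_dir}"
    # The run path must have the form entity/project/run_id.
    parts = wandb_train_run_path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("wandb_train_run_path must look like entity/project/run_id")
    return f"++wandb_train_run_path={wandb_train_run_path}"
```

The returned string is meant to be appended to the rest of the jamun_sample arguments; a malformed run path fails here rather than partway through a job.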
If you want to sample conformations for a particular peptide sequence, you need to first generate a .pdb file.
We provide a script that uses AmberTools, specifically tleap. If you have a .pdb file already, then you can skip this step.
Run:
python scripts/prepare_pdb.py [SEQUENCE] --mode [MODE] --outputdir [OUTPUTDIR]
where SEQUENCE is your peptide sequence entered as a string of one-letter codes (e.g. AGPF) or a string of hyphenated three-letter codes (e.g. ALA-GLY-PRO-PHE), MODE is either capped (to add capping ACE and NME residues) or uncapped, and OUTPUTDIR is where your generated .pdb file will be saved (default is current directory).
The script will print out the path to the generated .pdb file, INIT_PDB.
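The two accepted SEQUENCE formats describe the same residues, which the following sketch makes explicit. It is an illustration only and not the parsing used by scripts/prepare_pdb.py: the function name to_three_letter is hypothetical, and it assumes the 20 standard amino acids.

```python
# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
THREE_LETTER = set(ONE_TO_THREE.values())


def to_three_letter(sequence):
    """Normalise either SEQUENCE format to a list of three-letter codes."""
    seq = sequence.strip().upper()
    if "-" in seq:
        # Hyphenated three-letter codes, e.g. ALA-GLY-PRO-PHE.
        residues = seq.split("-")
        unknown = [r for r in residues if r not in THREE_LETTER]
    else:
        # No hyphen: every character is read as a one-letter code,
        # so "ALA" alone means A, L, A rather than a single alanine.
        unknown = [c for c in seq if c not in ONE_TO_THREE]
        residues = [ONE_TO_THREE[c] for c in seq if c in ONE_TO_THREE]
    if unknown or not residues:
        raise ValueError(f"Unrecognised residues in {sequence!r}: {unknown}")
    return residues
```

Both to_three_letter("AGPF") and to_three_letter("ALA-GLY-PRO-PHE") give the same four residues, so either form identifies the same peptide.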
Run the sampling script, starting from the provided .pdb structure:
jamun_sample --config-dir=configs experiment=sample_custom ++init_pdb=[INIT_PDB]
We also provide some configs to sample from the uncapped 2AA and 4AA peptides from the test set in Timewarp.
jamun_sample --config-dir=configs experiment=sample_uncapped_2AA.yaml checkpoint_dir=...
jamun_sample --config-dir=configs experiment=sample_uncapped_4AA.yaml checkpoint_dir=...
We provide scripts for analysing JAMUN and original MD trajectories at https://github.com/prescient-design/jamun/tree/main/analysis.
We provide scripts for generating MD simulation data with OpenMM, including energy minimization and calibration steps with NVT and NPT ensembles.
python scripts/generate_data/run_simulation.py [INIT_PDB]
The defaults correspond to our setup for the capped diamines.
Please run this script with the -h flag to see all simulation parameters.
Some datasets require preprocessing before use, for example the MDGen data:
source .env
python scripts/process_mdgen.py \
--input-dir ${JAMUN_DATA_PATH}/mdgen \
--output-dir ${JAMUN_DATA_PATH}/mdgen/data/4AA_sims_partitioned_chunked
If you found this repository useful, please cite our preprint!
@misc{daigavane2024jamuntransferablemolecularconformational,
title={JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling},
author={Ameya Daigavane and Bodhi P. Vani and Saeed Saremi and Joseph Kleinhenz and Joshua Rackers},
year={2024},
eprint={2410.14621},
archivePrefix={arXiv},
primaryClass={physics.bio-ph},
url={https://arxiv.org/abs/2410.14621},
}