JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling

This is the official implementation of the paper JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling.

Conformational ensembles of protein structures are immensely important both for understanding protein function and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN generates ensembles orders of magnitude faster than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN is able to predict the stable basins of small peptides that were not seen during training.

Setup

Clone the repository with HTTPS:

git clone https://github.com/prescient-design/jamun.git

or SSH:

git clone git@github.com:prescient-design/jamun.git

Navigate to the cloned repository:

cd jamun

We recommend creating a mamba or conda environment, because certain dependencies are tricky to install directly.

conda create --name jamun python=3.11 -y
conda activate jamun
conda install -c conda-forge ambertools=23 openmm pdbfixer pyemma -y
conda install pulchra -c bioconda -y

The remaining dependencies can be installed via pip or uv (recommended).

uv pip install -e ".[dev]"

Data

The uncapped 2AA data from Timewarp can be obtained from Hugging Face.

cd /path/to/data/root/
git lfs install
git clone https://huggingface.co/datasets/microsoft/timewarp

where /path/to/data/root/ is the path where you want to store the datasets.

This should be your directory structure:

/path/to/data/root/
└── timewarp/
    ├── 2AA-1-big/
    │   └── ...
    ├── 2AA-1-large/
    │   └── ...
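
As a quick sanity check of the layout above, a minimal Python sketch is shown below. The helper name missing_timewarp_dirs is hypothetical and not part of JAMUN, and the list of sub-directories contains only the two names visible in the tree; the actual dataset may contain more.

```python
from pathlib import Path

# Sub-directories shown in the tree above; the real dataset may contain others.
EXPECTED_SUBDIRS = ("2AA-1-big", "2AA-1-large")


def missing_timewarp_dirs(data_root: str) -> list[str]:
    """Return the expected sub-directories that are absent under <data_root>/timewarp."""
    timewarp = Path(data_root) / "timewarp"
    return [name for name in EXPECTED_SUBDIRS if not (timewarp / name).is_dir()]
```

Calling missing_timewarp_dirs("/path/to/data/root/") returns an empty list when both directories are present.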

Now, set the environment variable JAMUN_DATA_PATH:

export JAMUN_DATA_PATH=/path/to/data/root/

Alternatively, create a .env file in the root of the repository and set JAMUN_DATA_PATH there:

JAMUN_DATA_PATH=/path/to/data/root/

Set the environment variable JAMUN_ROOT_PATH (default: current directory) to specify where outputs from training and sampling are saved:

export JAMUN_ROOT_PATH=...

or in the .env file in the root of the repository:

JAMUN_ROOT_PATH=...

Training

Once you have downloaded the data and set the environment variables above, you can start training on Timewarp.

We recommend first running our test config (on one GPU) to check that installation was successful:

CUDA_VISIBLE_DEVICES=0 jamun_train --config-dir=configs experiment=train_test.yaml

Then, you can train on the uncapped 2AA peptides dataset:

jamun_train --config-dir=configs experiment=train_uncapped_2AA.yaml

or the uncapped 4AA peptides dataset:

jamun_train --config-dir=configs experiment=train_uncapped_4AA.yaml

We also provide example SLURM launcher scripts for training and sampling on SLURM clusters:

sbatch scripts/slurm/train.sh
sbatch scripts/slurm/sample.sh

Inference

Loading Trained Models

We provide trained models (for both sampling and resuming training) for Timewarp 2AA, Timewarp 4AA, MDGen 4AA and other datasets at https://huggingface.co/ameya98/JAMUN:

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ameya98/JAMUN

If you want to test out your own trained model, either specify the wandb_train_run_path (in the form entity/project/run_id, which can be obtained from the Overview tab in the Weights and Biases UI for your training run), or the checkpoint_dir of the trained model.

jamun_sample ... ++wandb_train_run_path=[WANDB_TRAIN_RUN_PATH]
jamun_sample ... ++checkpoint_dir=[CHECKPOINT_DIR]

Sampling Conformations for a Peptide Sequence

To sample conformations for a particular peptide sequence, you first need a .pdb file.

We provide a script that generates one using tleap from AmberTools. If you already have a .pdb file, you can skip this step.

Generate `.pdb` file

Run:

python scripts/prepare_pdb.py [SEQUENCE] --mode [MODE] --outputdir [OUTPUTDIR]

where SEQUENCE is your peptide sequence entered as a string of one-letter codes (e.g. AGPF) or a string of hyphenated three-letter codes (e.g. ALA-GLY-PRO-PHE), MODE is either capped (adds capping ACE and NME residues) or uncapped, and OUTPUTDIR is where your generated .pdb file will be saved (default is the current directory). The script prints the path to the generated .pdb file, referred to below as INIT_PDB.
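
The two accepted sequence formats are related by the standard amino acid code table. The sketch below converts either format to the hyphenated three-letter form. It is an illustration only: to_three_letter is a hypothetical helper, it covers just the 20 standard residues, and whether scripts/prepare_pdb.py uses the same table internally is not stated here.

```python
# Standard one-letter to three-letter codes for the 20 canonical amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
THREE_LETTER_CODES = set(ONE_TO_THREE.values())


def to_three_letter(sequence: str) -> str:
    """Normalise 'AGPF' or 'ala-gly-pro-phe' to 'ALA-GLY-PRO-PHE'."""
    sequence = sequence.strip().upper()
    if "-" in sequence:
        # Already hyphenated: validate each three-letter code.
        codes = sequence.split("-")
        unknown = [code for code in codes if code not in THREE_LETTER_CODES]
    else:
        # One-letter string: validate, then translate each residue.
        unknown = [letter for letter in sequence if letter not in ONE_TO_THREE]
        codes = [ONE_TO_THREE[letter] for letter in sequence if letter in ONE_TO_THREE]
    if unknown or not codes:
        raise ValueError(f"unrecognised residues in {sequence!r}: {unknown}")
    return "-".join(codes)
```

With this table, to_three_letter("AGPF") gives "ALA-GLY-PRO-PHE", matching the example above.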

Run sampling on `.pdb`

Run the sampling script, starting from the provided .pdb structure:

jamun_sample --config-dir=configs experiment=sample_custom ++init_pdb=[INIT_PDB]
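
When sampling from several starting structures, the same command can be assembled programmatically. The following sketch only builds and prints the command lines using the flags shown above; sample_command and format_commands are hypothetical helpers, and paths containing characters that are special to the config override syntax may need extra quoting.

```python
import shlex


def sample_command(init_pdb: str, config_dir: str = "configs") -> list[str]:
    """Build the argument list for one jamun_sample run, using only the flags shown above."""
    return [
        "jamun_sample",
        f"--config-dir={config_dir}",
        "experiment=sample_custom",
        f"++init_pdb={init_pdb}",
    ]


def format_commands(pdb_paths: list[str]) -> list[str]:
    """Return shell-quoted command lines, one per starting structure."""
    return [shlex.join(sample_command(path)) for path in pdb_paths]
```

Each argument list can be passed to subprocess.run(..., check=True) to launch the runs one after another.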

Sampling Test Peptides from Timewarp

We also provide some configs to sample from the uncapped 2AA and 4AA peptides from the test set in Timewarp.

jamun_sample --config-dir=configs experiment=sample_uncapped_2AA.yaml checkpoint_dir=...

jamun_sample --config-dir=configs experiment=sample_uncapped_4AA.yaml checkpoint_dir=...

Analysis

We provide scripts for analysing JAMUN and original MD trajectories in the analysis directory: https://github.com/prescient-design/jamun/tree/main/analysis

Data Generation

Running Molecular Dynamics with OpenMM

We provide scripts for generating MD simulation data with OpenMM, including energy minimization and equilibration steps in the NVT and NPT ensembles.

python scripts/generate_data/run_simulation.py [INIT_PDB]

The defaults correspond to our setup for the capped diamines. Please run this script with the -h flag to see all simulation parameters.

Preprocessing

Some datasets require preprocessing before use, for example the MDGen data:

source .env
python scripts/process_mdgen.py \
  --input-dir ${JAMUN_DATA_PATH}/mdgen \
  --output-dir ${JAMUN_DATA_PATH}/mdgen/data/4AA_sims_partitioned_chunked

Citation

If you found this repository useful, please cite our preprint!

@misc{daigavane2024jamuntransferablemolecularconformational,
      title={JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling},
      author={Ameya Daigavane and Bodhi P. Vani and Saeed Saremi and Joseph Kleinhenz and Joshua Rackers},
      year={2024},
      eprint={2410.14621},
      archivePrefix={arXiv},
      primaryClass={physics.bio-ph},
      url={https://arxiv.org/abs/2410.14621},
}
