JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling

This is the official implementation of the paper JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling.

Figure: JAMUN results on capped 2AA peptides.

Conformational ensembles of protein structures are immensely important both for understanding protein function and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN generates ensembles orders of magnitude faster than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN predicts the stable basins of small peptides that were not seen during training.

Figure: Overview of walk-jump sampling in JAMUN.
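
For intuition, walk-jump sampling alternates a "walk" (Langevin sampling of a noise-smoothed density) with a "jump" (denoising back towards clean samples). The sketch below is a toy 1D illustration only, not JAMUN's implementation: the Gaussian target, the constants `MU`, `S`, `SIGMA`, and all function names are assumptions made for this example, and the score is analytic here rather than coming from a trained model over point clouds.

```python
import math
import random

# Toy 1D target (an assumption for illustration): x ~ N(MU, S**2).
MU, S = 2.0, 0.5
SIGMA = 1.0                 # noise level used to smooth the target
VAR_Y = S**2 + SIGMA**2     # y = x + SIGMA * eps, so y ~ N(MU, VAR_Y)


def score(y: float) -> float:
    """Gradient of the log-density of the smoothed variable y (analytic here)."""
    return -(y - MU) / VAR_Y


def walk(y: float, n_steps: int, step: float, rng: random.Random) -> float:
    """Walk: unadjusted overdamped Langevin dynamics on the smoothed density."""
    for _ in range(n_steps):
        y = y + step * score(y) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return y


def jump(y: float) -> float:
    """Jump: denoise with Tweedie's formula, x_hat = y + SIGMA**2 * score(y)."""
    return y + SIGMA**2 * score(y)


def walk_jump(n_chains: int = 500, n_steps: int = 200,
              step: float = 0.05, seed: int = 0) -> list[float]:
    """Run independent chains from y = 0 and return one denoised sample each."""
    rng = random.Random(seed)
    return [jump(walk(0.0, n_steps, step, rng)) for _ in range(n_chains)]


samples = walk_jump()
mean = sum(samples) / len(samples)  # should land close to MU
```

Because the walk runs on the smoothed density, it can take larger steps than sampling the original density directly; the jump then maps each noisy sample to its denoised estimate (the posterior mean of x given y).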

Setup

Clone the repository with HTTPS:

git clone https://github.com/prescient-design/jamun.git

or SSH:

git clone git@github.com:prescient-design/jamun.git

Navigate to the cloned repository:

cd jamun

We recommend creating a mamba or conda environment, because certain dependencies are tricky to install directly.

conda create --name jamun python=3.11 -y
conda activate jamun
conda install -c conda-forge ambertools=23 openmm pdbfixer pyemma -y
conda install pulchra -c bioconda -y

The remaining dependencies can be installed via pip or uv (recommended).

uv pip install -e ".[dev]"

Data

The uncapped 2AA data from Timewarp can be obtained from Hugging Face.

cd /path/to/data/root/
git lfs install
git clone https://huggingface.co/datasets/microsoft/timewarp

where /path/to/data/root/ is the path where you want to store the datasets.

This should be your directory structure:

/path/to/data/root/
└── timewarp/
    ├── 2AA-1-big/
    │   └── ...
    ├── 2AA-1-large/
    │   └── ...
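
To sanity-check the download against the layout above, a minimal Python sketch could look like the following. It is not part of JAMUN: `missing_timewarp_dirs` and `EXPECTED_SUBDIRS` are hypothetical names, and only the two subdirectories shown in the listing are checked.

```python
from pathlib import Path

# Subdirectory names copied from the listing above; the real dataset may contain more.
EXPECTED_SUBDIRS = ("2AA-1-big", "2AA-1-large")


def missing_timewarp_dirs(data_root: str) -> list[str]:
    """Return the expected Timewarp subdirectories that are absent under data_root."""
    timewarp = Path(data_root) / "timewarp"
    return [name for name in EXPECTED_SUBDIRS if not (timewarp / name).is_dir()]
```

An empty list from `missing_timewarp_dirs("/path/to/data/root/")` means both directories were found.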

Now, set the environment variable JAMUN_DATA_PATH:

export JAMUN_DATA_PATH=/path/to/data/root/

or, create a .env file in the root of the repository and set JAMUN_DATA_PATH:

JAMUN_DATA_PATH=/path/to/data/root/

Set the environment variable JAMUN_ROOT_PATH (default: current directory) to specify where outputs from training and sampling are saved:

export JAMUN_ROOT_PATH=...

or in the .env file in the root of the repository:

JAMUN_ROOT_PATH=...
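
To confirm what these two variables resolve to before launching a run, something like the sketch below may help. It is an illustration, not JAMUN's own loading code: `read_env_file` and `resolve_paths` are hypothetical helpers, the parser handles only plain KEY=VALUE lines, and letting exported variables take precedence over the .env file is an assumption.

```python
import os
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_paths(env_file: str = ".env") -> tuple[Path, Path]:
    """Return (data_path, root_path).

    Assumption: variables exported in the shell override entries in the .env file.
    """
    settings = {**read_env_file(env_file), **os.environ}
    if "JAMUN_DATA_PATH" not in settings:
        raise RuntimeError("JAMUN_DATA_PATH is not set")
    data_path = Path(settings["JAMUN_DATA_PATH"])
    # JAMUN_ROOT_PATH falls back to the current directory, matching the stated default.
    root_path = Path(settings.get("JAMUN_ROOT_PATH", "."))
    return data_path, root_path
```

Calling `resolve_paths()` from the repository root returns the two paths, or raises if JAMUN_DATA_PATH is defined in neither place.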

Training

Once you have downloaded the data and set the environment variables above, you can start training on Timewarp.

We recommend first running our test config (on one GPU) to check that installation was successful:

CUDA_VISIBLE_DEVICES=0 jamun_train --config-dir=configs experiment=train_test.yaml

Then, you can train on the uncapped 2AA peptides dataset:

jamun_train --config-dir=configs experiment=train_uncapped_2AA.yaml

or the uncapped 4AA peptides dataset:

jamun_train --config-dir=configs experiment=train_uncapped_4AA.yaml

We also provide example SLURM launcher scripts for training and sampling on SLURM clusters:

sbatch scripts/slurm/train.sh
sbatch scripts/slurm/sample.sh

Inference

Loading Trained Models

We provide trained models (for both sampling and restarting training) for Timewarp 2AA, Timewarp 4AA, MDGen 4AA and other datasets at https://huggingface.co/ameya98/JAMUN:

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ameya98/JAMUN

If you want to test out your own trained model, either specify the wandb_train_run_path (in the form entity/project/run_id, which can be obtained from the Overview tab in the Weights and Biases UI for your training run), or the checkpoint_dir of the trained model.

jamun_sample ... ++wandb_train_run_path=[WANDB_TRAIN_RUN_PATH]
jamun_sample ... ++checkpoint_dir=[CHECKPOINT_DIR]
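
Since the run path has to follow the entity/project/run_id form, it can be worth validating it before launching a job. The helper below is a hypothetical sketch (`WandbRunPath` and `parse_wandb_run_path` are names invented here, not part of JAMUN or the wandb library):

```python
from typing import NamedTuple


class WandbRunPath(NamedTuple):
    entity: str
    project: str
    run_id: str


def parse_wandb_run_path(path: str) -> WandbRunPath:
    """Split 'entity/project/run_id' and reject anything else."""
    parts = path.strip().strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected entity/project/run_id, got {path!r}")
    return WandbRunPath(*parts)
```

For example, `parse_wandb_run_path("my-team/jamun/abc123")` yields the three components, while a path with a missing component raises a `ValueError`.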

Sampling Conformations for a Peptide Sequence

If you want to sample conformations for a particular peptide sequence, you need to first generate a .pdb file.

We provide a script that uses AmberTools, specifically tleap. If you have a .pdb file already, then you can skip this step.

Generate .pdb file

Run:

python scripts/prepare_pdb.py [SEQUENCE] --mode [MODE] --outputdir [OUTPUTDIR]

where SEQUENCE is your peptide sequence, entered either as a string of one-letter codes (e.g. AGPF) or as a string of hyphenated three-letter codes (e.g. ALA-GLY-PRO-PHE); MODE is either capped (adds capping ACE and NME residues) or uncapped; and OUTPUTDIR is where your generated .pdb file will be saved (default is the current directory). The script will print out the path to the generated .pdb file, INIT_PDB.
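
To make the two accepted input formats concrete, here is a rough sketch of how such a sequence string could be normalized to three-letter residue names. It is an illustration under assumptions, not the logic of scripts/prepare_pdb.py: `parse_sequence` is a hypothetical function, only the 20 standard amino acids are recognized, and any input without a hyphen is read as one-letter codes.

```python
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
THREE_LETTER = set(ONE_TO_THREE.values())


def parse_sequence(sequence: str, mode: str = "uncapped") -> list[str]:
    """Return three-letter residue names, wrapped in ACE/NME caps if mode is 'capped'."""
    if mode not in ("capped", "uncapped"):
        raise ValueError(f"mode must be 'capped' or 'uncapped', got {mode!r}")
    if "-" in sequence:
        # Hyphenated three-letter codes, e.g. ALA-GLY-PRO-PHE.
        residues = [code.upper() for code in sequence.split("-")]
    else:
        # One-letter codes, e.g. AGPF; unknown letters are left as-is and rejected below.
        residues = [ONE_TO_THREE.get(letter, letter) for letter in sequence.upper()]
    unknown = [res for res in residues if res not in THREE_LETTER]
    if not residues or unknown:
        raise ValueError(f"could not parse sequence {sequence!r}")
    if mode == "capped":
        residues = ["ACE", *residues, "NME"]
    return residues
```

With this sketch, `parse_sequence("AGPF")` and `parse_sequence("ALA-GLY-PRO-PHE")` give the same four residues, and `mode="capped"` places ACE before and NME after them.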

Run sampling on .pdb

Run the sampling script, starting from the provided .pdb structure:

jamun_sample --config-dir=configs experiment=sample_custom ++init_pdb=[INIT_PDB]
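
Before sampling from your own .pdb file, it can be useful to confirm that it contains the residues you expect. The sketch below reads residue names from the fixed-column ATOM/HETATM records of the PDB format; `residues_in_pdb` is a hypothetical helper written for this example and is not part of JAMUN.

```python
def residues_in_pdb(pdb_text: str) -> list[str]:
    """Return residue names in order of appearance, one entry per residue."""
    residues: list[str] = []
    last_key = None
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Fixed PDB columns (1-indexed): resName 18-20, chainID 22, resSeq 23-26, iCode 27.
        res_name = line[17:20].strip()
        key = (line[21:22], line[22:27])
        if key != last_key:
            residues.append(res_name)
            last_key = key
    return residues
```

For a file at INIT_PDB, `residues_in_pdb(open(INIT_PDB).read())` lists its residues, which should match the sequence you intended to sample.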

Sampling Test Peptides from Timewarp

We also provide configs for sampling the uncapped 2AA and 4AA peptides in the Timewarp test set.

jamun_sample --config-dir=configs experiment=sample_uncapped_2AA.yaml checkpoint_dir=...

jamun_sample --config-dir=configs experiment=sample_uncapped_4AA.yaml checkpoint_dir=...

Analysis

We provide scripts for analysing JAMUN and original MD trajectories in https://github.com/prescient-design/jamun/tree/main/analysis.

Data Generation

Running Molecular Dynamics with OpenMM

We provide scripts for generating MD simulation data with OpenMM, including energy minimization and calibration steps with NVT and NPT ensembles.

python scripts/generate_data/run_simulation.py [INIT_PDB]

The defaults correspond to our setup for the capped diamines. Please run this script with the -h flag to see all simulation parameters.

Preprocessing

Some datasets require preprocessing for easier consumption, for example the MDGen data:

source .env
python scripts/process_mdgen.py \
  --input-dir ${JAMUN_DATA_PATH}/mdgen \
  --output-dir ${JAMUN_DATA_PATH}/mdgen/data/4AA_sims_partitioned_chunked

Citation

If you found this repository useful, please cite our preprint!

@misc{daigavane2024jamuntransferablemolecularconformational,
      title={JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling},
      author={Ameya Daigavane and Bodhi P. Vani and Saeed Saremi and Joseph Kleinhenz and Joshua Rackers},
      year={2024},
      eprint={2410.14621},
      archivePrefix={arXiv},
      primaryClass={physics.bio-ph},
      url={https://arxiv.org/abs/2410.14621},
}
