Commit 1d2a836

Merge pull request #6 from prescient-design/n/reenable
enable tests in CICD
2 parents e0dad8f + c1231b1 commit 1d2a836

91 files changed

+2235
-1174
lines changed

.gitattributes

+8
@@ -1 +1,9 @@
 *.ckpt filter=lfs diff=lfs merge=lfs -text
+
+# Set default behavior to automatically normalize line endings.
+* text=auto
+
+# Explicitly declare text files you want to always be normalized and converted
+# to native line endings on checkout.
+*.py text eol=lf
+*.toml text
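
Reviewer note: the new `.gitattributes` rules ask Git to store text files with LF line endings, and `*.py text eol=lf` additionally forces LF in the working tree. A minimal Python sketch of the CRLF-to-LF normalization step, assuming the simple case of well-formed CRLF pairs (`normalize_to_lf` is a hypothetical helper for illustration, not Git's implementation):

```python
def normalize_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; lone CR bytes are left alone."""
    return data.replace(b"\r\n", b"\n")


# A Python source file as it might be saved by an editor on Windows.
windows_style = b"import torch\r\nprint(torch.__version__)\r\n"
unix_style = normalize_to_lf(windows_style)

print(unix_style)
```

Running the function on content that is already LF-only is a no-op, which is why normalization of this kind is safe to apply repeatedly.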

.github/workflows/push.yml

+53-49
@@ -10,33 +10,41 @@ jobs:
           python-version: "3.x"
       - run: "python -m pip install --upgrade build"
       - run: "python -m build ."
-      - uses: "actions/upload-artifact@v3"
+      - uses: "actions/upload-artifact@v4"
         with:
           name: "python-package-distributions"
           path: "dist/"
-  # pytest:
-  #   strategy:
-  #     matrix:
-  #       platform:
-  #         - "macos-latest"
-  #         - "ubuntu-latest"
-  #         # - "windows-latest"
-  #       python:
-  #         - "3.10"
-  #         - "3.11"
-  #   runs-on: ${{ matrix.platform }}
-  #   steps:
-  #     - uses: "actions/checkout@v4"
-  #     - uses: "actions/setup-python@v5"
-  #       with:
-  #         python-version: ${{ matrix.python }}
-  #     - run: "python -m pip install -r requirements.in"
-  #     - run: "python -m pip install -r requirements-dev.in"
-  #     - run: "python -m pip install --editable ."
-  #     - run: "python -m pytest"
-  #     - env:
-  #         CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
-  #       uses: "codecov/codecov-action@v3"
+  pytest:
+    strategy:
+      matrix:
+        platform:
+          - "macos-latest"
+          - "ubuntu-latest"
+          # - "windows-latest"
+        python:
+          - "3.10"
+    runs-on: ${{ matrix.platform }}
+    steps:
+      - uses: "actions/checkout@v4"
+      - uses: "actions/setup-python@v5"
+        with:
+          python-version: ${{ matrix.python }}
+      - run: "python -m pip install -r requirements.in"
+      - run: "python -m pip install -r requirements-dev.in"
+      - run: "python -m pip install -r requirements-mgm.in"
+      - run: "python -m pip install --editable ."
+      - run: "python -m pytest"
+      - env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+        uses: "codecov/codecov-action@v3"
+  ruff:
+    runs-on: "ubuntu-latest"
+    steps:
+      - uses: "actions/checkout@v4"
+      - uses: "astral-sh/ruff-action@v1"
+        with:
+          args: "format --check"
+          version: 0.7.3
   # pypi:
   #   environment:
   #     name: "pypi.org"
@@ -74,29 +82,25 @@ jobs:
   #     - env:
   #         GITHUB_TOKEN: "${{ github.token }}"
   #       run: "gh release upload '${{ github.ref_name }}' dist/** --repo '${{ github.repository }}'"
-  # ruff:
-  #   runs-on: "ubuntu-latest"
-  #   steps:
-  #     - uses: "actions/checkout@v4"
-  #     - uses: "chartboost/ruff-action@v1"
-  #       with:
-  #         args: "format --check"
-  # testpypi:
-  #   environment:
-  #     name: "test.pypi.org"
-  #     url: "https://test.pypi.org/project/lbster"
-  #   needs:
-  #     - "build"
-  #   permissions:
-  #     id-token: "write"
-  #   runs-on: "ubuntu-latest"
-  #   steps:
-  #     - uses: "actions/download-artifact@v3"
-  #       with:
-  #         name: "python-package-distributions"
-  #         path: "dist/"
-  #     - uses: "pypa/gh-action-pypi-publish@release/v1"
-  #       with:
-  #         repository-url: "https://test.pypi.org/legacy/"
-  #         skip-existing: true
+  testpypi:
+    environment:
+      name: "test.pypi.org"
+      url: "https://test.pypi.org/project/lbster"
+    needs:
+      - "build"
+    permissions:
+      id-token: "write"
+    runs-on: "ubuntu-latest"
+    steps:
+      - uses: "actions/download-artifact@v4"
+        with:
+          name: "python-package-distributions"
+          path: "dist/"
+      - uses: "pypa/gh-action-pypi-publish@release/v1"
+        with:
+          user: __token__
+          password: ${{ secrets.TEST_PYPI_API_TOKEN }}
+          repository-url: "https://test.pypi.org/legacy/"
+          skip-existing: true
+          verbose: true
 on: "push"
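
Reviewer note: the re-enabled `pytest` job runs once per combination of its `platform` and `python` matrix axes. The sketch below illustrates that expansion in Python, assuming the usual GitHub Actions behaviour of taking the Cartesian product of the axes; `expand_matrix` is a hypothetical helper and not part of this repository.

```python
from itertools import product


def expand_matrix(matrix: dict) -> list:
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


# Matrix as written in the previously commented-out job definition.
old_matrix = {"platform": ["macos-latest", "ubuntu-latest"], "python": ["3.10", "3.11"]}
# Matrix as re-enabled by this commit ("3.11" was dropped).
new_matrix = {"platform": ["macos-latest", "ubuntu-latest"], "python": ["3.10"]}

old_jobs = expand_matrix(old_matrix)
new_jobs = expand_matrix(new_matrix)

for job in new_jobs:
    print(job)
```

Under that assumption the re-enabled matrix yields two jobs (macOS and Ubuntu on Python 3.10), where the commented-out definition would have yielded four.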

README.md

+23-20
@@ -2,15 +2,18 @@
 **L**anguage models for **B**iological **S**equence **T**ransformation and **E**volutionary **R**epresentation
 
 
-`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdlesalam Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
+`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdlesalam Ismail](https://github.com/ayaabdelsalam91), [Joseph Kleinhenz](https://github.com/kleinhenz) and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
+
+This repository contains training code and access to pre-trained language models for biological sequence data.
+
+## Usage
 
-This repository contains code and access to pre-trained language models for biological sequence data.
 
 <!---
 image credit: Amy Wang
 -->
 <p align="center">
-<img src="assets/lobster.png" width=200px>
+<img src="https://raw.githubusercontent.com/prescient-design/lobster/refs/heads/main/assets/lobster.png" width=200px>
 </p>
 
 
@@ -21,17 +24,19 @@ image credit: Amy Wang
 - [Install instructions](#install)
 - [Models](#main-models)
 - [Notebooks](#notebooks)
-- [Usage](#usage)
+- [Training and inference](#training)
+- [Contributing](#contributing)
 </details>
 
 ## Why you should use LBSTER <a name="why-use"></a>
 * LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
 * LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
 * LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
 * LBSTER supports concepts; we have a concept-bottleneck protein language model we refer to as CB-LBSTER, which supports 718 concepts.
+
 ## Citations <a name="citations"></a>
 If you use the code and/or models, please cite the relevant papers.
-For the `lbster` code base cite: [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108})
+For the `lbster` code base cite: [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108)
 ```bibtex
 @article{Frey2024.05.14.594108,
 author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
@@ -48,21 +53,19 @@ For the `lbster` code base cite: [Cramming Protein Language Model Training in 24
 ```
 
 
-<!-- For the `cb-lbster` code base cite: [Concept bottleneck Protien Language](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108})
+For the `cb-lbster` code base cite: [Concept Bottleneck Language Models for Protein Design](https://arxiv.org/abs/2411.06090)
 ```bibtex
-@article{Frey2024.05.14.594108,
-author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
-title = {Cramming Protein Language Model Training in 24 GPU Hours},
-elocation-id = {2024.05.14.594108},
-year = {2024},
-doi = {10.1101/2024.05.14.594108},
-publisher = {Cold Spring Harbor Laboratory},
-URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
-eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
-journal = {bioRxiv}
+@article{ismail2024conceptbottlenecklanguagemodels,
+title={Concept Bottleneck Language Models For protein design},
+author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
+year={2024},
+eprint={2411.06090},
+archivePrefix={arXiv},
+primaryClass={cs.LG},
+url={https://arxiv.org/abs/2411.06090},
 }
 
-``` -->
+```
 
 ## Install <a name="install"></a>
 clone the repo, cd into it and do `mamba env create -f env.yml`
@@ -118,7 +121,7 @@ Check out [jupyter notebook tutorial](notebooks/01-inference.ipynb) for example
 Check out [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for example on to intervene on different concepts for our concept-bottleneck models class.
 
 
-## Usage <a name="usage"></a>
+## Training and inference <a name="training"></a>
 
 ### Embedding
 The entrypoint `lobster_embed` is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_embed --help` or by looking in the src/lobster/hydra_config directory
@@ -141,15 +144,15 @@ model.naturalness(sequences)
 model.likelihood(sequences)
 ```
 
-## Training from scratch
+### Training from scratch
 The entrypoint `lobster_train` is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_train --help` or by looking in the src/lobster/hydra_config directory
 
 To train an MLM on a fasta file of sequences on an interactive GPU node, cd into the root dir of this repo and do
 ```bash
 lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
 ```
 
-## Contributing
+## Contributing <a name="contributing"></a>
 Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters, and our open-source efforts are a labor of love because we care deeply about open science and scientific progress.
 
 ### Install dev requirements and pre-commit hooks

docs/CONTRIBUTORS.md

-1
@@ -1,5 +1,4 @@
 * Karina Zadorozhny
-* Joseph Kleinhenz
 * Matthieu Kirchmeyer
 * Sai Pooja Mahajan
 * Amy Wang

model_testing/inference.py

+5-3
@@ -1,13 +1,13 @@
 # Lobster Model Inference
 
 import torch
-from lobster.model import LobsterPMLM, LobsterCBMPMLM
+from lobster.model import LobsterCBMPMLM, LobsterPMLM
 
 # Define the test protein sequence
 test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
 
 # Determine the device
-device = 'cuda' if torch.cuda.is_available() else 'cpu'
+device = "cuda" if torch.cuda.is_available() else "cpu"
 
 # Load the LobsterPMLM model
 lobster = LobsterPMLM("asalam91/lobster_24M").to(device)
@@ -29,7 +29,9 @@
 
 # Get protein concepts
 test_protein_concepts = cb_lobster.sequences_to_concepts([test_protein])[-1]
-test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0] # All of the known concepts are the same for all tokens...
+test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][
+    0
+] # All of the known concepts are the same for all tokens...
 test_protein_concepts_unknown_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1]
 
 # Print results

model_testing/intervene.py

+6-6
@@ -1,19 +1,19 @@
+import Levenshtein
 import torch
 from lobster.model import LobsterCBMPMLM
-import Levenshtein
 
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
 # Load the LobsterCBMPMLM model
 cb_lobster = LobsterCBMPMLM("asalam91/cb_lobster_24M").to(device)
 cb_lobster.eval()
-print (cb_lobster.list_supported_concept())
+print(cb_lobster.list_supported_concept())
 
-concept ="gravy"
-test_protein ="MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
+concept = "gravy"
+test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
 
-[new_protien] = cb_lobster.intervene_on_sequences([test_protein],concept,edits=5,intervention_type="negative")
+[new_protien] = cb_lobster.intervene_on_sequences([test_protein], concept, edits=5, intervention_type="negative")
 
 
 print(new_protien)
-print(Levenshtein.distance(test_protein, new_protien))
+print(Levenshtein.distance(test_protein, new_protien))
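
Reviewer note: `intervene.py` reports how far the intervened sequence drifted from the input by printing `Levenshtein.distance(test_protein, new_protien)`. For readers without the `Levenshtein` package, the following is a minimal pure-Python sketch of the same metric using the textbook two-row dynamic program. `edit_distance` and the short example sequences are hypothetical and only illustrate the measurement; this is not the package's implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to limit memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


original = "MGAGASAEEK"  # first ten residues of the test protein
edited = "MGAGKSAEEK"  # hypothetical variant with one substitution (A -> K)
distance = edit_distance(original, edited)
print(distance)
```

If each of the `edits=5` requested by the script is a single-residue substitution, the reported distance would be at most 5; whether `intervene_on_sequences` behaves that way is an assumption, not something this diff shows.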

pyproject.toml

+4-3
@@ -1,8 +1,9 @@
 [project]
 name = "lbster"
+readme = "README.md"
 description = "Language models for Biological Sequence Transformation and Evolutionary Representation."
 authors = [{name = "Nathan C. Frey", email = "frey.nathan.nf1@gene.com"}]
-dynamic = ["version", "readme", "dependencies", "optional-dependencies"]
+dynamic = ["version", "dependencies", "optional-dependencies"]
 requires-python = ">=3.10"
 
 [build-system]
@@ -20,10 +21,10 @@ lobster_eval = "lobster.cmdline:eval_embed"
 
 [tool.setuptools.dynamic]
 dependencies = {file = ["requirements.in"]}
-readme = {file = "README.md"}
 
 [tool.setuptools.dynamic.optional-dependencies]
 dev = {file = ["requirements-dev.in"]}
+mgm = {file = ["requirements-mgm.in"]}
 
 [tool.setuptools.packages.find]
 where = ["src"]
@@ -36,8 +37,8 @@ lobster = ["*.txt", "*.json", "*.yaml"]
 [tool.setuptools_scm]
 search_parent_directories = true
 version_scheme = "no-guess-dev"
-local_scheme = "node-and-date"
 fallback_version = "0.0.0"
+local_scheme = "no-local-version" # see https://github.com/pypa/setuptools-scm/issues/455
 
 [tool.ruff]
 line-length = 120
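
Reviewer note: switching `local_scheme` from `"node-and-date"` to `"no-local-version"` drops the `+...` local segment from the versions setuptools-scm generates. This presumably supports the re-enabled `testpypi` job, since PyPI-style indexes reject uploads whose version carries a PEP 440 local segment. A small sketch of the distinction follows; `split_local`, `is_uploadable`, and the version strings are hypothetical illustrations, not output captured from this repository.

```python
from typing import Optional, Tuple


def split_local(version: str) -> Tuple[str, Optional[str]]:
    """Split a PEP 440 version string into its public part and optional local segment."""
    public, separator, local = version.partition("+")
    return public, (local if separator else None)


def is_uploadable(version: str) -> bool:
    """Assumption: an index accepts the version only if it has no local segment."""
    return split_local(version)[1] is None


# Hypothetical versions in the style of the two local schemes.
with_local = "0.1.0.post1.dev3+g1d2a836.d20241113"  # node-and-date style
without_local = "0.1.0.post1.dev3"  # no-local-version style

for candidate in (with_local, without_local):
    print(candidate, is_uploadable(candidate))
```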

requirements-mgm.in

+2
@@ -0,0 +1,2 @@
+selfies
+rdkit

requirements.in

+7-2
@@ -24,5 +24,10 @@ fastparquet
 datasketch
 peft
 icecream
-selfies
-rdkit
+captum
+pooch
+edlib
+onnx
+onnxscript
+beignet[all]
+fair-esm

src/lobster/.DS_Store

-6 KB
Binary file not shown.

src/lobster/_imports.py

-1
This file was deleted.

src/lobster/data/__init__.py

-4
@@ -3,7 +3,6 @@
 from ._constants import ( # nopycln: import
     ESM_MODEL_NAMES,
 )
-from ._cyno_pk_datamodule import CynoPKClearanceLightningDataModule
 from ._dataframe_dataset_in_memory import ( # nopycln: import
     DataFrameDatasetInMemory,
     DataFrameLightningDataModule,
@@ -16,15 +15,12 @@
 )
 from ._minhasher import LobsterMinHasher
 from ._mmseqs import MMSeqsRunner
-from ._neglog_datamodule import NegLogDataModule
 from ._structure_datamodule import PDBDataModule
 from ._utils import ( # nopycln: import
     load_pickle,
 )
 
 __all__ = [
-    "ContactMapDataModule",
-    "NegLogDataModule",
     "PDBDataModule",
     "DataFrameDatasetInMemory",
 ]

src/lobster/data/_calm_datamodule.py

+1-1
@@ -4,13 +4,13 @@
 from typing import Any, Callable, Iterable, Optional, Sequence, TypeVar, Union
 
 import torch.utils.data
-from lobster.transforms import Transform
 from lightning import LightningDataModule
 from torch import Generator
 from torch.utils.data import DataLoader, Sampler
 
 from lobster.datasets._calm_dataset import CalmDataset
 from lobster.tokenization import PmlmTokenizerTransform
+from lobster.transforms import Transform
 
 T = TypeVar("T")
 