
Commit e0dad8f

ismaia11 authored and kleinhenz committed

add cb-plm

1 parent 40ec9d7 commit e0dad8f

File tree

136 files changed: +176001 -1724 lines

.gitattributes (+1)

@@ -0,0 +1 @@
+*.ckpt filter=lfs diff=lfs merge=lfs -text

.gitignore (+1)

@@ -25,3 +25,4 @@ docs/_build
 scripts/*.fasta
 scripts/fastas/*.fasta
 scripts/combined_db*
+*_play.py

README.md (+56 -11)
@@ -1,8 +1,8 @@
 # LBSTER 🦞
 **L**anguage models for **B**iological **S**equence **T**ransformation and **E**volutionary **R**epresentation

-## A language model library for rapid pre-training from scratch.
-`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
+
+`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdelsalam Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).

 This repository contains code and access to pre-trained language models for biological sequence data.
@@ -13,27 +13,25 @@ image credit: Amy Wang
 <img src="assets/lobster.png" width=200px>
 </p>

-## Notice: Alpha Release
-This is an alpha release. The API is subject to change and the documentation is incomplete.
-*LBSTER is a work-in-progress. Contributions and feedback are encouraged!*

 <details open><summary><b>Table of contents</b></summary>

 - [Why you should use LBSTER](#why-use)
 - [Citations](#citations)
 - [Install instructions](#install)
 - [Models](#main-models)
+- [Notebooks](#notebooks)
 - [Usage](#usage)
 </details>

 ## Why you should use LBSTER <a name="why-use"></a>
 * LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
 * LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
 * LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
-
+* LBSTER also supports concepts: CB-LBSTER is a concept-bottleneck protein language model that supports 718 concepts (see the sketch after the model list below).
 ## Citations <a name="citations"></a>
 If you use the code and/or models, please cite the relevant papers.
-For the `lbster` code base:
+For the `lbster` code base, cite [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108):
 ```bibtex
 @article{Frey2024.05.14.594108,
 author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
@@ -50,6 +48,22 @@ For the `lbster` code base:
 ```


+<!-- For the `cb-lbster` code base, cite [Concept Bottleneck Protein Language Models](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108):
+```bibtex
+@article{Frey2024.05.14.594108,
+author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
+title = {Cramming Protein Language Model Training in 24 GPU Hours},
+elocation-id = {2024.05.14.594108},
+year = {2024},
+doi = {10.1101/2024.05.14.594108},
+publisher = {Cold Spring Harbor Laboratory},
+URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
+eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
+journal = {bioRxiv}
+}
+``` -->
 ## Install <a name="install"></a>
 Clone the repo, cd into it, and run `mamba env create -f env.yml`.
 Then, from the root of the repo, run:
@@ -58,21 +72,52 @@ pip install -e .
 ```
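A minimal post-install sanity check, sketched here as a suggestion; it assumes only the import paths that appear elsewhere in this commit:

```python
# Post-install smoke test: confirm the package and model classes import.
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

print("ok:", LobsterPMLM.__name__, LobsterPCLM.__name__, LobsterCBMPMLM.__name__)
```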

 ## Main models you should use <a name="main-models"></a>
+
+### Pretrained Models
+
+#### Masked LMs
+| Shorthand | #params | Dataset | Description | Model checkpoint |
+|-----------|---------|---------|-------------|------------------|
+| Lobster_24M | 24M | UniRef50 | 24M-parameter protein masked language model trained on UniRef50 | [lobster_24M](https://huggingface.co/asalam91/lobster_24M) |
+| Lobster_150M | 150M | UniRef50 | 150M-parameter protein masked language model trained on UniRef50 | [lobster_150M](https://huggingface.co/asalam91/lobster_150M) |
+
+#### CB LMs
+| Shorthand | #params | Dataset | Description | Model checkpoint |
+|-----------|---------|---------|-------------|------------------|
+| cb_Lobster_24M | 24M | UniRef50 + SwissProt | 24M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_24M](https://huggingface.co/asalam91/cb_lobster_24M) |
+| cb_Lobster_150M | 150M | UniRef50 + SwissProt | 150M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_150M](https://huggingface.co/asalam91/cb_lobster_150M) |
+| cb_Lobster_650M | 650M | UniRef50 + SwissProt | 650M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_650M](https://huggingface.co/asalam91/cb_lobster_650M) |
+| cb_Lobster_3B | 3B | UniRef50 + SwissProt | 3B-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_3B](https://huggingface.co/asalam91/cb_lobster_3B) |
 ### Loading a pre-trained model
 ```python
-from lobster.model import LobsterPMLM, LobsterPCLM
-masked_language_model = LobsterPMLM.load_from_checkpoint(<path to ckpt>)
+from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM
+masked_language_model = LobsterPMLM("asalam91/lobster_mlm_24M")
+concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
 causal_language_model = LobsterPCLM.load_from_checkpoint(<path to ckpt>)
 ```
 3D, cDNA, and dynamic models use the same classes.

-NOTE: Pre-trained model checkpoints *may* be included in future releases!
 **Models**
 * LobsterPMLM: masked language model (BERT-style encoder-only architecture)
+* LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
 * LobsterPCLM: causal language model (Llama-style decoder-only architecture)
 * LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)
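To make the concept-bottleneck design concrete, here is a minimal sketch of the forward pass such a model implies: the encoder's token representations are squeezed through a layer of interpretable concept activations before a linear decoder reconstructs the tokens. The module and hyperparameter names (`ConceptBottleneckMLM`, `d_model`, `n_concepts`) are illustrative assumptions, not LobsterCBMPMLM's actual internals.

```python
# A minimal sketch of a concept-bottleneck masked LM; names and sizes are
# hypothetical, not the real LobsterCBMPMLM implementation.
import torch
import torch.nn as nn

class ConceptBottleneckMLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 408, n_concepts: int = 718):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Bottleneck: project token representations onto interpretable concepts.
        self.concept_head = nn.Linear(d_model, n_concepts)
        # Linear decoder maps concept activations back to token logits.
        self.decoder = nn.Linear(n_concepts, vocab_size)

    def forward(self, token_ids: torch.Tensor):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        concepts = self.concept_head(h)          # (batch, seq, n_concepts)
        logits = self.decoder(concepts)          # (batch, seq, vocab)
        return logits, concepts

# Training would combine a masked-LM loss on `logits` with a supervised loss
# tying `concepts` to annotated properties (e.g., from SwissProt).
```

Because the decoder sees only the concept activations, editing those activations directly is what makes interventions like `intervene_on_sequences` (used in `model_testing/intervene.py` below) possible.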
+## Notebooks <a name="notebooks"></a>
+
+### Representation learning
+
+Check out the [jupyter notebook tutorial](notebooks/01-inference.ipynb) for an example of how to extract embedding representations from the different models.
+
+### Concept Interventions
+
+Check out the [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for an example of how to intervene on different concepts with our concept-bottleneck model class.
 ## Usage <a name="usage"></a>

 ### Embedding

env.yml (+2 -1)

@@ -1,4 +1,4 @@
-name: lobster
+name: lobster_public
 channels:
 - anaconda
 - pytorch
@@ -8,6 +8,7 @@ dependencies:
 - pip
 - python==3.10.*
 - setuptools
+- git-lfs
 - pip:
   - -r requirements.in
   - -r requirements-dev.in

model_testing/inference.py (+42)

@@ -0,0 +1,42 @@
+# Lobster Model Inference
+
+import torch
+from lobster.model import LobsterPMLM, LobsterCBMPMLM
+
+# Define the test protein sequence
+test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
+
+# Determine the device
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the LobsterPMLM model
+lobster = LobsterPMLM("asalam91/lobster_24M").to(device)
+lobster.eval()
+
+# Get MLM representation
+mlm_representation = lobster.sequences_to_latents([test_protein])[-1]
+cls_token_mlm_representation = mlm_representation[:, 0, :]
+pooled_mlm_representation = torch.mean(mlm_representation, dim=1)
+
+# Load the LobsterCBMPMLM model
+cb_lobster = LobsterCBMPMLM("asalam91/cb_lobster_24M").to(device)
+cb_lobster.eval()
+
+# Get CB MLM representation
+cb_mlm_representation = cb_lobster.sequences_to_latents([test_protein])[-1]
+cls_token_cb_mlm_representation = cb_mlm_representation[:, 0, :]
+pooled_cb_mlm_representation = torch.mean(cb_mlm_representation, dim=1)
+
+# Get protein concepts
+test_protein_concepts = cb_lobster.sequences_to_concepts([test_protein])[-1]
+test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0]  # all of the known concepts are the same for all tokens
+test_protein_concepts_unknown_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1]
+
+# Print results
+print("CLS token MLM representation:", cls_token_mlm_representation.shape)
+print("Pooled MLM representation:", pooled_mlm_representation.shape)
+print("CLS token CB MLM representation:", cls_token_cb_mlm_representation.shape)
+print("Pooled CB MLM representation:", pooled_cb_mlm_representation.shape)
+print("Test protein concepts:", test_protein_concepts.shape)
+print("Test protein concepts embedding:", test_protein_concepts_emb.shape)
+print("Test protein unknown concepts embedding:", test_protein_concepts_unknown_emb.shape)

model_testing/intervene.py (+19)

@@ -0,0 +1,19 @@
+import torch
+from lobster.model import LobsterCBMPMLM
+import Levenshtein
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the LobsterCBMPMLM model
+cb_lobster = LobsterCBMPMLM("asalam91/cb_lobster_24M").to(device)
+cb_lobster.eval()
+print(cb_lobster.list_supported_concept())
+
+concept = "gravy"
+test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
+
+# Push the GRAVY (hydropathy) concept down with at most 5 edits
+[new_protein] = cb_lobster.intervene_on_sequences([test_protein], concept, edits=5, intervention_type="negative")
+
+print(new_protein)
+print(Levenshtein.distance(test_protein, new_protein))
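A natural follow-up, sketched here using only the calls already shown above: sweep the `edits` budget and watch how far the negative intervention moves the sequence.

```python
# Sweep the edit budget for the same concept and measure sequence drift.
for edits in (1, 5, 10, 20):
    [edited] = cb_lobster.intervene_on_sequences(
        [test_protein], concept, edits=edits, intervention_type="negative"
    )
    print(f"edits={edits}: distance =", Levenshtein.distance(test_protein, edited))
```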

notebooks/01-inference.ipynb (+74)

@@ -0,0 +1,74 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Lobster Model Inference Notebook\n",
+    "\n",
+    "import torch\n",
+    "from lobster.model import LobsterPMLM, LobsterCBMPMLM\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the test protein sequence\n",
+    "test_protein = \"MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF\"\n",
+    "\n",
+    "# Determine the device\n",
+    "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the LobsterPMLM model\n",
+    "lobster = LobsterPMLM(\"asalam91/lobster_24M\").to(device)\n",
+    "lobster.eval()\n",
+    "\n",
+    "# Get MLM representation\n",
+    "mlm_representation = lobster.sequences_to_latents([test_protein])[-1]\n",
+    "cls_token_mlm_representation = mlm_representation[:, 0, :]\n",
+    "pooled_mlm_representation = torch.mean(mlm_representation, dim=1)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Load the LobsterCBMPMLM model\n",
+    "cb_lobster = LobsterCBMPMLM(\"asalam91/cb_lobster_24M\").to(device)\n",
+    "cb_lobster.eval()\n",
+    "\n",
+    "# Get CB MLM representation\n",
+    "cb_mlm_representation = cb_lobster.sequences_to_latents([test_protein])[-1]\n",
+    "cls_token_cb_mlm_representation = cb_mlm_representation[:, 0, :]\n",
+    "pooled_cb_mlm_representation = torch.mean(cb_mlm_representation, dim=1)\n",
+    "\n",
+    "# Get protein concepts\n",
+    "test_protein_concepts = cb_lobster.sequences_to_concepts([test_protein])[-1]\n",
+    "test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0]  # all of the known concepts are the same for all tokens\n",
+    "test_protein_concepts_unknown_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1]"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

notebooks/02-intervention.ipynb (+91)

@@ -0,0 +1,91 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "5476acf8",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "736f4148-e48d-40da-930b-f73560f048c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from lobster.model import LobsterCBMPMLM\n",
+    "import Levenshtein"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "488c8208-7378-4846-b539-c673cbcf704a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+    "\n",
+    "# Load the LobsterCBMPMLM model\n",
+    "cb_lobster = LobsterCBMPMLM(\"asalam91/cb_lobster_24M\").to(device)\n",
+    "cb_lobster.eval()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5a6bb60c-a116-43c7-8ba2-362c9d2b210d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cb_lobster.list_supported_concept()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50615bfa-372e-4ad2-bf31-585d3861cb9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "concept = \"gravy\"\n",
+    "test_protein = \"MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF\"\n",
+    "\n",
+    "[new_protein] = cb_lobster.intervene_on_sequences([test_protein], concept, edits=5, intervention_type=\"negative\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb353041-bd72-4c35-bbdd-c820ef8e7af0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(new_protein)\n",
+    "print(Levenshtein.distance(test_protein, new_protein))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

py.typed

Whitespace-only changes.
