# LBSTER 🦞
**L**anguage models for **B**iological **S**equence **T**ransformation and **E**volutionary **R**epresentation
## A language model library for rapid pre-training from scratch.
`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdelsalam Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
This repository contains code and access to pre-trained language models for biological sequence data.
image credit: Amy Wang
<img src="assets/lobster.png" width=200px>
</p>
## Notice: Alpha Release
This is an alpha release. The API is subject to change and the documentation is incomplete.
*LBSTER is a work-in-progress. Contributions and feedback are encouraged!*
<details open><summary><b>Table of contents</b></summary>
- [Why you should use LBSTER](#why-use)
- [Citations](#citations)
- [Install instructions](#install)
- [Models](#main-models)
- [Notebooks](#notebooks)
- [Usage](#usage)

</details>
## Why you should use LBSTER <a name="why-use"></a>
* LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
* LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
* LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
* LBSTER supports concepts; we have a concept-bottleneck protein language model we refer to as CB-LBSTER, which supports 718 concepts.
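Experimenting with pre-training objectives often starts with the input-corruption step itself. Below is a minimal, stdlib-only sketch of masked-language-model masking applied to a protein sequence; the mask token, masking rate, and `mask_sequence` helper are invented for illustration and are not lobster's actual tokenizer or API:

```python
import random

# Illustrative MLM-style masking: replace a fraction of residues with a mask
# token; a model would then be trained to recover the original residues.
MASK = "<mask>"

def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Return (masked token list, indices of masked positions)."""
    rng = random.Random(seed)
    tokens = list(sequence)
    masked_positions = [i for i in range(len(tokens)) if rng.random() < mask_rate]
    for i in masked_positions:
        tokens[i] = MASK
    return tokens, masked_positions

tokens, positions = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
# Every recorded position now holds the mask token.
assert all(tokens[i] == MASK for i in positions)
```

In a real pipeline the masked positions become the prediction targets; controlling this step is part of what "pre-training from scratch" gives you over using a frozen model.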
## Citations <a name="citations"></a>
If you use the code and/or models, please cite the relevant papers.
For the `lbster` code base cite: [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108)
```bibtex
@article{Frey2024.05.14.594108,
author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
title = {Cramming Protein Language Model Training in 24 GPU Hours},
journal = {bioRxiv},
year = {2024}
}
```
<!-- For the `cb-lbster` code base cite: [Concept Bottleneck Protein Language](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108)
```bibtex
@article{Frey2024.05.14.594108,
author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
title = {Cramming Protein Language Model Training in 24 GPU Hours},
}
```
-->
| Model | Parameters | Training data | Description | Checkpoint |
| --- | --- | --- | --- | --- |
| cb_Lobster_24M | 24M | uniref50+SwissProt | 24M-parameter protein concept-bottleneck model with 718 concepts | [cb_lobster_24M](https://huggingface.co/asalam91/cb_lobster_24M) |
| cb_Lobster_150M | 150M | uniref50+SwissProt | 150M-parameter protein concept-bottleneck model with 718 concepts | [cb_lobster_150M](https://huggingface.co/asalam91/cb_lobster_150M) |
| cb_Lobster_650M | 650M | uniref50+SwissProt | 650M-parameter protein concept-bottleneck model with 718 concepts | [cb_lobster_650M](https://huggingface.co/asalam91/cb_lobster_650M) |
| cb_Lobster_3B | 3B | uniref50+SwissProt | 3B-parameter protein concept-bottleneck model with 718 concepts | [cb_lobster_3B](https://huggingface.co/asalam91/cb_lobster_3B) |
### Loading a pre-trained model
```python
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

masked_language_model = LobsterPMLM.load_from_checkpoint(<path to ckpt>)
```
Check out the [jupyter notebook tutorial](notebooks/01-inference.ipynb) for an example of how to extract embedding representations from different models.
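A common way to reduce a model's per-token hidden states to a single sequence-level embedding is mean pooling. Here is a stdlib-only sketch with toy vectors standing in for real model outputs; `mean_pool` is an illustrative helper, not part of the lobster API:

```python
def mean_pool(token_embeddings):
    """Average equal-length per-token vectors into one sequence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Three toy "token embeddings" of dimension 4; a real model would produce
# one hidden-state vector per residue of the input sequence.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 4.0, 2.0],
]
sequence_embedding = mean_pool(tokens)
print(sequence_embedding)  # [2.0, 2.0, 2.0, 2.0]
```

The pooled vector can then be fed to downstream property predictors or similarity searches.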
### Concept Interventions
Check out the [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for an example of how to intervene on different concepts with our concept-bottleneck model class.
"test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0] # All of the known concepts are the same for all tokens...\n",