`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdelsalam Ismail](https://github.com/ayaabdelsalam91), [Joseph Kleinhenz](https://github.com/kleinhenz) and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
This repository contains training code and access to pre-trained language models for biological sequence data.
## Usage
<!---
image credit: Amy Wang
-->

<details>
<summary>Table of contents</summary>

- [Install instructions](#install)
- [Models](#main-models)
- [Notebooks](#notebooks)
- [Training and inference](#training)
- [Contributing](#contributing)

</details>
## Why you should use LBSTER <a name="why-use"></a>
* LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
* LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
* LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
* LBSTER supports concepts; we have a concept-bottleneck protein language model we refer to as CB-LBSTER, which supports 718 concepts.
## Citations <a name="citations"></a>
If you use the code and/or models, please cite the relevant papers.
For the `lbster` code base cite: [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108)
```bibtex
@article{Frey2024.05.14.594108,
author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
  title = {Cramming Protein Language Model Training in 24 GPU Hours},
  year = {2024},
  journal = {bioRxiv},
  url = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108}
}
```
For the `cb-lbster` code base cite: [Concept Bottleneck Language Models for Protein Design](https://arxiv.org/abs/2411.06090)
```bibtex
@misc{ismail2024conceptbottlenecklanguagemodels,
      title={Concept Bottleneck Language Models For protein design},
      author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
      year={2024},
      eprint={2411.06090},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.06090},
}
```
## Install <a name="install"></a>
Clone the repo, `cd` into it, and run `mamba env create -f env.yml`.
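Concretely, the steps above look like the following. The clone URL and environment name are assumptions (adjust them to match your fork and the `name:` field in `env.yml`):

```shell
# Clone the repository (URL assumed; use your fork/origin if it differs)
git clone https://github.com/prescient-design/lobster.git
cd lobster

# Create the environment defined by the repo's env.yml
mamba env create -f env.yml

# Activate it (environment name "lobster" is an assumption; check env.yml)
conda activate lobster
```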
Check out the [jupyter notebook tutorial](notebooks/01-inference.ipynb) for an example of running inference with pre-trained models.
Check out the [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for an example of how to intervene on different concepts with our concept-bottleneck model class.
## Training and inference <a name="training"></a>
### Embedding
The entrypoint `lobster_embed` is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_embed --help` or by looking in the `src/lobster/hydra_config` directory.
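As a sketch, an invocation passes Hydra overrides on the command line. The override keys below (`data.path_to_fasta`, `checkpoint`) are illustrative assumptions, not a guaranteed part of the config schema; confirm the real keys via `lobster_embed --help`:

```shell
# Hypothetical Hydra overrides -- confirm key names with `lobster_embed --help`
lobster_embed data.path_to_fasta="my_sequences.fasta" checkpoint="path/to/checkpoint.ckpt"
```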
### Training

The entrypoint `lobster_train` is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_train --help` or by looking in the `src/lobster/hydra_config` directory.
To train an MLM on a fasta file of sequences on an interactive GPU node, `cd` into the root directory of this repo and run the `lobster_train` entrypoint with your desired Hydra overrides.
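For instance (a sketch only; the override names `data.path_to_fasta`, `logger`, and `paths.root_dir` are assumptions, so check them against `lobster_train --help` or `src/lobster/hydra_config`):

```shell
# Hypothetical overrides -- confirm key names with `lobster_train --help`
lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
```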
## Contributing <a name="contributing"></a>

Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters, and our open-source efforts are a labor of love because we care deeply about open science and scientific progress.