Commit e21294a (parent 5ea5e44): readme
1 file changed: README.md (+21 −18)
`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdelsalam Ismail](https://github.com/ayaabdelsalam91), [Joseph Kleinhenz](https://github.com/kleinhenz) and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).

This repository contains training code and access to pre-trained language models for biological sequence data.

## Usage

<!---
image credit: Amy Wang
--->
- [Install instructions](#install)
- [Models](#main-models)
- [Notebooks](#notebooks)
- [Training and inference](#training)
- [Contributing](#contributing)
</details>

## Why you should use LBSTER <a name="why-use"></a>
* LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
* LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems, and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
* LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
* LBSTER supports concepts; we have a concept-bottleneck protein language model we refer to as CB-LBSTER, which supports 718 concepts.
## Citations <a name="citations"></a>
If you use the code and/or models, please cite the relevant papers.

For the `lbster` code base cite: [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108)

```bibtex
@article{Frey2024.05.14.594108,
  author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
  title = {Cramming Protein Language Model Training in 24 GPU Hours},
  elocation-id = {2024.05.14.594108},
  year = {2024},
  doi = {10.1101/2024.05.14.594108},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
  eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
  journal = {bioRxiv}
}
```
For the `cb-lbster` code base cite: [Concept Bottleneck Language Models for Protein Design](https://arxiv.org/abs/2411.06090)

```bibtex
@article{ismail2024conceptbottlenecklanguagemodels,
  title={Concept Bottleneck Language Models for Protein Design},
  author={Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
  year={2024},
  eprint={2411.06090},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2411.06090},
}
```

## Install <a name="install"></a>
Clone the repo, `cd` into it, and run `mamba env create -f env.yml`.
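Spelled out as a shell session — a sketch, not part of the repo's docs; the clone URL and the environment name inside `env.yml` are assumptions, so substitute the actual values:

```shell
# Hypothetical clone URL; use the actual repository location.
git clone https://github.com/prescient-design/lobster.git
cd lobster

# Create the environment from the repo's env.yml, then activate it.
# The environment name "lobster" is an assumption; check env.yml.
mamba env create -f env.yml
mamba activate lobster
```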
Check out [jupyter notebook tutorial](notebooks/01-inference.ipynb) for example …

Check out [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for an example of how to intervene on different concepts with our concept-bottleneck model class.

## Training and inference <a name="training"></a>

### Embedding
The entrypoint `lobster_embed` is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_embed --help` or by looking in the `src/lobster/hydra_config` directory.
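As a sketch, the paragraph above might look like this at the prompt; the override keys are assumptions borrowed from the `lobster_train` example later in this README, so check `lobster_embed --help` for the actual config structure:

```shell
# List the configurable parameters (Hydra prints the resolved config).
lobster_embed --help

# Hypothetical invocation: embed sequences from a FASTA file.
# The data.path_to_fasta / logger / paths.root_dir keys mirror the
# lobster_train example in this README and may differ for lobster_embed.
lobster_embed data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
```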
```python
model.naturalness(sequences)
model.likelihood(sequences)
```

### Training from scratch
The entrypoint `lobster_train` is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running `lobster_train --help` or by looking in the `src/lobster/hydra_config` directory.

To train an MLM on a FASTA file of sequences on an interactive GPU node, `cd` into the root directory of this repo and run:
```bash
lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
```

## Contributing <a name="contributing"></a>
Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters; our open-source efforts are a labor of love because we care deeply about open science and scientific progress.

### Install dev requirements and pre-commit hooks
