
Commit e0dad8f

ismaia11 authored and kleinhenz committed

add cb-plm

1 parent 40ec9d7 commit e0dad8f

File tree

136 files changed: +176001 -1724 lines

.gitattributes (+1)

@@ -0,0 +1 @@
+*.ckpt filter=lfs diff=lfs merge=lfs -text

.gitignore (+1)

@@ -25,3 +25,4 @@ docs/_build
 scripts/*.fasta
 scripts/fastas/*.fasta
 scripts/combined_db*
+*_play.py

README.md (+56 -11)
@@ -1,8 +1,8 @@
 # LBSTER 🦞
 **L**anguage models for **B**iological **S**equence **T**ransformation and **E**volutionary **R**epresentation

-## A language model library for rapid pre-training from scratch.
-`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).
+
+`lobster` is a "batteries included" language model library for proteins and other biological sequences. Led by [Nathan Frey](https://github.com/ncfrey), [Taylor Joren](https://github.com/taylormjs), [Aya Abdelsalam Ismail](https://github.com/ayaabdelsalam91), and [Allen Goodman](https://github.com/0x00b1), with many valuable contributions from [Contributors](docs/CONTRIBUTORS.md) across [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design).

 This repository contains code and access to pre-trained language models for biological sequence data.
@@ -13,27 +13,25 @@ image credit: Amy Wang
 <img src="assets/lobster.png" width=200px>
 </p>

-## Notice: Alpha Release
-This is an alpha release. The API is subject to change and the documentation is incomplete.
-*LBSTER is a work-in-progress. Contributions and feedback are encouraged!*

 <details open><summary><b>Table of contents</b></summary>

 - [Why you should use LBSTER](#why-use)
 - [Citations](#citations)
 - [Install instructions](#install)
 - [Models](#main-models)
+- [Notebooks](#notebooks)
 - [Usage](#usage)
 </details>

 ## Why you should use LBSTER <a name="why-use"></a>
 * LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
 * LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the [Frey Lab](https://ncfrey.github.io/) at [Prescient Design, Genentech](https://www.gene.com/scientists/our-scientists/prescient-design). The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
 * LBSTER is built with [beignet](https://github.com/Genentech/beignet/tree/main), a standard library for biological research, and integrated with [cortex](https://github.com/prescient-design/cortex/tree/main), a modular framework for multitask modeling, guided generation, and multi-modal models.
-
+* LBSTER also supports concepts: CB-LBSTER is a concept-bottleneck protein language model that supports 718 concepts (see the sketch after the model list below).
 ## Citations <a name="citations"></a>
 If you use the code and/or models, please cite the relevant papers.
-For the `lbster` code base:
+For the `lbster` code base, cite [Cramming Protein Language Model Training in 24 GPU Hours](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108):
 ```bibtex
 @article{Frey2024.05.14.594108,
 author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
@@ -50,6 +48,22 @@ For the `lbster` code base:
 ```


+<!-- For the `cb-lbster` code base, cite [Concept Bottleneck Protein Language Models](https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108):
+```bibtex
+@article{Frey2024.05.14.594108,
+author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
+title = {Cramming Protein Language Model Training in 24 GPU Hours},
+elocation-id = {2024.05.14.594108},
+year = {2024},
+doi = {10.1101/2024.05.14.594108},
+publisher = {Cold Spring Harbor Laboratory},
+URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
+eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
+journal = {bioRxiv}
+}
+``` -->
 ## Install <a name="install"></a>
 Clone the repo, cd into it, and run `mamba env create -f env.yml`.
 Then, from the root of the repo, run:
@@ -58,21 +72,52 @@ pip install -e .
 ```
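A minimal post-install sanity check, sketched here as a suggestion; it assumes only the import paths that appear elsewhere in this commit:

```python
# Post-install smoke test: confirm the package and model classes import.
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

print("ok:", LobsterPMLM.__name__, LobsterPCLM.__name__, LobsterCBMPMLM.__name__)
```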

 ## Main models you should use <a name="main-models"></a>
+
+### Pretrained Models
+
+#### Masked LMs
+| Shorthand | #params | Dataset | Description | Model checkpoint |
+|-----------|---------|---------|-------------|------------------|
+| Lobster_24M | 24M | UniRef50 | 24M-parameter protein masked language model trained on UniRef50 | [lobster_24M](https://huggingface.co/asalam91/lobster_24M) |
+| Lobster_150M | 150M | UniRef50 | 150M-parameter protein masked language model trained on UniRef50 | [lobster_150M](https://huggingface.co/asalam91/lobster_150M) |
+
+#### CB LMs
+| Shorthand | #params | Dataset | Description | Model checkpoint |
+|-----------|---------|---------|-------------|------------------|
+| cb_Lobster_24M | 24M | UniRef50 + SwissProt | 24M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_24M](https://huggingface.co/asalam91/cb_lobster_24M) |
+| cb_Lobster_150M | 150M | UniRef50 + SwissProt | 150M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_150M](https://huggingface.co/asalam91/cb_lobster_150M) |
+| cb_Lobster_650M | 650M | UniRef50 + SwissProt | 650M-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_650M](https://huggingface.co/asalam91/cb_lobster_650M) |
+| cb_Lobster_3B | 3B | UniRef50 + SwissProt | 3B-parameter concept-bottleneck model for proteins with 718 concepts | [cb_lobster_3B](https://huggingface.co/asalam91/cb_lobster_3B) |
 ### Loading a pre-trained model
 ```python
-from lobster.model import LobsterPMLM, LobsterPCLM
-masked_language_model = LobsterPMLM.load_from_checkpoint(<path to ckpt>)
+from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM
+masked_language_model = LobsterPMLM("asalam91/lobster_mlm_24M")
+concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
 causal_language_model = LobsterPCLM.load_from_checkpoint(<path to ckpt>)
 ```
 3D, cDNA, and dynamic models use the same classes.

-NOTE: Pre-trained model checkpoints *may* be included in future releases!
 **Models**
 * LobsterPMLM: masked language model (BERT-style encoder-only architecture)
+* LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
 * LobsterPCLM: causal language model (Llama-style decoder-only architecture)
 * LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)
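To make the concept-bottleneck design concrete, here is a minimal sketch of the forward pass such a model implies: the encoder's token representations are squeezed through a layer of interpretable concept activations before a linear decoder reconstructs the tokens. The module and hyperparameter names (`ConceptBottleneckMLM`, `d_model`, `n_concepts`) are illustrative assumptions, not LobsterCBMPMLM's actual internals.

```python
# A minimal sketch of a concept-bottleneck masked LM; names and sizes are
# hypothetical, not the real LobsterCBMPMLM implementation.
import torch
import torch.nn as nn

class ConceptBottleneckMLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 408, n_concepts: int = 718):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Bottleneck: project token representations onto interpretable concepts.
        self.concept_head = nn.Linear(d_model, n_concepts)
        # Linear decoder maps concept activations back to token logits.
        self.decoder = nn.Linear(n_concepts, vocab_size)

    def forward(self, token_ids: torch.Tensor):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        concepts = self.concept_head(h)          # (batch, seq, n_concepts)
        logits = self.decoder(concepts)          # (batch, seq, vocab)
        return logits, concepts

# Training would combine a masked-LM loss on `logits` with a supervised loss
# tying `concepts` to annotated properties (e.g., from SwissProt).
```

Because the decoder sees only the concept activations, editing those activations directly is what makes interventions like `intervene_on_sequences` (used in `model_testing/intervene.py` below) possible.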
+## Notebooks <a name="notebooks"></a>
+
+### Representation learning
+
+Check out the [jupyter notebook tutorial](notebooks/01-inference.ipynb) for an example of how to extract embedding representations from the different models.
+
+### Concept Interventions
+
+Check out the [jupyter notebook tutorial](notebooks/02-intervention.ipynb) for an example of how to intervene on different concepts with our concept-bottleneck model class.
 ## Usage <a name="usage"></a>

 ### Embedding

env.yml (+2 -1)

@@ -1,4 +1,4 @@
-name: lobster
+name: lobster_public
 channels:
 - anaconda
 - pytorch
@@ -8,6 +8,7 @@ dependencies:
 - pip
 - python==3.10.*
 - setuptools
+- git-lfs
 - pip:
   - -r requirements.in
   - -r requirements-dev.in

model_testing/inference.py (+42)

@@ -0,0 +1,42 @@
+# Lobster Model Inference
+
+import torch
+from lobster.model import LobsterPMLM, LobsterCBMPMLM
+
+# Define the test protein sequence
+test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
+
+# Determine the device
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the LobsterPMLM model
+lobster = LobsterPMLM("asalam91/lobster_24M").to(device)
+lobster.eval()
+
+# Get MLM representation
+mlm_representation = lobster.sequences_to_latents([test_protein])[-1]
+cls_token_mlm_representation = mlm_representation[:, 0, :]
+pooled_mlm_representation = torch.mean(mlm_representation, dim=1)
+
+# Load the LobsterCBMPMLM model
+cb_lobster = LobsterCBMPMLM("asalam91/cb_lobster_24M").to(device)
+cb_lobster.eval()
+
+# Get CB MLM representation
+cb_mlm_representation = cb_lobster.sequences_to_latents([test_protein])[-1]
+cls_token_cb_mlm_representation = cb_mlm_representation[:, 0, :]
+pooled_cb_mlm_representation = torch.mean(cb_mlm_representation, dim=1)
+
+# Get protein concepts
+test_protein_concepts = cb_lobster.sequences_to_concepts([test_protein])[-1]
+test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0]  # all of the known concepts are the same for all tokens
+test_protein_concepts_unknown_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1]
+
+# Print results
+print("CLS token MLM representation:", cls_token_mlm_representation.shape)
+print("Pooled MLM representation:", pooled_mlm_representation.shape)
+print("CLS token CB MLM representation:", cls_token_cb_mlm_representation.shape)
+print("Pooled CB MLM representation:", pooled_cb_mlm_representation.shape)
+print("Test protein concepts:", test_protein_concepts.shape)
+print("Test protein concepts embedding:", test_protein_concepts_emb.shape)
+print("Test protein unknown concepts embedding:", test_protein_concepts_unknown_emb.shape)

model_testing/intervene.py (+19)

@@ -0,0 +1,19 @@
+import torch
+from lobster.model import LobsterCBMPMLM
+import Levenshtein
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load the LobsterCBMPMLM model
+cb_lobster = LobsterCBMPMLM("asalam91/cb_lobster_24M").to(device)
+cb_lobster.eval()
+print(cb_lobster.list_supported_concept())
+
+concept = "gravy"
+test_protein = "MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF"
+
+# Push the GRAVY (hydropathy) concept down with at most 5 edits
+[new_protein] = cb_lobster.intervene_on_sequences([test_protein], concept, edits=5, intervention_type="negative")
+
+print(new_protein)
+print(Levenshtein.distance(test_protein, new_protein))
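A natural follow-up, sketched here using only the calls already shown above: sweep the `edits` budget and watch how far the negative intervention moves the sequence.

```python
# Sweep the edit budget for the same concept and measure sequence drift.
for edits in (1, 5, 10, 20):
    [edited] = cb_lobster.intervene_on_sequences(
        [test_protein], concept, edits=edits, intervention_type="negative"
    )
    print(f"edits={edits}: distance =", Levenshtein.distance(test_protein, edited))
```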

notebooks/01-inference.ipynb (+74)

@@ -0,0 +1,74 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Lobster Model Inference Notebook\n",
+    "\n",
+    "import torch\n",
+    "from lobster.model import LobsterPMLM, LobsterCBMPMLM\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the test protein sequence\n",
+    "test_protein = \"MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF\"\n",
+    "\n",
+    "# Determine the device\n",
+    "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the LobsterPMLM model\n",
+    "lobster = LobsterPMLM(\"asalam91/lobster_24M\").to(device)\n",
+    "lobster.eval()\n",
+    "\n",
+    "# Get MLM representation\n",
+    "mlm_representation = lobster.sequences_to_latents([test_protein])[-1]\n",
+    "cls_token_mlm_representation = mlm_representation[:, 0, :]\n",
+    "pooled_mlm_representation = torch.mean(mlm_representation, dim=1)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Load the LobsterCBMPMLM model\n",
+    "cb_lobster = LobsterCBMPMLM(\"asalam91/cb_lobster_24M\").to(device)\n",
+    "cb_lobster.eval()\n",
+    "\n",
+    "# Get CB MLM representation\n",
+    "cb_mlm_representation = cb_lobster.sequences_to_latents([test_protein])[-1]\n",
+    "cls_token_cb_mlm_representation = cb_mlm_representation[:, 0, :]\n",
+    "pooled_cb_mlm_representation = torch.mean(cb_mlm_representation, dim=1)\n",
+    "\n",
+    "# Get protein concepts\n",
+    "test_protein_concepts = cb_lobster.sequences_to_concepts([test_protein])[-1]\n",
+    "test_protein_concepts_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1][0]  # all of the known concepts are the same for all tokens\n",
+    "test_protein_concepts_unknown_emb = cb_lobster.sequences_to_concepts_emb([test_protein])[-1]"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

notebooks/02-intervention.ipynb (+91)

@@ -0,0 +1,91 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "5476acf8",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "736f4148-e48d-40da-930b-f73560f048c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from lobster.model import LobsterCBMPMLM\n",
+    "import Levenshtein"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "488c8208-7378-4846-b539-c673cbcf704a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+    "\n",
+    "# Load the LobsterCBMPMLM model\n",
+    "cb_lobster = LobsterCBMPMLM(\"asalam91/cb_lobster_24M\").to(device)\n",
+    "cb_lobster.eval()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5a6bb60c-a116-43c7-8ba2-362c9d2b210d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cb_lobster.list_supported_concept()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50615bfa-372e-4ad2-bf31-585d3861cb9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "concept = \"gravy\"\n",
+    "test_protein = \"MGAGASAEEKHSRELEKKLKEDAEKDARTVKLLLLGAGESGKSTIVKQMKIIHQDGYSLEECLEFIAIIYGNTLQSILAIVRAMTTLNIQYGDSARQDDARKLMHMADTIEEGTMPKEMSDIIQRLWKDSGIQACFERASEYQLNDSAGYYLSDLERLVTPGYVPTEQDVLRSRVKTTGIIETQFSFKDLNFRMFDVGGQRSERKKWIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHLFNSICNHRYFATTSIVLFLNKKDVFFEKIKKAHLSICFPDYDGPNTYEDAGNYIKVQFLELNMRRDVKEIYSHMTCATDTQNVKFVFDAVTDIIIKENLKDCGLF\"\n",
+    "\n",
+    "[new_protein] = cb_lobster.intervene_on_sequences([test_protein], concept, edits=5, intervention_type=\"negative\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb353041-bd72-4c35-bbdd-c820ef8e7af0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(new_protein)\n",
+    "print(Levenshtein.distance(test_protein, new_protein))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

py.typed

Whitespace-only changes.
