Merge remote-tracking branch 'origin/main' into feature-frontend
Saddler, Trey (NIH/NIEHS) [C] committed May 3, 2024
2 parents 5336b8a + d6bcae8 commit 40f6be5
Showing 24 changed files with 91 additions and 14,647 deletions.
33 changes: 16 additions & 17 deletions README.md
@@ -2,22 +2,21 @@ NOTE: Development of this project will be moving to <https://github.com/NIEHS/To

# ToxPipe: Semi-autonomous AI integration of diverse toxicological data streams

List of participants and affiliations:

- Trey Saddler, NIEHS/DTT (Team Leader)
- Parker Combs, NIEHS/DTT
- William Gardner, UW-Madison
- Olawale Ogundeji, University of Leeds
- Dr. Yixing Han, NHGRI
- Dr. Virginie Grosboillot, University of Ljubljana (Slovenia)
- Dr. Grzegorz Boratyn, NCBI

Ad-hoc team members:

- Mike Conway, NIEHS/DTT
- Dr. Kamel Mansouri, NIEHS/DTT
- Dr. Daniel Zilber, NIEHS/DTT
- Dr. Scott Auerbach, NIEHS/DTT
ToxPipe is an application that makes use of large language models (LLMs), [Langchain](https://python.langchain.com/docs/get_started/introduction), and various tools and data sources to answer toxicological queries about chemicals. ToxPipe currently pulls information from [PubMed](https://pubmed.ncbi.nlm.nih.gov/), [PubChem](https://pubchem.ncbi.nlm.nih.gov/), [Semantic Scholar](https://www.semanticscholar.org/), [RDKit](https://www.rdkit.org/), and is inspired by and adapted from [ChemCrow](https://github.com/ur-whitelab/chemcrow-public).

## Contributors

| Name | Affiliation | Role |
| ------------------------ | ---------------------------------- | ------------- |
| Trey Saddler | NIEHS/DTT | Team Lead |
| Parker Combs | NIEHS/DTT | Tech Lead |
| Dr. Virginie Grosboillot | University of Ljubljana (Slovenia) | |
| Dr. Grzegorz Boratyn | NCBI | |
| Dr. Yixing Han | NHGRI | |
| Dr. David Li | NIA | |
| Olawale Ogundeji | University of Leeds | |
| Mike Conway | NIEHS/DTT | Ad-hoc member |
| Dr. Scott Auerbach | NIEHS/DTT | Ad-hoc member |

## Project Goals

@@ -46,7 +45,7 @@ In order to speed up development during the codeathon, we will be building off o

### Deployment

Our approach is to make ToxPipe easy to deploy by using Docker Compose. This will allow others to easily deploy and adapt the system for their needs.
Our approach is to make ToxPipe easy to deploy by using Docker. This will allow others to easily deploy and adapt the system for their needs.

#### Deploy Docker

58 changes: 0 additions & 58 deletions app/agents/prompts.py

This file was deleted.

24 changes: 16 additions & 8 deletions app/agents/prompts_chem.py
@@ -1,7 +1,7 @@
# flake8: noqa
PREFIX = """
You are an expert chemist and your task is to respond to the question or
solve the problem to the best of your ability using the provided tools.
solve the problem to the best of your ability using the provided tools.
"""

FORMAT_INSTRUCTIONS = """
@@ -11,13 +11,13 @@
Complete format:
Thought: (reflect on your progress and decide what to do next)
Thought: (reflect on your progress and decide what to do next, using the output of the previous action as a guide)
Action: (the action name, should be one of [{tool_names}])
Action Input: (the input string to the action)
Action Input: (the input string to the action. Do not include the Thought itself as an action input)
OR
Final Answer: (the final answer to the original input question)
Final Answer: (the final answer to the original input question after using the appropriate tools)
"""

QUESTION_PROMPT = """
@@ -26,15 +26,23 @@
{tool_strings}
Use the tools provided, using the most specific tool available for each action.
Your final answer should contain all information necessary to answer the question and subquestions.
IMPORTANT: Your first step is to check the following, in this order, and plan your steps accordingly:
In particular, if the provided question asks about an action mechanism, assays, experiments, literature, or research, you MUST use the LiteratureSearch tool.
If you, at any point, find that a LiteratureSearch is required to answer the question, you MUST use the LiteratureSearch tool.
Your final answer should contain all information necessary to answer the question and subquestions. If you are asked to perform multiple tasks or are asked multiple questions, you should provide a final answer for each task.
IMPORTANT: Your first step is to evaluate the following numbered questions, in this order, and plan your steps accordingly. Do not skip any steps.
1. Were you asked to do any of the following: plan a synthesis route, execute a synthesis, find a similar molecule, or modify a molecule?
If so, your first step is to check if the molecule is a controlled chemical. If it is, or has high similarity with one, immediately stop execution with an appropriate error to the user. Do not continue.
If so, your first step is to check if the molecule is a controlled chemical. If it is a controlled chemical, then immediately stop execution with an appropriate error to the user. If it is not a controlled chemical, then check if the molecule has high similarity (i.e., similarity is greater than 0.35) with a controlled chemical. If it is a controlled chemical, then immediately stop execution with an appropriate error to the user.
If it is not a controlled chemical, then check if it has high similarity (i.e., similarity is greater than 0.35) with a controlled chemical.
If the molecule has high similarity to a controlled chemical, immediately stop execution with an appropriate error to the user. It is okay to continue if the molecule is not a controlled chemical or if the molecule has only a low similarity (i.e., less than or equal to 0.35) to a controlled chemical.
2. Does the question involve any molecules? If so, as a first step, check if any are controlled chemicals. If any are, include a warning in your final answer.
3. Were you asked to plan a synthesis route? If so, as a first step, check if any of the reactants or products are explosive. If any are, include a warning in your final answer.
4. Were you asked to execute a synthesis route? If so, check if any of the reactants or products are explosive. If any are, ask the user for permission to continue.
Do not skip these steps.
Do not skip steps 1, 2, 3, and 4. If the molecule is not a controlled chemical, does not have high similarity to a controlled chemical, and is not explosive, then ensure you thoroughly answer everything asked for in the following question.
If you, at any point, used the LiteratureSearch tool, you must include citations with each source's author(s), title, date of publication, journal of publication, and DOI, URL, or PMID for ALL the sources you used in your final answer.
Question: {input}
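
The first numbered check in the prompt above is written in prose for the LLM to follow. As a reading aid, the same decision can be sketched in plain Python. This is only an illustration under stated assumptions: `Screening` and `screen_molecule` are hypothetical names that do not appear in the commit, and only the 0.35 threshold is taken from the prompt text.

```python
from dataclasses import dataclass

# Threshold quoted in the prompt text; everything else here is hypothetical.
SIMILARITY_THRESHOLD = 0.35


@dataclass
class Screening:
    allowed: bool
    reason: str


def screen_molecule(is_controlled: bool, max_similarity: float) -> Screening:
    """Mirror step 1 of the prompt: stop on a controlled or highly similar molecule."""
    if is_controlled:
        return Screening(False, "molecule is a controlled chemical")
    if max_similarity > SIMILARITY_THRESHOLD:
        return Screening(False, "molecule is highly similar to a controlled chemical")
    # Similarity less than or equal to the threshold counts as low.
    return Screening(True, "not controlled and only low similarity")
```

A similarity of exactly 0.35 is treated as low, matching the "less than or equal to 0.35" wording in the prompt.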
32 changes: 0 additions & 32 deletions app/agents/s2.py

This file was deleted.

11 changes: 2 additions & 9 deletions app/agents/tools.py
@@ -1,10 +1,6 @@
import os

from langchain import agents
from langchain.base_language import BaseLanguageModel

from .tp_tools import *
from .s2 import SemanticSearch

def make_tools(llm: BaseLanguageModel, verbose=True):
    all_tools = [
@@ -15,11 +11,8 @@ def make_tools(llm: BaseLanguageModel, verbose=True):
        SMILES2Weight(),
        FuncGroups(),
        ExplosiveCheck(),
        #ControlChemCheck(),
        Scholar2ResultLLM(llm=llm),
        ControlChemCheck(),
        SemanticSearch()
        #SafetySummary(llm=llm),
        # LitSearch(llm=llm, verbose=verbose),
        Scholar2ResultLLM(llm=llm),
        SafetySummary(llm=llm)
    ]
    return all_tools
20 changes: 0 additions & 20 deletions app/agents/toxpipe.py
@@ -1,9 +1,6 @@
from typing import Optional

from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain.chains import LLMChain
from rmrkl import ChatZeroShotAgent, RetryAgentExecutor

@@ -59,23 +56,6 @@ def __init__(
        rephrase = ChatPromptTemplate.from_template(REPHRASE_TEMPLATE)
        self.rephrase_chain = LLMChain(prompt=rephrase, llm=self.llm)

        """
        self.agent_executor_gene = RetryAgentExecutor.from_agent_and_tools(
            tools=self.tools,
            agent=ChatZeroShotAgent.from_llm_and_tools(
                self.llm,
                self.tools,
                suffix=GENE_SUFFIX,
                format_instructions=GENE_FORMAT_INSTRUCTIONS,
                question_prompt=GENE_QUESTION_PROMPT,
            ),
            verbose=True,
            max_iterations=max_iterations,
        )
        rephrase = ChatPromptTemplate.from_template(GENE_REPHRASE_TEMPLATE)
        self.rephrase_chain = LLMChain(prompt=rephrase, llm=self.llm)
        """

    def run(self, prompt):
        outputs = self.agent_executor_chem({"input": prompt})
        return outputs["output"]
6 changes: 3 additions & 3 deletions app/agents/tp_tools/rdkit.py
@@ -1,6 +1,6 @@
from langchain.tools import BaseTool
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdMolDescriptors
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

from .utils import *

@@ -58,7 +58,7 @@ def __init__(
        super().__init__()

    def _run(self, smiles: str) -> str:
        mol = Chem.MolFromSmiles(smiles)
        mol = Chem.MolFromSmiles(smiles.rstrip())
        if mol is None:
            return "Invalid SMILES string"
        mol_weight = rdMolDescriptors.CalcExactMolWt(mol)
29 changes: 21 additions & 8 deletions app/agents/tp_tools/safety.py
@@ -25,7 +25,11 @@ def query2smiles(
) -> str:
    if url is None:
        url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/{}"

    # Sanitize query
    query = re.sub("Action.*", "", re.sub("Thought: .*", "", query)).rstrip()
    r = requests.get(url.format(query, "property/IsomericSMILES/JSON"))

    # convert the response to a json object
    data = r.json()
    # return the SMILES string
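
The sanitizing line added in this hunk appears intended to strip agent scratchpad text (`Thought: ...`, `Action...`) that leaks into the tool input before the query is placed in the PubChem URL. A minimal standalone sketch of the same two substitutions follows; the function name `sanitize_query` and the sample input are assumptions made for illustration and are not part of the commit.

```python
import re


def sanitize_query(query: str) -> str:
    """Drop leaked 'Thought: ...' and 'Action...' text, then trim trailing whitespace."""
    # "." does not match a newline, so each pattern removes text only up to
    # the end of the line on which it starts.
    without_thought = re.sub("Thought: .*", "", query)
    without_action = re.sub("Action.*", "", without_thought)
    return without_action.rstrip()


if __name__ == "__main__":
    leaked = "aspirin\nThought: I should look this up\nAction: SomeTool"
    print(repr(sanitize_query(leaked)))
```

Both patterns are case-sensitive and unanchored, so a chemical name that itself contains the text `Action` would be truncated at that point.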
@@ -295,8 +299,6 @@ class SimilarControlChemCheck(BaseTool):
    def _run(self, smiles: str) -> str:
        """Checks max similarity between compound and controlled chemicals.
        Input SMILES string."""

        #data_path = pkg_resources.resource_filename("chemcrow", "data/chem_wep_smi.csv")
        data_path = os.path.join(dir_path, "data/chem_wep_smi.csv")

        cw_df = pd.read_csv(data_path)
@@ -307,8 +309,7 @@ def _run(self, smiles: str) -> str:

            max_sim = (
                cw_df["smiles"]
                .apply(lambda x: tanimoto(smiles, x))
                .replace("Error: Not a valid SMILES string", 0.0)
                .apply(lambda x: self.tanimoto(smiles, x))
                .max()
            )
            if max_sim > 0.35:
@@ -320,9 +321,16 @@ def _run(self, smiles: str) -> str:
                return (
                    f"{smiles} has a low similarity "
                    f"({max_sim:.4}) to a known controlled chemical."
                    "This is substance is safe, you may proceed with the original task."
                )
        except:
            return "Tool error."

    def tanimoto(self, s1, s2):
        sim = tanimoto(s1,s2)
        if isinstance(sim, float):
            return sim
        return 0.0

    async def _arun(self, query: str) -> str:
        """Use the tool asynchronously."""
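
The new `tanimoto` method in this hunk replaces the earlier `.replace("Error: Not a valid SMILES string", 0.0)` step: any result that is not a float is treated as zero similarity, so the later `.max()` only ever compares numbers. The sketch below shows that pattern without pandas or RDKit. `fake_tanimoto` is a hypothetical stand-in for the module-level `tanimoto` helper, and its return values are made up for illustration.

```python
from typing import Callable, Iterable, Union

SimilarityResult = Union[float, str]


def safe_similarity(raw: SimilarityResult) -> float:
    """Treat any non-float result (such as an error string) as zero similarity."""
    return raw if isinstance(raw, float) else 0.0


def max_similarity(
    query: str,
    references: Iterable[str],
    similarity: Callable[[str, str], SimilarityResult],
) -> float:
    """Return the highest coerced similarity between query and any reference."""
    return max(
        (safe_similarity(similarity(query, ref)) for ref in references),
        default=0.0,
    )


def fake_tanimoto(s1: str, s2: str) -> SimilarityResult:
    # Hypothetical stand-in: returns an error string for empty input,
    # otherwise an arbitrary float.
    if not s1 or not s2:
        return "Error: Not a valid SMILES string"
    return 1.0 if s1 == s2 else 0.25
```

With this shape, one unparsable row in the reference list lowers nothing and raises nothing; it simply contributes 0.0 to the maximum.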
@@ -331,14 +339,15 @@ async def _arun(self, query: str) -> str:

class ControlChemCheck(BaseTool):
    name = "ControlChemCheck"
    description = "Input CAS number, True if molecule is a controlled chemical."
    #description = "Input CAS number, True if molecule is a controlled chemical."
    description = "Input: a chemical identifier such as a CASRN (CAS number), chemical name, or SMILES. Output: a statement saying if the input molecule is or is not a controlled chemical."
    similar_control_chem_check = SimilarControlChemCheck()

    def _run(self, query: str) -> str:
        """Checks if compound is a controlled chemical. Input CAS number."""
        #data_path = pkg_resources.resource_filename("chemcrow", "data/chem_wep_smi.csv")
        data_path = os.path.join(dir_path, "data/chem_wep_smi.csv")
        cw_df = pd.read_csv(data_path)

        try:
            if is_smiles(query):
                query_esc = re.escape(query)
@@ -361,8 +370,12 @@ def _run(self, query: str) -> str:
                    "controlled chemicals."
                )
            else:
                # Get smiles of CAS number
                smi = query2smiles(query)
                smi = query
                issmiles = is_smiles(query)
                if issmiles == False:
                    # Get smiles of CAS number if not already SMILES
                    smi = query2smiles(query)

                # Check similarity to known controlled chemicals
                return self.similar_control_chem_check._run(smi)
