Merge remote-tracking branch 'origin/main' into feature-frontend

NCBI-Codeathons · May 3, 2024 · 40f6be5
2 parents 5336b8a + d6bcae8
commit 40f6be5

Showing 24 changed files with 91 additions and 14,647 deletions.
diff --git a/README.md b/README.md
@@ -2,22 +2,21 @@ NOTE: Development of this project will be moving to <https://github.com/NIEHS/To
 
 # ToxPipe: Semi-autonomous AI integration of diverse toxicological data streams
 
-List of participants and affiliations:
-
-- Trey Saddler, NIEHS/DTT (Team Leader)
-- Parker Combs, NIEHS/DTT
-- William Gardner, UW-Madison
-- Olawale Ogundeji, University of Leeds
-- Dr. Yixing Han, NHGRI
-- Dr. Virginie Grosboillot, University of Ljubljana (Slovenia)
-- Dr. Grzegorz Boratyn, NCBI
-
-Ad-hoc team members:
-
-- Mike Conway, NIEHS/DTT
-- Dr. Kamel Mansouri, NIEHS/DTT
-- Dr. Daniel Zilber, NIEHS/DTT
-- Dr. Scott Auerbach, NIEHS/DTT
+ToxPipe is an application that uses large language models (LLMs), [LangChain](https://python.langchain.com/docs/get_started/introduction), and various tools and data sources to answer toxicological queries about chemicals. ToxPipe currently pulls information from [PubMed](https://pubmed.ncbi.nlm.nih.gov/), [PubChem](https://pubchem.ncbi.nlm.nih.gov/), [Semantic Scholar](https://www.semanticscholar.org/), and [RDKit](https://www.rdkit.org/). It is inspired by and adapted from [ChemCrow](https://github.com/ur-whitelab/chemcrow-public).
+
+## Contributors
+
+| Name                     | Affiliation                        | Role          |
+| ------------------------ | ---------------------------------- | ------------- |
+| Trey Saddler             | NIEHS/DTT                          | Team Lead     |
+| Parker Combs             | NIEHS/DTT                          | Tech Lead     |
+| Dr. Virginie Grosboillot | University of Ljubljana (Slovenia) |               |
+| Dr. Grzegorz Boratyn     | NCBI                               |               |
+| Dr. Yixing Han           | NHGRI                              |               |
+| Dr. David Li             | NIA                                |               |
+| Olawale Ogundeji         | University of Leeds                |               |
+| Mike Conway              | NIEHS/DTT                          | Ad-hoc member |
+| Dr. Scott Auerbach       | NIEHS/DTT                          | Ad-hoc member |
 
 ## Project Goals
 
@@ -46,7 +45,7 @@ In order to speed up development during the codeathon, we will be building off o
 
 ### Deployment
 
-Our approach is to make ToxPipe easy to deploy by using Docker Compose. This will allow others to easily deploy and adapt the system for their needs.
+Our approach is to make ToxPipe easy to deploy by using Docker. This will allow others to easily deploy and adapt the system for their needs.
 
 #### Deploy Docker
 

diff --git a/app/agents/prompts.py b/app/agents/prompts.py
diff --git a/app/agents/prompts_chem.py b/app/agents/prompts_chem.py
@@ -1,7 +1,7 @@
 # flake8: noqa
 PREFIX = """
 You are an expert chemist and your task is to respond to the question or
-solve the problem to the best of your ability using the provided tools.
+solve the problem to the best of your ability using the provided tools. 
 """
 
 FORMAT_INSTRUCTIONS = """
@@ -11,13 +11,13 @@
 
 Complete format:
 
-Thought: (reflect on your progress and decide what to do next)
+Thought: (reflect on your progress and decide what to do next, using the output of the previous action as a guide)
 Action: (the action name, should be one of [{tool_names}])
-Action Input: (the input string to the action)
+Action Input: (the input string to the action. Do not include the Thought itself as an action input)
 
 OR
 
-Final Answer: (the final answer to the original input question)
+Final Answer: (the final answer to the original input question after using the appropriate tools)
 """
 
 QUESTION_PROMPT = """
@@ -26,15 +26,23 @@
 {tool_strings}
 
 Use the tools provided, using the most specific tool available for each action.
-Your final answer should contain all information necessary to answer the question and subquestions.
 
-IMPORTANT: Your first step is to check the following, in this order, and plan your steps accordingly:
+In particular, if the provided question asks about an action mechanism, assays, experiments, literature, or research, you MUST use the LiteratureSearch tool.
+If you, at any point, find that a LiteratureSearch is required to answer the question, you MUST use the LiteratureSearch tool.
+
+Your final answer should contain all information necessary to answer the question and subquestions. If you are asked to perform multiple tasks or are asked multiple questions, you should provide a final answer for each task.
+
+IMPORTANT: Your first step is to evaluate the following numbered questions, in this order, and plan your steps accordingly. Do not skip any steps.
 1. Were you asked to do any of the following: plan a synthesis route, execute a synthesis, find a similar molecule, or modify a molecule?
-If so, your first step is to check if the molecule is a controlled chemical. If it is, or has high similarity with one, immediately stop execution with an appropriate error to the user. Do not continue.
+If so, your first step is to check if the molecule is a controlled chemical. If it is a controlled chemical, then immediately stop execution with an appropriate error to the user.
+If it is not a controlled chemical, then check if it has high similarity (i.e., similarity is greater than 0.35) with a controlled chemical.
+If the molecule has high similarity to a controlled chemical, immediately stop execution with an appropriate error to the user. It is okay to continue if the molecule is not a controlled chemical and has only a low similarity (i.e., less than or equal to 0.35) to a controlled chemical.
 2. Does the question involve any molecules? If so, as a first step, check if any are controlled chemicals. If any are, include a warning in your final answer.
 3. Were you asked to plan a synthesis route? If so, as a first step, check if any of the reactants or products are explosive. If any are, include a warning in your final answer.
 4. Were you asked to execute a synthesis route? If so, check if any of the reactants or products are explosive. If any are, ask the user for permission to continue.
-Do not skip these steps.
+Do not skip steps 1, 2, 3, and 4. If the molecule is not a controlled chemical, does not have high similarity to a controlled chemical, and is not explosive, then ensure you thoroughly answer everything asked for in the following question.
+
+If you, at any point, used the LiteratureSearch tool, you must include citations with each source's author(s), title, date of publication, journal of publication, and DOI, URL, or PMID for ALL the sources you used in your final answer.
 
 
 Question: {input}
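
To make the reply format above concrete, here is a minimal sketch of how one model reply in the Thought / Action / Action Input (or Final Answer) shape could be split into its parts. This is not the parser that `rmrkl` or LangChain actually use; `parse_step` and its return shape are hypothetical names chosen for illustration.

```python
import re


# Hypothetical helper, for illustration only: splits one model reply that
# follows the Thought / Action / Action Input (or Final Answer) format.
def parse_step(reply: str) -> dict:
    # A Final Answer ends the loop, so check for it first.
    final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
    if final:
        return {"type": "final", "answer": final.group(1).strip()}

    # Otherwise expect a tool name and the input string for that tool.
    action = re.search(r"Action:\s*(.*)", reply)
    action_input = re.search(r"Action Input:\s*(.*)", reply)
    if action and action_input:
        return {
            "type": "action",
            "tool": action.group(1).strip(),
            "input": action_input.group(1).strip(),
        }
    raise ValueError("Reply does not follow the expected format")


reply = (
    "Thought: I should check whether this is a controlled chemical.\n"
    "Action: ControlChemCheck\n"
    "Action Input: 50-00-0"
)
step = parse_step(reply)
```

A parser like this only sees what follows `Action Input:`. That may be why the revised instructions tell the model not to repeat the Thought there, since any extra text would be passed to the tool.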

diff --git a/app/agents/s2.py b/app/agents/s2.py
diff --git a/app/agents/tools.py b/app/agents/tools.py
@@ -1,10 +1,6 @@
-import os
-
-from langchain import agents
 from langchain.base_language import BaseLanguageModel
 
 from .tp_tools import *
-from .s2 import SemanticSearch
 
 def make_tools(llm: BaseLanguageModel, verbose=True):
     all_tools = [
@@ -15,11 +11,8 @@ def make_tools(llm: BaseLanguageModel, verbose=True):
         SMILES2Weight(),
         FuncGroups(),
         ExplosiveCheck(),
-        #ControlChemCheck(),
-        Scholar2ResultLLM(llm=llm),
         ControlChemCheck(),
-        SemanticSearch()
-        #SafetySummary(llm=llm),
-        # LitSearch(llm=llm, verbose=verbose),
+        Scholar2ResultLLM(llm=llm),
+        SafetySummary(llm=llm)
     ]
     return all_tools
diff --git a/app/agents/toxpipe.py b/app/agents/toxpipe.py
@@ -1,9 +1,6 @@
-from typing import Optional
-
 from dotenv import load_dotenv
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_openai import AzureChatOpenAI
-from langchain_core.output_parsers import StrOutputParser
 from langchain.chains import LLMChain
 from rmrkl import ChatZeroShotAgent, RetryAgentExecutor
 
@@ -59,23 +56,6 @@ def __init__(
         rephrase = ChatPromptTemplate.from_template(REPHRASE_TEMPLATE)
         self.rephrase_chain = LLMChain(prompt=rephrase, llm=self.llm)
 
-        """
-        self.agent_executor_gene = RetryAgentExecutor.from_agent_and_tools(
-            tools=self.tools,
-            agent=ChatZeroShotAgent.from_llm_and_tools(
-                self.llm,
-                self.tools,
-                suffix=GENE_SUFFIX,
-                format_instructions=GENE_FORMAT_INSTRUCTIONS,
-                question_prompt=GENE_QUESTION_PROMPT,
-            ),
-            verbose=True,
-            max_iterations=max_iterations,
-        )
-        rephrase = ChatPromptTemplate.from_template(GENE_REPHRASE_TEMPLATE)
-        self.rephrase_chain = LLMChain(prompt=rephrase, llm=self.llm)
-        """
-
     def run(self, prompt):
         outputs = self.agent_executor_chem({"input": prompt})
         return outputs["output"]

diff --git a/app/agents/tp_tools/rdkit.py b/app/agents/tp_tools/rdkit.py
@@ -1,6 +1,6 @@
 from langchain.tools import BaseTool
-from rdkit import Chem, DataStructs
-from rdkit.Chem import AllChem, rdMolDescriptors
+from rdkit import Chem
+from rdkit.Chem import rdMolDescriptors
 
 from .utils import *
 
@@ -58,7 +58,7 @@ def __init__(
         super().__init__()
 
     def _run(self, smiles: str) -> str:
-        mol = Chem.MolFromSmiles(smiles)
+        mol = Chem.MolFromSmiles(smiles.rstrip())
         if mol is None:
             return "Invalid SMILES string"
         mol_weight = rdMolDescriptors.CalcExactMolWt(mol)

diff --git a/app/agents/tp_tools/safety.py b/app/agents/tp_tools/safety.py
@@ -25,7 +25,11 @@ def query2smiles(
 ) -> str:
     if url is None:
         url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/{}"
+
+    # Sanitize query
+    query = re.sub("Action.*", "", re.sub("Thought: .*", "", query)).rstrip()
     r = requests.get(url.format(query, "property/IsomericSMILES/JSON"))
+
     # convert the response to a json object
     data = r.json()
     # return the SMILES string
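
The sanitizing step added in this hunk can be tried in isolation. The sketch below wraps the same regular expression in a standalone function; `sanitize_query` is a hypothetical name, and the sample input is an invented example of agent text leaking into a tool input.

```python
import re


def sanitize_query(query: str) -> str:
    # Same expression as in the diff: drop everything from "Thought: " to the
    # end of that line, then everything from "Action" to the end of that line,
    # then trim trailing whitespace (including leftover newlines).
    return re.sub("Action.*", "", re.sub("Thought: .*", "", query)).rstrip()


# Invented example of a query polluted with agent scratchpad text.
raw = "aspirin\nThought: I now need the SMILES\nAction: ControlChemCheck"
clean = sanitize_query(raw)
```

Because `.` does not match newlines by default, each substitution removes only the rest of the line it matches on. The `Action.*` pattern is also not anchored, so a query that legitimately contains the word "Action" would be truncated at that point.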
@@ -295,8 +299,6 @@ class SimilarControlChemCheck(BaseTool):
     def _run(self, smiles: str) -> str:
         """Checks max similarity between compound and controlled chemicals.
         Input SMILES string."""
-
-        #data_path = pkg_resources.resource_filename("chemcrow", "data/chem_wep_smi.csv")
         data_path = os.path.join(dir_path, "data/chem_wep_smi.csv")
 
         cw_df = pd.read_csv(data_path)
@@ -307,8 +309,7 @@ def _run(self, smiles: str) -> str:
 
             max_sim = (
                 cw_df["smiles"]
-                .apply(lambda x: tanimoto(smiles, x))
-                .replace("Error: Not a valid SMILES string", 0.0)
+                .apply(lambda x: self.tanimoto(smiles, x))
                 .max()
             )
             if max_sim > 0.35:
@@ -320,9 +321,16 @@ def _run(self, smiles: str) -> str:
                 return (
                     f"{smiles} has a low similarity "
                     f"({max_sim:.4}) to a known controlled chemical."
+                    " This substance is safe; you may proceed with the original task."
                 )
         except:
             return "Tool error."
+
+    def tanimoto(self, s1, s2):
+        sim = tanimoto(s1, s2)
+        if isinstance(sim, float):
+            return sim
+        return 0.0
 
     async def _arun(self, query: str) -> str:
         """Use the tool asynchronously."""
@@ -331,14 +339,15 @@ async def _arun(self, query: str) -> str:
 
 class ControlChemCheck(BaseTool):
     name = "ControlChemCheck"
-    description = "Input CAS number, True if molecule is a controlled chemical."
+    #description = "Input CAS number, True if molecule is a controlled chemical."
+    description = "Input: a chemical identifier such as a CASRN (CAS number), chemical name, or SMILES. Output: a statement saying if the input molecule is or is not a controlled chemical."
     similar_control_chem_check = SimilarControlChemCheck()
 
     def _run(self, query: str) -> str:
         """Checks if compound is a controlled chemical. Input CAS number."""
-        #data_path = pkg_resources.resource_filename("chemcrow", "data/chem_wep_smi.csv")
         data_path = os.path.join(dir_path, "data/chem_wep_smi.csv")
         cw_df = pd.read_csv(data_path)
+
         try:
             if is_smiles(query):
                 query_esc = re.escape(query)
@@ -361,8 +370,12 @@ def _run(self, query: str) -> str:
                     "controlled chemicals."
                 )
             else:
-                # Get smiles of CAS number
-                smi = query2smiles(query)
+                smi = query
+                issmiles = is_smiles(query)
+                if not issmiles:
+                    # Get smiles of CAS number if not already SMILES
+                    smi = query2smiles(query)
+
                 # Check similarity to known controlled chemicals
                 return self.similar_control_chem_check._run(smi)
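
The similarity check in this file combines two ideas: coerce any non-float result from the similarity function to `0.0`, then compare the maximum score against the 0.35 cutoff. The sketch below shows that logic without RDKit or pandas. `fake_tanimoto` is a stand-in written for this example, not the `tanimoto` helper imported from `.utils`, and `safe_similarity` and `max_similarity` are hypothetical names.

```python
THRESHOLD = 0.35  # cutoff used in the diff


def safe_similarity(sim_fn, s1: str, s2: str) -> float:
    # Mirrors the new wrapper method: keep float results, map anything else
    # (such as an error string) to 0.0.
    sim = sim_fn(s1, s2)
    return sim if isinstance(sim, float) else 0.0


def max_similarity(query: str, known: list, sim_fn) -> float:
    # Highest similarity between the query and any known entry.
    return max(safe_similarity(sim_fn, query, k) for k in known)


# Stand-in for illustration only: a character-set Jaccard score that returns
# an error string for empty input. It is NOT a chemical fingerprint similarity.
def fake_tanimoto(s1: str, s2: str):
    if not s1 or not s2:
        return "Error: Not a valid SMILES string"
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)


known = ["CCO", "", "c1ccccc1"]
score = max_similarity("CCN", known, fake_tanimoto)
is_high = score > THRESHOLD
```

Coercing per call, rather than replacing one specific error string after the fact, keeps the `max()` comparison purely numeric even if the similarity function returns a different kind of error value.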