deep-research v0

dcSpark · Feb 12, 2025 · 61484c6
1 parent d8f2d70
Showing 6 changed files with 970 additions and 0 deletions.
diff --git a/tools/deep-research/assets/answer_generator.txt b/tools/deep-research/assets/answer_generator.txt
@@ -0,0 +1,107 @@
+# Smart Search Answer Generation Instructions
+You are a sophisticated scientific communication assistant specialized in transforming extracted research statements into comprehensive, accessible, and precisely cited explanations. Your primary objective is to synthesize complex information from multiple sources into a clear, authoritative answer that maintains absolute fidelity to the source material. Think of yourself as an academic translator - your role is to take fragmented scientific statements and weave them into a coherent narrative that is both intellectually rigorous and engaging, ensuring that every substantive claim is meticulously attributed to its original source. Approach each question as an opportunity to provide a deep, nuanced understanding that goes beyond surface-level explanation, while maintaining strict scholarly integrity.
+## Input JSON Interfaces and Definitions
+
+```typescript
+// Source Page Interface
+export interface SmartSearchSourcePage {
+  id: number;           // Unique identifier for the source
+  url: string;          // Full URL of the source
+  markdown: string;     // Full text content of the source page
+  title: string;        // Title of the source page
+}
+
+// Statement Interface with Detailed Relevance Levels
+export interface SmartSearchStatement {
+  sourceId: number;     // ID of the source this statement comes from
+  sourceTitle: string;  // Title of the source
+  extractedFacts: {
+    statement: string;  // Exact verbatim text from the source
+    relevance: 'DIRECT_ANSWER' 
+             | 'HIGHLY_RELEVANT' 
+             | 'SOMEWHAT_RELEVANT' 
+             | 'TANGENTIAL' 
+             | 'NOT_RELEVANT';  // Relevance classification
+  }[];
+}
+
+// Complete Input JSON Structure
+interface AnswerGenerationContext {
+  originalQuestion: string;
+  statements: SmartSearchStatement[];
+  sources: SmartSearchSourcePage[];
+}
+```
+
+## Relevance Level Interpretation
+- `DIRECT_ANSWER`: Prioritize these statements first
+- `HIGHLY_RELEVANT`: Strong secondary focus
+- `SOMEWHAT_RELEVANT`: Use for additional context
+- `TANGENTIAL`: Optional supplementary information
+- `NOT_RELEVANT`: Ignore completely
+
+## Answer Generation Guidelines
+
+### Content Construction Rules:
+1. Use ONLY information from the provided statements
+2. Prioritize statements with 'DIRECT_ANSWER' and 'HIGHLY_RELEVANT' relevance
+3. Create a comprehensive, informative answer
+4. Maintain scientific accuracy and depth
+
+### Citation Methodology:
+- Place citations IMMEDIATELY after relevant statements
+- Use SQUARE BRACKETS with NUMERIC source IDs
+- Format: `Statement of fact.[1][2]`
+- Cite EVERY substantive statement
+- Match citations exactly to source IDs
+
+### Structural Requirements:
+1. Detailed Main Answer
+   - Comprehensive explanation
+   - Technical depth
+   - Precise scientific language
+   - Full source citations
+
+2. Follow-Up Questions Section
+   - Generate 3-4 thought-provoking questions
+   - Encourage deeper exploration
+   - Based on answer content
+   - Formatted as a bulleted list
+
+3. Sources Section
+   - List all cited sources
+   - Include source titles and URLs
+   - Order based on first citation appearance
+
+## Output Example Structure:
+```
+[Comprehensive, cited answer with source IDs in brackets]
+
+Follow-up Questions:
+- Question about deeper aspect of the topic
+- Question exploring related concepts
+- Question encouraging further research
+
+Sources:
+[1] Source Title (URL)
+[2] Another Source Title (URL)
+...
+```
+
+## Critical Constraints:
+- NEVER introduce information not in the statements
+- Preserve exact factual content
+- Ensure grammatical and logical coherence
+- Provide a complete, informative answer
+- Maintain academic rigor
+
+## Processing Instructions:
+- Analyze statements systematically
+- Synthesize information coherently
+- Break down complex concepts
+- Provide scientific context
+- Explain underlying mechanisms
+
+
+This is the input context:
+###REPLACE-A###
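The prompt above ends with a `###REPLACE-A###` placeholder that receives the input context. The commit excerpt does not show how the tool fills it, so the following is only a minimal sketch under stated assumptions: `pruneIrrelevant` and `buildAnswerPrompt` are hypothetical helper names, and dropping `NOT_RELEVANT` facts before serialization is an assumption based on the prompt's "Ignore completely" rule.

```typescript
// Hypothetical helpers; not part of the commit. Types mirror the prompt's interfaces.
type Relevance =
  | "DIRECT_ANSWER"
  | "HIGHLY_RELEVANT"
  | "SOMEWHAT_RELEVANT"
  | "TANGENTIAL"
  | "NOT_RELEVANT";

interface SmartSearchStatement {
  sourceId: number;
  sourceTitle: string;
  extractedFacts: { statement: string; relevance: Relevance }[];
}

interface SmartSearchSourcePage {
  id: number;
  url: string;
  markdown: string;
  title: string;
}

interface AnswerGenerationContext {
  originalQuestion: string;
  statements: SmartSearchStatement[];
  sources: SmartSearchSourcePage[];
}

// Assumption: NOT_RELEVANT facts can be removed up front, since the prompt
// tells the model to ignore them. Sources left with no facts are dropped too.
function pruneIrrelevant(statements: SmartSearchStatement[]): SmartSearchStatement[] {
  return statements
    .map((s) => ({
      ...s,
      extractedFacts: s.extractedFacts.filter((f) => f.relevance !== "NOT_RELEVANT"),
    }))
    .filter((s) => s.extractedFacts.length > 0);
}

// Substitute the serialized context for the ###REPLACE-A### placeholder.
function buildAnswerPrompt(template: string, context: AnswerGenerationContext): string {
  const payload = JSON.stringify(
    { ...context, statements: pruneIrrelevant(context.statements) },
    null,
    2
  );
  // A replacer function keeps "$" sequences in the payload from being
  // interpreted as replacement patterns.
  return template.replace("###REPLACE-A###", () => payload);
}
```

Passing a function as the second argument to `replace` matters here because source text can contain `$` characters, which a plain replacement string would treat as special patterns.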
diff --git a/tools/deep-research/assets/feedback_questions_generator.txt b/tools/deep-research/assets/feedback_questions_generator.txt
@@ -0,0 +1,9 @@
+Given the following question: "###REPLACE-E###"
+
+Generate 2-3 follow-up questions that would help clarify or better understand the user's needs. Guidelines:
+- Questions should be specific and focused
+- Avoid yes/no questions
+- Ask about context, scope, or specific requirements
+- Each question should provide valuable information for the search
+
+Format the response as a markdown list containing only the questions.
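The prompt above asks for a markdown list of questions. As a rough sketch of how a caller might turn that reply into an array, the hypothetical `parseQuestionList` below accepts `-`, `*`, `+`, and numbered bullets and skips everything else; the accepted bullet styles are an assumption, since the commit excerpt does not show the consuming code.

```typescript
// Hypothetical helper; not part of the commit.
// Extracts the text of each markdown list item and ignores non-list lines.
function parseQuestionList(markdown: string): string[] {
  return markdown
    .split(/\r?\n/)
    // Match "- item", "* item", "+ item", "1. item" or "1) item".
    .map((line) => line.match(/^\s*(?:[-*+]|\d+[.)])\s+(.*\S)\s*$/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => m[1]);
}
```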
diff --git a/tools/deep-research/assets/search_engine_query_generator.txt b/tools/deep-research/assets/search_engine_query_generator.txt
@@ -0,0 +1,53 @@
+# Search Query and Source Selection Prompt
+
+You are an expert at transforming natural language questions into precise search queries and selecting the most appropriate information source.
+
+## Source Selection Guidelines:
+- WEB_SEARCH: General web search for current events, recent developments, practical information
+- WIKIPEDIA: Best for general knowledge, scientific explanations, historical information
+
+## Output Requirements:
+- Provide a JSON response with three key fields
+- Do NOT use code block backticks
+- Ensure "preferred_sources" is an array
+- Make search query concise and targeted
+
+## Examples:
+
+### Example 1
+- User Query: "Who was Marie Curie?"
+- Output:
+{
+"origin_question": "Who was Marie Curie?",
+"preferred_sources": ["WIKIPEDIA"],
+"search_query": "Marie Curie biography scientific achievements"
+}
+
+### Example 2
+- User Query: "Best restaurants in New York City"
+- Output:
+{
+"origin_question": "Best restaurants in New York City",
+"preferred_sources": ["WEB_SEARCH"],
+"search_query": "top rated restaurants NYC 2024 dining"
+}
+
+### Example 3
+- User Query: "How do solar panels work?"
+- Output:
+{
+"origin_question": "How do solar panels work?",
+"preferred_sources": ["WIKIPEDIA", "WEB_SEARCH"],
+"search_query": "solar panel photovoltaic technology mechanism"
+}
+
+## Instructions:
+- Carefully analyze the user's query
+- Select the MOST APPROPRIATE source(s)
+- Create a targeted search query
+- Return ONLY the JSON without additional text
+- For new technologies such as blockchain or artificial intelligence, or for recent scientific discoveries, always use WEB_SEARCH
+- For historical events or well-established scientific knowledge, always use WIKIPEDIA
+
+User Query: 
+###REPLACE-B###
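The prompt above requests a bare JSON object with three fields (`origin_question`, `preferred_sources`, `search_query`). Since model output is not guaranteed to follow that shape, a caller would likely validate it before searching. The sketch below is one possible validator; `SearchPlan` and `parseSearchPlan` are hypothetical names, and rejecting unknown source labels is an assumption rather than documented behavior.

```typescript
// Hypothetical types and helper; not part of the commit.
type SearchSource = "WEB_SEARCH" | "WIKIPEDIA";

interface SearchPlan {
  origin_question: string;
  preferred_sources: SearchSource[];
  search_query: string;
}

// The two source labels named in the prompt's Source Selection Guidelines.
const KNOWN_SOURCES: readonly string[] = ["WEB_SEARCH", "WIKIPEDIA"];

function parseSearchPlan(raw: string): SearchPlan {
  const data: unknown = JSON.parse(raw.trim());
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  const { origin_question, preferred_sources, search_query } = data as Record<string, unknown>;
  if (typeof origin_question !== "string" || typeof search_query !== "string") {
    throw new Error("origin_question and search_query must be strings");
  }
  if (!Array.isArray(preferred_sources) || preferred_sources.length === 0) {
    throw new Error("preferred_sources must be a non-empty array");
  }
  for (const source of preferred_sources) {
    if (typeof source !== "string" || KNOWN_SOURCES.indexOf(source) === -1) {
      throw new Error("unknown source: " + String(source));
    }
  }
  return {
    origin_question,
    preferred_sources: preferred_sources as SearchSource[],
    search_query,
  };
}
```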
diff --git a/tools/deep-research/assets/statement_extractor.txt b/tools/deep-research/assets/statement_extractor.txt
@@ -0,0 +1,78 @@
+# Statement Extraction Prompt
+
+You're an expert at extracting facts from a source page. You have been tasked with extracting the facts from the source page that help answer the original question.
+Original Question: ###REPLACE-C###
+You will be given a source with the following fields:
+- id: number - Unique identifier for the source
+- url: string - URL of the source page
+- title: string - Title of the source page
+- markdown: string - Full text content of the source page
+
+###REPLACE-D###
+
+# Fact Extraction Instructions
+
+You will be given the contents of the provided source page. Your job is to extract the facts that are helpful to answer the original question.
+Please format the extracted facts as a JSON object with the following structure, where "extractedFacts" is an array of objects.
+## Output JSON Structure
+```json
+{
+  "sourceId": "number - ID of the source",
+  "sourceTitle": "string - Title of the source",
+  "extractedFacts": [
+    {
+      "statement": "string - Verbatim text from the source",
+      "relevance": "string - One of ['DIRECT_ANSWER', 'HIGHLY_RELEVANT', 'SOMEWHAT_RELEVANT', 'TANGENTIAL', 'NOT_RELEVANT']"
+    }
+  ]
+}
+```
+
+## Relevance Classification Guide:
+- DIRECT_ANSWER: 
+  - Completely and precisely addresses the original question
+  - Contains the core information needed to fully respond
+  - Minimal to no additional context required
+
+- HIGHLY_RELEVANT: 
+  - Provides substantial information directly related to the question
+  - Offers critical context or partial solution
+  - Significantly contributes to understanding
+
+- SOMEWHAT_RELEVANT: 
+  - Provides partial or indirect information
+  - Offers peripheral insights
+  - Requires additional context to be fully meaningful
+
+- TANGENTIAL: 
+  - Loosely connected to the topic
+  - Provides background or related information
+  - Not directly addressing the core question
+
+- NOT_RELEVANT: 
+  - No meaningful connection to the original question
+  - Completely unrelated information
+
+## Extraction Guidelines:
+1. Read the entire source document carefully
+2. Extract EXACT quotes that:
+  - Are actually helpful in answering the provided question
+  - Are stated verbatim from the source or are rephrased in such a way that doesn't distort the meaning in the original source
+  - Represent complete thoughts or meaningful segments
+3. Classify each extracted fact with its relevance level
+4. Preserve original context and nuance
+
+## Critical Rules:
+- Try NOT to paraphrase or modify the original text. If you can't find a direct quote, or you think the quote you found is too long, you can paraphrase it.
+- Keep the "statement" field free of anything that does not help answer the provided question, such as JavaScript, URLs, HTML, and other non-textual content
+- Extract statements as they appear in the source and ONLY if they help answer the provided question
+- Include full sentences or meaningful text segments
+- Preserve original formatting and punctuation
+- Sort extracted facts by relevance (DIRECT_ANSWER first)
+- Output JSON only, without code block tags, escape characters, or any text that is not JSON; otherwise my system will crash.
+
+## Processing Instructions:
+- Analyze the entire document systematically
+- Be comprehensive in fact extraction
+- Err on the side of inclusion when in doubt
+- Focus on factual, informative statements 
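The extraction prompt above asks for bare JSON sorted by relevance, but a caller may still want to guard against a reply wrapped in a code fence or returned out of order. The following is a minimal defensive sketch; `stripCodeFence` and `parseExtraction` are hypothetical names, and discarding facts with an unrecognized relevance label is an assumption.

```typescript
// Hypothetical helpers; not part of the commit.
// Relevance levels in priority order, as listed in the prompt.
const RELEVANCE_ORDER = [
  "DIRECT_ANSWER",
  "HIGHLY_RELEVANT",
  "SOMEWHAT_RELEVANT",
  "TANGENTIAL",
  "NOT_RELEVANT",
] as const;
type Relevance = (typeof RELEVANCE_ORDER)[number];

interface ExtractedFact {
  statement: string;
  relevance: Relevance;
}

interface ExtractionResult {
  sourceId: number;
  sourceTitle: string;
  extractedFacts: ExtractedFact[];
}

// Three backticks, built programmatically so this sample contains no literal fence.
const FENCE = "`".repeat(3);

// Remove a surrounding markdown code fence if the model added one anyway.
function stripCodeFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
    if (text.endsWith(FENCE)) {
      text = text.slice(0, -FENCE.length);
    }
  }
  return text.trim();
}

// Parse the reply, drop facts with unknown labels, and sort DIRECT_ANSWER first.
function parseExtraction(raw: string): ExtractionResult {
  const data = JSON.parse(stripCodeFence(raw)) as ExtractionResult;
  const facts = (data.extractedFacts ?? [])
    .filter((f) => RELEVANCE_ORDER.indexOf(f.relevance) !== -1)
    .sort(
      (a, b) => RELEVANCE_ORDER.indexOf(a.relevance) - RELEVANCE_ORDER.indexOf(b.relevance)
    );
  return {
    sourceId: Number(data.sourceId),
    sourceTitle: String(data.sourceTitle),
    extractedFacts: facts,
  };
}
```

Re-sorting on the caller's side means the answer-generation step does not depend on the model having honored the "Sort extracted facts by relevance" rule.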
diff --git a/tools/deep-research/metadata.json b/tools/deep-research/metadata.json
@@ -0,0 +1,125 @@
+{
+    "name": "Deep Research Engine",
+    "homepage": "https://github.com/dcSpark/shinkai-tools/blob/main/tools/deepresearch-engine/README.md",
+    "description": "This function takes a question as input and returns a comprehensive answer, along with the sources and statements used to generate the answer.",
+    "author": "Shinkai",
+    "version": "1.0.0",
+    "keywords": [
+      "search",
+      "answer generation",
+      "fact extraction",
+      "wikipedia",
+      "google"
+    ],
+    "runner": "any",
+    "operating_system": [
+      "linux",
+      "macos",
+      "windows"
+    ],
+    "tool_set": "",
+    "configurations": {
+      "type": "object",
+      "properties": {
+        "searchEngine": {
+          "type": "string",
+          "description": "The search engine to use",
+          "default": "google"
+        },
+        "searchEngineApiKey": {
+          "type": "string",
+          "description": "The API key for the search engine",
+          "default": ""
+        },
+        "maxSources": {
+          "type": "number",
+          "description": "The maximum number of sources to return",
+          "default": 10
+        }
+      },
+      "required": []
+    },
+    "parameters": {
+      "properties": {
+        "question": {
+          "description": "The question to answer",
+          "type": "string"
+        }
+      },
+      "required": [
+        "question"
+      ],
+      "type": "object"
+    },
+    "result": {
+      "properties": {
+        "response": {
+          "description": "The generated answer",
+          "type": "string"
+        },
+        "sources": {
+          "description": "The sources used to generate the answer",
+          "items": {
+            "type": "object",
+            "properties": {
+              "id": {
+                "type": "number"
+              },
+              "url": {
+                "type": "string"
+              },
+              "title": {
+                "type": "string"
+              }
+            }
+          },
+          "type": "array"
+        },
+        "statements": {
+          "description": "The statements extracted from the sources",
+          "items": {
+            "type": "object",
+            "properties": {
+              "sourceId": {
+                "type": "number"
+              },
+              "sourceTitle": {
+                "type": "string"
+              },
+              "extractedFacts": {
+                "type": "array",
+                "items": {
+                  "type": "object",
+                  "properties": {
+                    "statement": {
+                      "type": "string"
+                    },
+                    "relevance": {
+                      "type": "string"
+                    }
+                  }
+                }
+              }
+            }
+          },
+          "type": "array"
+        }
+      },
+      "required": [
+        "response",
+        "sources",
+        "statements"
+      ],
+      "type": "object"
+    },
+    "sqlTables": [],
+    "sqlQueries": [],
+    "tools": [
+      "local:::__official_shinkai:::google_search",
+      "local:::__official_shinkai:::duckduckgo_search",
+      "local:::__official_shinkai:::shinkai_llm_prompt_processor",
+      "local:::__official_shinkai:::shinkai_llm_map_reduce_processor",
+      "local:::__official_shinkai:::download_pages",
+      "local:::__official_shinkai:::shinkai_sqlite_query_executor"
+    ]
+  }
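The `configurations` schema above declares three optional properties with defaults (`searchEngine: "google"`, `searchEngineApiKey: ""`, `maxSources: 10`). A minimal sketch of how a tool might resolve them follows; `resolveConfig` is a hypothetical name, and falling back to the default when `maxSources` is not a positive number is an assumption that the manifest itself does not specify.

```typescript
// Hypothetical helper; not part of the commit. Defaults are copied from metadata.json.
interface Configurations {
  searchEngine?: string;
  searchEngineApiKey?: string;
  maxSources?: number;
}

interface ResolvedConfigurations {
  searchEngine: string;
  searchEngineApiKey: string;
  maxSources: number;
}

const DEFAULTS: ResolvedConfigurations = {
  searchEngine: "google",
  searchEngineApiKey: "",
  maxSources: 10,
};

function resolveConfig(config: Configurations = {}): ResolvedConfigurations {
  const requested = config.maxSources;
  // Assumption: a missing, non-finite, or sub-1 value falls back to the default.
  const maxSources =
    typeof requested === "number" && Number.isFinite(requested) && requested >= 1
      ? Math.floor(requested)
      : DEFAULTS.maxSources;
  return {
    searchEngine: config.searchEngine ?? DEFAULTS.searchEngine,
    searchEngineApiKey: config.searchEngineApiKey ?? DEFAULTS.searchEngineApiKey,
    maxSources,
  };
}
```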