From 0b1ed84d674f29907470dfba47246a322de0b505 Mon Sep 17 00:00:00 2001
From: Brian Godsey
Date: Tue, 4 Feb 2025 13:59:28 -0500
Subject: [PATCH] Graph docs cleanup 202501 (#654)

* fix-prereqs-link
* Making changes to the main graph RAG page.
* Moving usable content to the main grpah RAG page.
* Deleting the two graph sub-pages after moving some content to the main graph page.
* remove-pages-from-nav
* style-guide-cleanup-note
* use-page-alias-instead-of-redirect

---------

Co-authored-by: Mendon Kissling <59585235+mendonk@users.noreply.github.com>
---
 docs/modules/ROOT/nav.adoc                    |   2 -
 docs/modules/knowledge-graph/pages/index.adoc |  65 +++++-----
 .../pages/knowledge-graph.adoc                | 117 ------------------
 .../pages/knowledge-store.adoc                |  54 --------
 4 files changed, 28 insertions(+), 210 deletions(-)
 delete mode 100644 docs/modules/knowledge-graph/pages/knowledge-graph.adoc
 delete mode 100644 docs/modules/knowledge-graph/pages/knowledge-store.adoc

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index de7f86ccb..1c2db9413 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -26,8 +26,6 @@
 .Graph Libraries
 * xref:knowledge-graph:index.adoc[]
-* xref:knowledge-graph:knowledge-graph.adoc[]
-* xref:knowledge-graph:knowledge-store.adoc[]

 .Introduction to RAG
 * xref:intro-to-rag:index.adoc[]
diff --git a/docs/modules/knowledge-graph/pages/index.adoc b/docs/modules/knowledge-graph/pages/index.adoc
index 991b33d60..7d9602b07 100644
--- a/docs/modules/knowledge-graph/pages/index.adoc
+++ b/docs/modules/knowledge-graph/pages/index.adoc
@@ -1,19 +1,17 @@
 = Introduction to Graph-Based Knowledge Extraction and Traversal
+:page-aliases: knowledge-graph:knowledge-graph.adoc, knowledge-graph:knowledge-store.adoc

-RAGStack offers two libraries supporting knowledge graph extraction and traversal, `ragstack-ai-knowledge-graph` and `ragstack-ai-knowledge-store`.
+[IMPORTANT]
+====
+The `ragstack-ai-knowledge-graph` and `ragstack-ai-knowledge-store` libraries are no longer under development.

-A knowledge graph represents information as **nodes**. Nodes are connected by **edges** indicating relationships between them. Each edge includes the source (for example, "Marie Curie" the person), the target ("Nobel Prize" the award) and a type, indicating how the source relates to the target (for example, “won”).
+Instead, you can find the latest tools and techniques for working with knowledge graphs and graph RAG in the https://github.com/datastax/graph-rag[Graph RAG project].

-A graph database isn't required to use the knowledge graph libraries - RAGStack uses Astra DB or Apache Cassandra to store and retrieve graphs.
+If you have further questions, contact https://support.datastax.com/[DataStax Support].
+====

-The `ragstack-ai-knowledge-graph` library offers **entity-centric** knowledge graph extraction and traversal. It extracts a knowledge graph from unstructured information and creates nodes from **entities**, or concepts (for example, "Seattle").
+A knowledge graph represents information as **nodes**. Nodes are connected by **edges** indicating relationships between them. Each edge includes the source (for example, "Marie Curie" the person), the target ("Nobel Prize" the award), and a type, indicating how the source relates to the target (for example, "won").
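+As a minimal illustration (not a specific library API), an edge can be written in code as a `(source, type, target)` triple:
+
+[source,python]
+----
+# Each edge is a (source, type, target) triple; the values here are illustrative.
+edges = [
+    ("Marie Curie", "WON", "Nobel Prize"),
+    ("Marie Curie", "MARRIED_TO", "Pierre Curie"),
+]
+
+# The graph's nodes are implied by the sources and targets of its edges.
+nodes = {node for source, _, target in edges for node in (source, target)}
+----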
-The `ragstack-ai-knowledge-store` library offers **content-centric** knowledge graph extraction and traversal. It extracts a knowledge graph from unstructured information and creates nodes from **content** (for example, a specific document about Seattle).
-
-[IMPORTANT]
-====
-This feature is currently under development and has not been fully tested. It is not supported for use in production environments. Please use this feature in testing and development environments only.
-====

== What's the difference between knowledge graphs and vector similarity search?
@@ -28,49 +26,42 @@ From a developer's perspective, a knowledge graph is built into a RAG pipeline s
For example: consider a tech support system, where you find an article that is similar to your question, and it says, "If you have trouble with step 4, see this article for more information". Even if "more information" is not similar to your original question, it likely provides more information.

-The article's "see more information" is an example of an edge in a knowledge graph. The edge connects the initial article to additional information, indicating that the two are related. This relationship would not be captured in a similarity search.
+The article's HTML links can be examples of edges in a knowledge graph. These edges connect the initial article to additional information, indicating that they are related. This relationship would not be captured in a vector similarity search.

These edges also increase the diversity of results. Within the same tech support system, if you retrieve 100 chunks that are highly similar to the question, you have retrieved 100 chunks that are also highly similar to one another. Following edges to linked information increases diversity.

-== The `ragstack-ai-knowledge-graph` library
-The `ragstack-ai-knowledge-graph` library contains functions for the extraction and traversal of **entity-centric** knowledge graphs.
+== How is Knowledge Graph RAG different from RAG?
+
+Short answer: it isn't. Knowledge graphs are a method of doing RAG, but with a different representation of the information.
+
+RAG with similarity search creates a vector representation of information based on chunks of text. The query is embedded and compared to the stored chunks, and the most similar chunks are returned as context for the answer.

-To install the package, run:
+Knowledge graph RAG extracts a knowledge graph from information, and stores the graph representation in a vector or graph knowledge store.

-[source,bash]
----
-pip install ragstack-ai-knowledge-graph
----
+Instead of a similarity search query, the graph store is **traversed** to extract a sub-graph of the knowledge graph's edges and properties. For example, a query for "Marie Curie" returns a sub-graph of nodes representing her relationships, accomplishments, and other relevant information - the context.

-To install the library as an extra with the RAGStack Langchain package, run:
+You're telling the graph store to "start with this node, and show me the relationships to a depth of 2 nodes outwards."

-[source,bash]
----
-pip install "ragstack-ai-langchain[knowledge-graph]"
----
-For more information, see xref:knowledge-graph.adoc[].

== What's the difference between entity-centric and content-centric knowledge graphs?

-== The `ragstack-ai-knowledge-store` library
+**Entity-centric knowledge graphs** capture edge relationships between entities.
+A knowledge graph is extracted with an LLM from unstructured information, and its entities and their edge relationships are stored in a vector or graph store.
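+For example, a minimal extraction sketch with LangChain's `LLMGraphTransformer` might look like the following. The model choice and sample text here are illustrative only, and the exact nodes and relationships returned depend on the LLM.
+
+[source,python]
+----
+from langchain_core.documents import Document
+from langchain_experimental.graph_transformers import LLMGraphTransformer
+from langchain_openai import ChatOpenAI
+
+# Any chat model can be used; gpt-4o is only an example.
+llm = ChatOpenAI(model="gpt-4o", temperature=0)
+transformer = LLMGraphTransformer(llm=llm)
+
+# Extract entities (nodes) and edge relationships from unstructured text.
+docs = [Document(page_content="Marie Curie won the Nobel Prize and was married to Pierre Curie.")]
+graph_docs = transformer.convert_to_graph_documents(docs)
+
+print(graph_docs[0].nodes)          # extracted entities
+print(graph_docs[0].relationships)  # extracted edge relationships
+----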
-The `ragstack-ai-knowledge-store` library contains functions for creating a **content-centric** vector-and-graph store. This store combines the benefits of vector stores with the context and relationships of a related edges.
+However, extracting this entity-centric knowledge graph from unstructured information is difficult, time-consuming, and error-prone. A user has to guide the LLM on the kinds of nodes and relationships to be extracted with a schema, and if the knowledge schema changes, the graph has to be processed again. The context advantages of entity-centric knowledge graphs are great, but the cost to build and maintain them is much higher than just chunking and embedding content to a vector store.

-To install the package, run:
+**Content-centric knowledge graphs** offer a compromise between the ease and scalability of vector similarity search, and the context and relationships of entity-centric knowledge graphs.

-[source,bash]
----
-pip install ragstack-ai-knowledge-store
----
+The content-centric approach starts with nodes that represent content (a specific document about Seattle), instead of concepts or entities (a node representing Seattle). A node may represent a table, an image, or a section of a document. Since the node represents the original content, the nodes are exactly what is stored when using vector search.

-To install the library as an extra with the RAGStack Langchain package, run:
+Unstructured content is loaded, chunked, and written to a vector store.
+Each chunk can be run through a variety of analyses to identify links. For example, links in the content may turn into `links_to` edges, and keywords may be extracted from the chunk to link up with other chunks on the same topic.

-[source,bash]
----
-pip install "ragstack-ai-langchain[knowledge-store]"
----
+To add edges, each chunk may be annotated with URLs that its content represents, or each chunk may be associated with keywords.

-For more information, see xref:knowledge-store.adoc[].
+Retrieval is where the benefits of vector search and content-centric traversal come together.
+The query's initial starting points in the knowledge graph are identified based on vector similarity to the question, and then additional chunks are selected by following edges from those nodes. Including nodes that are related both by embedding distance (similarity) and graph distance (related) leads to a more diverse set of chunks with deeper context and fewer hallucinations.
diff --git a/docs/modules/knowledge-graph/pages/knowledge-graph.adoc b/docs/modules/knowledge-graph/pages/knowledge-graph.adoc
deleted file mode 100644
index e678fffe9..000000000
--- a/docs/modules/knowledge-graph/pages/knowledge-graph.adoc
+++ /dev/null
@@ -1,117 +0,0 @@
-= Knowledge Graph RAG
-
-Knowledge Graph is a RAGStack library that provides graph-based representation and retrieval of information. It is designed to store and retrieve information in a way that is more efficient and accurate than vector-based similarity search over Document chunks.
-
-See the xref:examples:knowledge-graph.adoc[Knowledge graph example code] to get started using Knowledge Graph RAG.
-
-[IMPORTANT]
-====
-This feature is currently under development and has not been fully tested. It is not supported for use in production environments. Please use this feature in testing and development environments only.
-==== - -== The `ragstack-ai-knowledge-graph` library - -The `ragstack-ai-knowledge-graph` library contains functions for the extraction and traversal of **entity-centric** knowledge graphs. - -To install the package, run: - -[source,bash] ----- -pip install ragstack-ai-knowledge-graph ----- - -To install the library as an extra with the RAGStack Langchain package, run: - -[source,bash] ----- -pip install "ragstack-ai-langchain[knowledge-graph]" ----- - -== How is Knowledge Graph different from RAG? - -Short answer: it isn't. Knowledge graphs are a method of doing RAG, but with a different representation of the information. - -RAG with similarity search creates a vector representation of information based on chunks of text. The query is compared to the question, and the most similar chunks are returned as the answer. - -Knowledge graph RAG extracts a knowledge graph from information, and stores the graph representation in a vector or graph knowledge store. - -Instead of a similarity search query, the graph store is **traversed** to extract a sub-graph of the knowledge graph's edges and properties. For example, a query for "Marie Curie" returns a sub-graph of nodes representing her relationships, accomplishments, and other relevant information - the context. - -You're telling the graph store to "start with this node, and show me the relationships to a depth of 2 nodes outwards." - -Here is how the xref:examples:knowledge-graph.adoc#query-graph-store[Knowledge graph example code] uses the Knowledge Graph library to extract a sub-graph around Marie Curie: - -[source,python] ----- -from ragstack_knowledge_graph.traverse import Node - -graph_store.as_runnable(steps=2).invoke(Node("Marie Curie", "Person")) ----- - -Result: - -[source,plain] ----- -{Marie Curie(Person) -> Chemist(Profession): HAS_PROFESSION, - Marie Curie(Person) -> French(Nationality): HAS_NATIONALITY, - Marie Curie(Person) -> Nobel Prize(Award): WON, - Marie Curie(Person) -> Physicist(Profession): HAS_PROFESSION, - Marie Curie(Person) -> Pierre Curie(Person): MARRIED_TO, - Marie Curie(Person) -> Polish(Nationality): HAS_NATIONALITY, - Marie Curie(Person) -> Professor(Profession): HAS_PROFESSION, - Marie Curie(Person) -> Radioactivity(Scientific concept): RESEARCHED, - Marie Curie(Person) -> Radioactivity(Scientific field): RESEARCHED_IN, - Marie Curie(Person) -> University Of Paris(Organization): WORKED_AT, - Pierre Curie(Person) -> Nobel Prize(Award): WON} ----- - -As with RAG, this sub-graph context is then dropped into the prompt to generate answers. - -[source,python] ----- -ANSWER_PROMPT = ( - "The original question is given below." - "This question has been used to retrieve information from a knowledge graph." - "The matching triples are shown below." - "Use the information in the triples to answer the original question.\n\n" - "Original Question: {question}\n\n" - "Knowledge Graph Triples:\n{context}\n\n" - "Response:" -) - -chain = ( - { "question": RunnablePassthrough() } - # extract_entities is provided by the Cassandra knowledge graph library - # and extracts entitise as shown above. - | RunnablePassthrough.assign(entities = extract_entities(llm)) - | RunnablePassthrough.assign( - # graph_store.as_runnable() is provided by the CassandraGraphStore - # and takes one or more entities and retrieves the relevant sub-graph(s). 
- triples = itemgetter("entities") | graph_store.as_runnable()) - | RunnablePassthrough.assign( - context = itemgetter("triples") | RunnableLambda(_combine_relations)) - | ChatPromptTemplate.from_messages([ANSWER_PROMPT]) - | llm -) ----- - -Result: - -[source,bash] ----- -Nodes: [Node(id='Marie Curie', type='Person'), Node(id='Polish', type='Nationality'), Node(id='French', type='Nationality'), Node(id='Physicist', type='Profession'), Node(id='Chemist', type='Profession'), Node(id='Radioactivity', type='Scientific concept'), Node(id='Nobel Prize', type='Award'), Node(id='Pierre Curie', type='Person'), Node(id='University Of Paris', type='Institution'), Node(id='Professor', type='Profession')] -Relationships: [Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Polish', type='Nationality'), type='HAS_NATIONALITY'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='French', type='Nationality'), type='HAS_NATIONALITY'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Physicist', type='Profession'), type='IS_A'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Chemist', type='Profession'), type='IS_A'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Radioactivity', type='Scientific concept'), type='RESEARCHED'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Nobel Prize', type='Award'), type='WON'), Relationship(source=Node(id='Pierre Curie', type='Person'), target=Node(id='Nobel Prize', type='Award'), type='WON'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Pierre Curie', type='Person'), type='MARRIED_TO'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='University Of Paris', type='Institution'), type='WORKED_AT'), Relationship(source=Node(id='Marie Curie', type='Person'), target=Node(id='Professor', type='Profession'), type='IS_A')] -Chain Response: content='Marie Curie was a physicist, chemist, and professor. She was of French and Polish nationality. She was married to Pierre Curie and both of them won the Nobel Prize. She worked at the University of Paris and researched radioactivity.' response_metadata={'token_usage': {'completion_tokens': 50, 'prompt_tokens': 308, 'total_tokens': 358}, 'model_name': 'gpt-4', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-79178e44-64a0-4077-8b90-f21fd004f745-0' ----- - -== Knowledge Graph, RAGStack, and Astra DB - -Knowledge graph extracts graphs from documents using the LLMGraphTransformer library from Langchain, stores the graphs in a Cassandra database, and traverses the graph to extract sub-graphs for answering questions with a https://github.com/datastax/ragstack-ai/blob/main/libs/knowledge-graph/ragstack_knowledge_graph/traverse.py[custom function]. - -A graph database or query language isn't required to use the knowledge graph library. - -Retrieving the sub-knowledge graph around a few nodes is a simple graph traversal, while graph DBs are designed for much more complex queries searching for paths with specific sequences of properties. Sub-knowledge graph traversal is often only to a depth of 2 or 3, since nodes which are farther removed become irrelevant to the question pretty quickly. This can be expressed as a few rounds of simple queries (one for each step) or a SQL join. - -Eliminating the need for a separate graph database makes it easier to use knowledge graphs. 
-Using Astra DB or Cassandra simplifies transactional writes to both the graph and other data stored in the same place, and likely scales better. -Finally, using RAGStack ensures Langchain components like LLMGraphTransformer remain stable. \ No newline at end of file diff --git a/docs/modules/knowledge-graph/pages/knowledge-store.adoc b/docs/modules/knowledge-graph/pages/knowledge-store.adoc deleted file mode 100644 index db777286f..000000000 --- a/docs/modules/knowledge-graph/pages/knowledge-store.adoc +++ /dev/null @@ -1,54 +0,0 @@ -= {graph-store} - -{graph-store} is a hybrid vector-and-graph store that combines the benefits of vector stores with the context and relationships of related edges between chunks. - -See the xref:examples:knowledge-store.adoc[{graph-store} example code] to get started with {graph-store}. - -[IMPORTANT] -==== -This feature is currently under development and has not been fully tested. It is not supported for use in production environments. Please use this feature in testing and development environments only. -==== - -== The `ragstack-ai-knowledge-store` library - -The `ragstack-ai-knowledge-store` library contains functions for creating a hybrid vector-and-graph knowledge store. This store combines the benefits of vector stores with the context and relationships of a related edges. - -To install the package, run: - -[source,bash] ----- -pip install ragstack-ai-knowledge-store ----- - -To install the library as an extra with the RAGStack Langchain package, run: - -[source,bash] ----- -pip install "ragstack-ai-langchain[knowledge-store]" ----- - -== What's the difference between entity-centric and content-centric knowledge graphs? - -**Entity-centric knowledge graphs** (like xref:knowledge-graph.adoc[]) capture edge relationships between entities. -A knowledge graph is extracted with an LLM from unstructured information, and its entities and their edge relationships are stored in a vector or graph store. - -However, extracting this entity-centric knowledge graph from unstructured information is difficult, time-consuming, and error-prone. A user has to guide the LLM on the kinds of nodes and relationships to be extracted with a schema, and if the knowledge schema changes, the graph has to be processed again. The context advantages of entity-centric knowledge graphs are great, but the cost to build and maintain them is much higher than just chunking and embedding content to a vector store. - -**Content-centric knowledge graphs** (like xref:knowledge-store.adoc[]) offer a compromise between the ease and scalability of vector similarity search, and the context and relationships of entity-centric knowledge graphs. - -The content-centric approach starts with nodes that represent content (a specific document about Seattle), instead of concepts or entities (a node representing Seattle). A node may represent a table, an image, or a section of a document. Since the node represents the original content, the nodes are exactly what is stored when using vector search. - -Unstructured content is loaded, chunked, and written to a vector store. -Each chunk can be run through a variety of analyses to identify links. For example, links in the content may turn into `links_to edges`, and keywords may be extracted from the chunk to link up with other chunks on the same topic. - -To add edges, each chunk may be annotated with URLs that its content represents, or each chunk may be associated with keywords. 
- -Retrieval is where the benefits of vector search and content-centric traversal come together. -The query's initial starting points in the knowledge graph are identified based on vector similarity to the question, and then additional chunks are selected by following edges from that node. Including nodes that are related both by embedding distance (similarity) and graph distance (related) leads to a more diverse set of chunks with deeper context and less hallucinations. - -For a step-by-step example, see the xref:examples:knowledge-store.adoc[{graph-store} example code]. - - - - -