docs landing page and content #117

Merged 3 commits on Feb 5, 2025

Changes from all commits
5 changes: 0 additions & 5 deletions .github/workflows/main.yml
@@ -199,11 +199,6 @@ jobs:
- name: Set up the environment
uses: ./.github/actions/setup-python-env

# mkdocstrings uses `ruff` to format generated signatures.
# It doesn't find the version installed by `uv`
- name: Install ruff
uses: astral-sh/ruff-action@v3

- name: Sync Docs Dependencies
run: uv sync --all-packages --group=docs --all-extras

48 changes: 48 additions & 0 deletions docs/get-started/adapters.md
@@ -2,5 +2,53 @@

Adapters allow `graph-retriever` to connect to specific vector stores.

| Vector Store | Supported | Collection Support | Combined Adjacent Query |
| ------------------------------ | ------------------------------- | -------------------------------- | ------------------------------- |
| [DataStax Astra](#astra) | :material-check-circle:{.green} | :material-check-circle:{.green} | :material-check-circle:{.green} |
| [OpenSearch](#opensearch) | :material-check-circle:{.green} | :material-check-circle:{.green} | |
| [Apache Cassandra](#cassandra) | :material-check-circle:{.green} | :material-alert-circle:{.yellow} | |
| [Chroma](#chroma) | :material-check-circle:{.green} | :material-alert-circle:{.yellow} | |

__Supported__

: Indicates whether a given store is completely supported (:material-check-circle:{.green}) or has limited support (:material-alert-circle:{.yellow}).

__Collection Support__

: Indicates whether the store supports lists in metadata values or not. Stores which do not support it directly (:material-alert-circle:{.yellow}) can be used by applying the [MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer] document transformer to documents before writing, which spreads the items of the collection into multiple metadata keys.

__Combined Adjacent Query__

: Whether the store supports the more efficient "combined adjacent query", which retrieves the nodes adjacent to multiple edges in a single query. Stores without this support fall back to issuing a separate query for each edge. Stores that do support the combined adjacent query perform much better, especially when retrieving large numbers of nodes and/or dealing with highly connected content.
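
For stores with limited collection support, the denormalization step described under __Collection Support__ can be applied before writing. A minimal sketch, using illustrative document contents:

```python
from langchain_core.documents import Document
from langchain_graph_retriever.transformers.metadata_denormalizer import (
    MetadataDenormalizer,
)

docs = [
    Document(
        id="article1",
        page_content="...",
        metadata={"keywords": ["GPT", "GenAI"], "primary_author": "Eric"},
    ),
]

# Spread the list-valued "keywords" metadata into multiple scalar keys so that
# stores without collection support can still filter on them.
denormalized_docs = list(MetadataDenormalizer().transform_documents(docs))

# Write `denormalized_docs` to the store as usual (for example with
# `from_documents`), then query through the corresponding adapter.
```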

!!! warning

Graph Retriever can be used with any of these supported Vector Stores. However, stores
that operate directly on normalized data and perform the combined adjacent query are
much more performant and better suited for production use. Stores like Chroma are best
employed for early experimentation, while it is generally recommended to use a store like DataStax AstraDB when scaling up.

## Supported Stores

### Astra

[DataStax AstraDB](https://www.datastax.com/products/datastax-astra) is
supported by the
[`AstraAdapter`][langchain_graph_retriever.adapters.astra.AstraAdapter]. The adapter
supports operating on metadata containing both primitive and list values. Additionally, it optimizes the request for nodes connected to multiple edges into a single query.

### OpenSearch

[OpenSearch](https://opensearch.org/) is supported by the [`OpenSearchAdapter`][langchain_graph_retriever.adapters.open_search.OpenSearchAdapter]. The adapter supports operating on metadata containing both primitive and list values. It does not support the combined adjacent query.

### Apache Cassandra {: #cassandra}

[Apache Cassandra](https://cassandra.apache.org/) is supported by the [`CassandraAdapter`][langchain_graph_retriever.adapters.cassandra.CassandraAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not support the combined adjacent query.

### Chroma

[Chroma](https://www.trychroma.com/) is supported by the [`ChromaAdapter`][langchain_graph_retriever.adapters.chroma.ChromaAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not support the combined adjacent query.

## Implementation

The [Adapter][graph_retriever.adapters.Adapter] interface may be implemented directly. For LangChain [VectorStores][langchain_core.vectorstores.base.VectorStore], [LangchainAdapter][langchain_graph_retriever.adapters.langchain.LangchainAdapter] and [DenormalizedAdapter][langchain_graph_retriever.adapters.langchain.DenormalizedAdapter] provide much of the necessary functionality.
41 changes: 41 additions & 0 deletions docs/get-started/edges.md
@@ -0,0 +1,41 @@
# Edges

Edges specify how content should be linked.
Often, content in existing vector stores has metadata based on structured information.
For example, a vector store containing articles may have information about the authors, keywords, and citations of those articles.
__Such content can be traversed along relationships already present in that metadata!__
See [Specifying Edges](#specifying-edges) for more on how edges are specified.

## Specifying Edges {: #specifying-edges}

```python title="Example content"
Content(
id="article1",
content="...",
metadata={
"keywords": ["GPT", "GenAI"],
"authors": ["Ben", "Eric"],
"primary_author": "Eric",
"cites": ["article2", "article3"],
}
)
```

1. `("keywords", "keywords")` connects to other articles about GPT and GenAI.
2. `("authors", "authors")` connects to other articles by any of the same authors.
3. `("authors", "primary_author")` connects to other articles whose primary author was Ben or Eric.
4. `("cites", Id())` connects to the articles cited (by ID).
5. `(Id(), "cites")` connects to articles which cite this one.
6. `("cites", "cites")` connects to other articles with citations in common.

## Edge Functions

While sometimes the information needed for traversal is missing and the vector store
needs to be re-populated, in other cases the information exists but is not quite
in a suitable format for traversal. For instance, the `"authors"` field may
contain a list of authors along with their institutions, making it impossible to link to
other articles by the same author when they were at a different institution.

In such cases, you can provide a custom
[`EdgeFunction`][graph_retriever.edges.EdgeFunction] to extract the edges for
traversal.
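
A minimal sketch of such a function is shown below. It assumes an `EdgeFunction` maps a piece of content to an `Edges` value holding incoming and outgoing `MetadataEdge`s; the exact import paths and constructors are assumptions, so check the API reference for the precise types.

```python
from graph_retriever import Content                    # assumed import location
from graph_retriever.edges import Edges, MetadataEdge  # assumed types

def author_edges(content: Content) -> Edges:
    """Link articles by author name only, dropping any institution suffix."""
    # e.g. "Ben (UC Berkeley)" -> "Ben"
    names = [author.split(" (")[0] for author in content.metadata.get("authors", [])]
    edges = [MetadataEdge("authors", name) for name in names]  # assumed (field, value) form
    # Use the same edges in both directions so articles link to each other.
    return Edges(incoming=edges, outgoing=edges)
```
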
185 changes: 141 additions & 44 deletions docs/get-started/index.md
@@ -8,9 +8,8 @@ We assume you already have a working `langchain` installation, including an LLM

In that case, you only need to install `langchain-graph-retriever`:

```bash
pip install langchain langchain-graph-retriever
```

## Preparing Data
@@ -51,59 +50,160 @@ For this guide, I have a JSON file with information about animals. Several examp
"habitat": "rainforest"
}
}
```

```python title="Fetching Animal Data"
from graph_rag_example_helpers.datasets.animals import fetch_documents

animals = fetch_documents()
```

## Populating the Vector Store

The following shows how to populate a variety of vector stores with the animal data.

=== "Astra"

```python
from dotenv import load_dotenv
from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

load_dotenv()
vector_store = AstraDBVectorStore.from_documents(
collection_name="animals",
documents=animals,
embedding=OpenAIEmbeddings(),
)
```

=== "Apache Cassandra"

```python
from langchain_community.vectorstores.cassandra import Cassandra
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)

metadata_denormalizer = MetadataDenormalizer() # (1)!
vector_store = Cassandra.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
table_name="animals",
)
```

1. Since Cassandra doesn't index items in lists for querying, it is necessary to
   denormalize metadata containing lists so they can be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
metadata keys used as edge targets.

=== "OpenSearch"

```python
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings

vector_store = OpenSearchVectorSearch.from_documents(
opensearch_url=OPEN_SEARCH_URL,
index_name="animals",
        embedding=OpenAIEmbeddings(),
engine="faiss",
documents=animals,
)
```

=== "Chroma"

```python
from langchain_chroma.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)

metadata_denormalizer = MetadataDenormalizer() # (1)!
vector_store = Chroma.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
        collection_name="animals",
)
```

1. Since Chroma doesn't index items in lists for querying, it is necessary to
   denormalize metadata containing lists so they can be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
metadata keys used as edge targets.

## Simple Traversal

For our first retrieval and graph traversal, we're going to start with a single animal best matching the query, and then traverse to other animals sharing the same `habitat`, `origin`, and/or `keywords`.

=== "Astra"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

simple = GraphRetriever(
store = vector_store,
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

=== "Apache Cassandra"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
from langchain_graph_retriever.adapters.cassandra import CassandraAdapter

simple = GraphRetriever(
        store = CassandraAdapter(vector_store, metadata_denormalizer, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

=== "OpenSearch"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

simple = GraphRetriever(
store = vector_store,
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```


=== "Chroma"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
from langchain_graph_retriever.adapters.chroma import ChromaAdapter

simple = GraphRetriever(
store = ChromaAdapter(vector_store, metadata_denormalizer, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

!!! note "Denormalization"

    The code above is nearly the same for all stores; however, the adapters for denormalized stores (Chroma and Apache Cassandra) require extra configuration to specify which metadata fields need to be rewritten when issuing queries.

The above creates a graph traversing retriever that starts with the nearest animal (`start_k=1`), retrieves 10 documents (`k=10`) and limits the search to documents that are at most 2 steps away from the first animal (`depth=2`).

The edges define how metadata values can be used for traversal. In this case, every animal is connected to other animals sharing the same habitat, origin, and/or keywords.

```python
simple_results = simple.invoke("what mammals could be found near a capybara")

for doc in simple_results:
@@ -115,10 +215,7 @@ for doc in simple_results:
`langchain-graph-retriever` includes code for converting the document graph into a `networkx` graph, for rendering and other analysis.
See the figure rendered by the example below.

```python title="Graph of retrieved documents"
import networkx as nx
import matplotlib.pyplot as plt
from langchain_graph_retriever.document_graph import create_graph
29 changes: 29 additions & 0 deletions docs/get-started/strategies.md
@@ -0,0 +1,29 @@
# Strategies

Strategies determine which nodes are selected during [traversal](./traversal.md).

All strategies allow you to control how many nodes are retrieved (`k`), how many
nodes are found during the initial vector search (`start_k`), how many adjacent
nodes are fetched at each step of the traversal (`adjacent_k`), and the maximum
depth of the traversal (`max_depth`).

## Eager

The [`Eager`][graph_retriever.strategies.Eager] strategy selects all of the discovered nodes at each step of the traversal.

It doesn't support configuration beyond the standard options.
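
For example, an `Eager` strategy configured with only the standard options listed above (the values here are illustrative):

```python
from graph_retriever.strategies import Eager

# Select every discovered node until 10 have been retrieved, seeding the
# traversal with the 4 best vector-search matches, fetching up to 20 adjacent
# nodes per step, and stopping at depth 3.
strategy = Eager(k=10, start_k=4, adjacent_k=20, max_depth=3)
```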

## MMR

The [`MMR`][graph_retriever.strategies.Mmr] strategy selects nodes with the
highest maximum marginal relevance score at each iteration.

It can be configured with a `lambda_mult` which controls the trade-off between relevance and diversity.
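
A sketch of configuring it, assuming the usual convention that values closer to `1.0` weight relevance more heavily and values closer to `0.0` weight diversity more heavily:

```python
from graph_retriever.strategies import Mmr

# Trade a little relevance for more diverse results.
strategy = Mmr(k=10, start_k=4, adjacent_k=20, lambda_mult=0.4)
```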

## Scored

The [`Scored`][graph_retriever.strategies.Scored] strategy applies a user-defined function to each node to assign a score, and selects a number of nodes with the highest scores.
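
A sketch, assuming the scoring callable is passed as `scorer` and receives each discovered node; the parameter and attribute names here are assumptions, so check the API reference:

```python
from graph_retriever.strategies import Scored

# Prefer nodes whose metadata marks them as peer reviewed.
strategy = Scored(
    scorer=lambda node: 2.0 if node.metadata.get("peer_reviewed") else 1.0,  # assumed API
    k=10,
)
```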

## User-Defined Strategies

You can also implement your own [`Strategy`][graph_retriever.strategies.Strategy]. This allows you to control how discovered nodes are tracked and selected for traversal.
12 changes: 12 additions & 0 deletions docs/get-started/traversal.md
@@ -0,0 +1,12 @@
# Traversal

At a high level, traversal performs the following steps (a code sketch follows the list):

1. Retrieve the `start_k` nodes most similar to the `query` using vector search.
2. Find the nodes reachable from the `initial_root_ids`.
3. Discover the `start_k` nodes and the neighbors of the initial roots as "depth 0" candidates.
4. Ask the strategy which nodes to visit next.
5. If no more nodes to visit, exit and return the selected nodes.
6. Record those nodes as selected and retrieve the top `adjacent_k` nodes reachable from them.
7. Discover the newly reachable nodes (updating depths as needed).
8. Go to step 4.
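
The loop above can be sketched as follows. This is an illustrative outline only, not the library's implementation; `similar`, `adjacent`, and `select_next` are caller-supplied stand-ins for the vector search, the adjacent query, and the strategy.

```python
def traverse(similar, adjacent, select_next, start_k=4, adjacent_k=10):
    """Outline of the traversal loop; returns selected node ids mapped to depth."""
    # Steps 1-3: seed the candidate pool at depth 0.
    discovered = {node_id: 0 for node_id in similar(start_k)}
    selected: dict[str, int] = {}
    while True:
        # Steps 4-5: ask the strategy which nodes to visit next; stop when none remain.
        to_visit = select_next({n: d for n, d in discovered.items() if n not in selected})
        if not to_visit:
            return selected
        # Step 6: record the selected nodes and find the nodes reachable from them.
        for node_id in to_visit:
            selected[node_id] = discovered[node_id]
        next_depth = max(selected[n] for n in to_visit) + 1
        # Step 7: record newly reachable nodes, keeping the smallest known depth.
        for neighbor in adjacent(to_visit, adjacent_k):
            if neighbor not in discovered or discovered[neighbor] > next_depth:
                discovered[neighbor] = next_depth
```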