docs landing page and content #117

Merged 3 commits on Feb 5, 2025

Changes from all commits
5 changes: 0 additions & 5 deletions .github/workflows/main.yml
@@ -199,11 +199,6 @@ jobs:
- name: Set up the environment
uses: ./.github/actions/setup-python-env

# mkdocstrings uses `ruff` to format generated signatures.
# It doesn't find the version installed by `uv`
- name: Install ruff
uses: astral-sh/ruff-action@v3

- name: Sync Docs Dependencies
run: uv sync --all-packages --group=docs --all-extras

48 changes: 48 additions & 0 deletions docs/get-started/adapters.md
@@ -2,5 +2,53 @@

Adapters allow `graph-retriever` to connect to specific vector stores.

| Vector Store | Supported | Collection Support | Combined Adjacent Query |
| ------------------------------ | ------------------------------- | -------------------------------- | ------------------------------- |
| [DataStax Astra](#astra) | :material-check-circle:{.green} | :material-check-circle:{.green} | :material-check-circle:{.green} |
| [OpenSearch](#opensearch) | :material-check-circle:{.green} | :material-check-circle:{.green} | |
| [Apache Cassandra](#cassandra) | :material-check-circle:{.green} | :material-alert-circle:{.yellow} | |
| [Chroma](#chroma) | :material-check-circle:{.green} | :material-alert-circle:{.yellow} | |

__Supported__

: Indicates whether a given store is completely supported (:material-check-circle:{.green}) or has limited support (:material-alert-circle:{.yellow}).

__Collection Support__

: Indicates whether the store supports lists in metadata values or not. Stores which do not support it directly (:material-alert-circle:{.yellow}) can be used by applying the [MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer] document transformer to documents before writing, which spreads the items of the collection into multiple metadata keys.

__Combined Adjacent Query__

: Whether the store supports the more efficient "combined adjacent query", which retrieves the nodes adjacent to multiple edges in a single query. Stores without this support fall back to issuing a separate query for each edge. Stores that do support the combined adjacent query perform much better, especially when retrieving large numbers of nodes and/or dealing with highly connected content.
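
For stores with limited collection support, the denormalization step described under __Collection Support__ can be applied before writing. A minimal sketch, using illustrative document contents:

```python
from langchain_core.documents import Document
from langchain_graph_retriever.transformers.metadata_denormalizer import (
    MetadataDenormalizer,
)

docs = [
    Document(
        id="article1",
        page_content="...",
        metadata={"keywords": ["GPT", "GenAI"], "primary_author": "Eric"},
    ),
]

# Spread the list-valued "keywords" metadata into multiple scalar keys so that
# stores without collection support can still filter on them.
denormalized_docs = list(MetadataDenormalizer().transform_documents(docs))

# Write `denormalized_docs` to the store as usual (for example with
# `from_documents`), then query through the corresponding adapter.
```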

!!! warning

Graph Retriever can be used with any of these supported Vector Stores. However, stores
that operate directly on normalized data and perform the combined adjacent query are
much more performant and better suited for production use. Stores like Chroma are best
employed for early experimentation, while it is generally recommended to use a store like DataStax AstraDB when scaling up.

## Supported Stores

### Astra

[DataStax AstraDB](https://www.datastax.com/products/datastax-astra) is
supported by the
[`AstraAdapter`][langchain_graph_retriever.adapters.astra.AstraAdapter]. The adapter
supports operating on metadata containing both primitive and list values. Additionally, it optimizes the request for nodes connected to multiple edges into a single query.

### OpenSearch

[OpenSearch](https://opensearch.org/) is supported by the [`OpenSearchAdapter`][langchain_graph_retriever.adapters.open_search.OpenSearchAdapter]. The adapter supports operating on metadata containing both primitive and list values. It does not support the combined adjacent query.

### Apache Cassandra {: #cassandra}

[Apache Cassandra](https://cassandra.apache.org/) is supported by the [`CassandraAdapter`][langchain_graph_retriever.adapters.cassandra.CassandraAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not support the combined adjacent query.

### Chroma

[Chroma](https://www.trychroma.com/) is supported by the [`ChromaAdapter`][langchain_graph_retriever.adapters.chroma.ChromaAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not support the combined adjacent query.

## Implementation

The [Adapter][graph_retriever.adapters.Adapter] interface may be implemented directly. For LangChain [VectorStores][langchain_core.vectorstores.base.VectorStore], [LangchainAdapter][langchain_graph_retriever.adapters.langchain.LangchainAdapter] and [DenormalizedAdapter][langchain_graph_retriever.adapters.langchain.DenormalizedAdapter] provide much of the necessary functionality.
41 changes: 41 additions & 0 deletions docs/get-started/edges.md
@@ -0,0 +1,41 @@
# Edges

Edges specify how content should be linked.
Often, content in existing vector stores has metadata based on structured information.
For example, a vector store containing articles may have information about the authors, keywords, and citations of those articles.
__Such content can be traversed along relationships already present in that metadata!__
See [Specifying Edges](#specifying-edges) for more on how edges are specified.

## Specifying Edges {: #specifying-edges}

```python title="Example content"
Content(
id="article1",
content="...",
metadata={
"keywords": ["GPT", "GenAI"],
"authors": ["Ben", "Eric"],
"primary_author": "Eric",
"cites": ["article2", "article3"],
}
)
```

1. `("keywords", "keywords")` connects to other articles about GPT and GenAI.
2. `("authors", "authors")` connects to other articles by any of the same authors.
3. `("authors", "primary_author")` connects to other articles whose primary author was Ben or Eric.
4. `("cites", Id())` connects to the articles cited (by ID).
5. `(Id(), "cites")` connects to articles which cite this one.
6. `("cites", "cites")` connects to other articles with citations in common.

## Edge Functions

While sometimes the information needed for traversal is missing and the vector store
needs to be re-populated, in other cases the information exists but is not quite
in a suitable format for traversal. For instance, the `"authors"` field may
contain a list of authors along with their institutions, making it impossible to link to
other articles by the same author when they were at a different institution.

In such cases, you can provide a custom
[`EdgeFunction`][graph_retriever.edges.EdgeFunction] to extract the edges for
traversal.
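
A minimal sketch of such a function is shown below. It assumes an `EdgeFunction` maps a piece of content to an `Edges` value holding incoming and outgoing `MetadataEdge`s; the exact import paths and constructors are assumptions, so check the API reference for the precise types.

```python
from graph_retriever import Content                    # assumed import location
from graph_retriever.edges import Edges, MetadataEdge  # assumed types

def author_edges(content: Content) -> Edges:
    """Link articles by author name only, dropping any institution suffix."""
    # e.g. "Ben (UC Berkeley)" -> "Ben"
    names = [author.split(" (")[0] for author in content.metadata.get("authors", [])]
    edges = [MetadataEdge("authors", name) for name in names]  # assumed (field, value) form
    # Use the same edges in both directions so articles link to each other.
    return Edges(incoming=edges, outgoing=edges)
```
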
185 changes: 141 additions & 44 deletions docs/get-started/index.md
@@ -8,9 +8,8 @@ We assume you already have a working `langchain` installation, including an LLM

In that case, you only need to install `langchain-graph-retriever`:

```bash
pip install langchain langchain-graph-retriever
```

## Preparing Data
@@ -51,59 +50,160 @@ For this guide, I have a JSON file with information about animals. Several examp
"habitat": "rainforest"
}
}
```

```python title="Fetching Animal Data"
from graph_rag_example_helpers.datasets.animals import fetch_documents

animals = fetch_documents()
```

## Populating the Vector Store

The following shows how to populate a variety of vector stores with the animal data.

=== "Astra"

```python
from dotenv import load_dotenv
from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

load_dotenv()
vector_store = AstraDBVectorStore.from_documents(
collection_name="animals",
documents=animals,
embedding=OpenAIEmbeddings(),
)
```

=== "Apache Cassandra"

```python
from langchain_community.vectorstores.cassandra import Cassandra
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)

metadata_denormalizer = MetadataDenormalizer() # (1)!
vector_store = Cassandra.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
table_name="animals",
)
```

1. Since Cassandra doesn't index items in lists for querying, it is necessary to
   denormalize metadata containing lists so they can be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
metadata keys used as edge targets.

=== "OpenSearch"

```python
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings

vector_store = OpenSearchVectorSearch.from_documents(
opensearch_url=OPEN_SEARCH_URL,
index_name="animals",
        embedding=OpenAIEmbeddings(),
engine="faiss",
documents=animals,
)
```

=== "Chroma"

```python
from langchain_chroma.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)

metadata_denormalizer = MetadataDenormalizer() # (1)!
vector_store = Chroma.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
        collection_name="animals",
)
```

1. Since Chroma doesn't index items in lists for querying, it is necessary to
   denormalize metadata containing lists so they can be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
metadata keys used as edge targets.

## Simple Traversal

For our first retrieval and graph traversal, we're going to start with a single animal best matching the query, and then traverse to other animals sharing the same `habitat`, `origin`, and/or `keywords`.

=== "Astra"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

simple = GraphRetriever(
store = vector_store,
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

=== "Apache Cassandra"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
from langchain_graph_retriever.adapters.cassandra import CassandraAdapter

simple = GraphRetriever(
        store = CassandraAdapter(vector_store, metadata_denormalizer, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

=== "OpenSearch"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

simple = GraphRetriever(
store = vector_store,
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```


=== "Chroma"

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
from langchain_graph_retriever.adapters.chroma import ChromaAdapter

simple = GraphRetriever(
store = ChromaAdapter(vector_store, metadata_denormalizer, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

!!! note "Denormalization"

    The code above is nearly the same for all stores; however, the adapters for denormalized stores (Chroma and Apache Cassandra) require extra configuration to specify which metadata fields need to be rewritten when issuing queries.

The above creates a graph traversing retriever that starts with the nearest animal (`start_k=1`), retrieves 10 documents (`k=10`) and limits the search to documents that are at most 2 steps away from the first animal (`depth=2`).

The edges define how metadata values can be used for traversal. In this case, every animal is connected to other animals sharing the same habitat, origin, and/or keywords.

```python
simple_results = simple.invoke("what mammals could be found near a capybara")

for doc in simple_results:
@@ -115,10 +215,7 @@ for doc in simple_results:
`langchain-graph-retriever` includes code for converting the document graph into a `networkx` graph, for rendering and other analysis.
See the figure rendered by the example below.

```python title="Graph of retrieved documents"
import networkx as nx
import matplotlib.pyplot as plt
from langchain_graph_retriever.document_graph import create_graph
29 changes: 29 additions & 0 deletions docs/get-started/strategies.md
@@ -0,0 +1,29 @@
# Strategies

Strategies determine which nodes are selected during [traversal](./traversal.md).

All strategies allow you to control how many nodes are retrieved (`k`), how many
nodes are found during the initial vector search (`start_k`), how many adjacent
nodes are fetched at each step of the traversal (`adjacent_k`), and the maximum
depth of the traversal (`max_depth`).

## Eager

The [`Eager`][graph_retriever.strategies.Eager] strategy selects all of the discovered nodes at each step of the traversal.

It doesn't support configuration beyond the standard options.
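
For example, an `Eager` strategy configured with only the standard options listed above (the values here are illustrative):

```python
from graph_retriever.strategies import Eager

# Select every discovered node until 10 have been retrieved, seeding the
# traversal with the 4 best vector-search matches, fetching up to 20 adjacent
# nodes per step, and stopping at depth 3.
strategy = Eager(k=10, start_k=4, adjacent_k=20, max_depth=3)
```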

## MMR

The [`MMR`][graph_retriever.strategies.Mmr] strategy selects nodes with the
highest maximum marginal relevance score at each iteration.

It can be configured with a `lambda_mult` which controls the trade-off between relevance and diversity.
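
A sketch of configuring it, assuming the usual convention that values closer to `1.0` weight relevance more heavily and values closer to `0.0` weight diversity more heavily:

```python
from graph_retriever.strategies import Mmr

# Trade a little relevance for more diverse results.
strategy = Mmr(k=10, start_k=4, adjacent_k=20, lambda_mult=0.4)
```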

## Scored

The [`Scored`][graph_retriever.strategies.Scored] strategy applies a user-defined function to each node to assign a score, and selects a number of nodes with the highest scores.
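
A sketch, assuming the scoring callable is passed as `scorer` and receives each discovered node; the parameter and attribute names here are assumptions, so check the API reference:

```python
from graph_retriever.strategies import Scored

# Prefer nodes whose metadata marks them as peer reviewed.
strategy = Scored(
    scorer=lambda node: 2.0 if node.metadata.get("peer_reviewed") else 1.0,  # assumed API
    k=10,
)
```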

## User-Defined Strategies

You can also implement your own [`Strategy`][graph_retriever.strategies.Strategy]. This allows you to control how discovered nodes are tracked and selected for traversal.
12 changes: 12 additions & 0 deletions docs/get-started/traversal.md
@@ -0,0 +1,12 @@
# Traversal

At a high level, traversal performs the following steps (a code sketch follows the list):

1. Retrieve the `start_k` nodes most similar to the `query` using vector search.
2. Find the nodes reachable from the `initial_root_ids`.
3. Discover the `start_k` nodes and the neighbors of the initial roots as "depth 0" candidates.
4. Ask the strategy which nodes to visit next.
5. If no more nodes to visit, exit and return the selected nodes.
6. Record those nodes as selected and retrieve the top `adjacent_k` nodes reachable from them.
7. Discover the newly reachable nodes (updating depths as needed).
8. Go to step 4.
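
The loop above can be sketched as follows. This is an illustrative outline only, not the library's implementation; `similar`, `adjacent`, and `select_next` are caller-supplied stand-ins for the vector search, the adjacent query, and the strategy.

```python
def traverse(similar, adjacent, select_next, start_k=4, adjacent_k=10):
    """Outline of the traversal loop; returns selected node ids mapped to depth."""
    # Steps 1-3: seed the candidate pool at depth 0.
    discovered = {node_id: 0 for node_id in similar(start_k)}
    selected: dict[str, int] = {}
    while True:
        # Steps 4-5: ask the strategy which nodes to visit next; stop when none remain.
        to_visit = select_next({n: d for n, d in discovered.items() if n not in selected})
        if not to_visit:
            return selected
        # Step 6: record the selected nodes and find the nodes reachable from them.
        for node_id in to_visit:
            selected[node_id] = discovered[node_id]
        next_depth = max(selected[n] for n in to_visit) + 1
        # Step 7: record newly reachable nodes, keeping the smallest known depth.
        for neighbor in adjacent(to_visit, adjacent_k):
            if neighbor not in discovered or discovered[neighbor] > next_depth:
                discovered[neighbor] = next_depth
```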