updated the strategy design to separate traversal from node selection #152

epinzur · 2025-02-26T21:38:25Z

No description provided.

epinzur · 2025-02-26T21:39:14Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

    max_depth :
        Maximum traversal depth. If `None`, there is no limit.
    """

    k: int = 5
    start_k: int = 4
    adjacent_k: int = 10
+    traverse_k: int = 4  # max_traverse?


this isn't currently used anywhere.

Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k or something.

this is a limit on the total number of nodes traversed

coveralls · 2025-02-26T21:42:44Z

Pull Request Test Coverage Report for Build 13553888756

Details

79 of 79 (100.0%) changed or added relevant lines in 5 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage remained the same at 100.0%

Files with Coverage Reduction	New Missed Lines	%
packages/graph-retriever/src/graph_retriever/traversal.py	1	95.97%

Totals
Change from base Build 13526256011:	0.0%
Covered Lines:	0
Relevant Lines:	0

💛 - Coveralls

github-actions · 2025-02-26T21:43:45Z

Test Results

5 files - 3 5 suites - 3 13s ⏱️ - 2m 36s
519 tests ± 0 170 ✅ - 325 320 💤 +296 29 ❌ + 29
837 runs - 1 239 410 ✅ - 678 320 💤 - 668 107 ❌ +107

For more details on these failures, see this check.

Results for commit 2b72460. ± Comparison against base commit 4981524.

This pull request skips 296 tests.

packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[absent_dict]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[dict_in_list]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[dict_in_list_multiple]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[filtered_ids]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[ids_limit_k]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[many_ids]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[metadata_and_id]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested_diff_field]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested_same_field]
…

♻️ This comment has been updated with latest results.

bjchambers · 2025-02-27T05:35:04Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

+
+    def select_and_traverse(self, nodes: dict[str, Node]) -> None:
+        """Select nodes to be included in the result set and the next traversal."""
+        self.select(nodes=nodes)


nit: doesn't need to be named -- just self.select(nodes).

bjchambers · 2025-02-27T05:36:02Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

    max_depth :
        Maximum traversal depth. If `None`, there is no limit.
    """

    k: int = 5
    start_k: int = 4
    adjacent_k: int = 10
+    traverse_k: int = 4  # max_traverse?


Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k or something.

bjchambers · 2025-02-27T05:36:19Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

+    """Helper class for tracking traversal progress."""
+
+    def __init__(self) -> None:
+        self.visited_ids: set[str] = set()


Should visited_ids be removed for now? Or should more tracking logic move into here.

bjchambers · 2025-02-27T05:37:16Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

+
+    def traverse(self, nodes: dict[str, Node]) -> None:
+        """Select nodes to be included in the next traversal."""
+        self.to_traverse.update(nodes)


Note: if this doesn't track visited, then we potentially have cases where already visited nodes are added (increasing the traverse set) but not actually be traversed again. It may make the behavior of things easier to reason about if visited nodes are tracked here.

have these return the number of nodes that will actually be traversed or selected.

empty the to_traverse queue at the end of an interation

bjchambers · 2025-02-27T05:38:28Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

        """
-        Add discovered nodes to the strategy.
+        Process the newly discovered nodes on each iteration.


May be worth elaborating on this -- the newly discovered seems important. For instance, this means that if a strategy wants to potentially visit them in a future iteration (such as MMR) it needs to remember these nodes and select/traverse them later.

bjchambers · 2025-02-27T05:42:28Z

packages/graph-retriever/src/graph_retriever/traversal.py

            next_outgoing_edges = self.select_next_edges()
            if next_outgoing_edges is None:
                break
            elif next_outgoing_edges:
                # Find the (new) document with incoming edges from those edges.
-                adjacent_docs = self._fetch_adjacent(next_outgoing_edges)
-                self.add_docs(adjacent_docs)
+                adjacent_content = self._fetch_adjacent(next_outgoing_edges)


Perhaps something along new_nodes or discovered_nodes rather than adjacent_content to emphasize the newness?

bjchambers · 2025-02-27T05:43:59Z

packages/graph-retriever/src/graph_retriever/traversal.py

@@ -442,8 +392,7 @@ def add_docs(
                self.strategy.max_depth is None or node.depth <= self.strategy.max_depth


This feels weird -- I wonder if max_depth should be a parameter of the traverse methods. It feels weird that the traversal inspects the max_depth parameter and implements it, rather than having it happen within the strategy.

potentially move to node_tracker... this is another reason why a node might not be traversed. so it affects the strategy's decisions about traversal.

can still move the parameter to traversal

bjchambers · 2025-02-27T05:44:48Z

packages/graph-retriever/src/graph_retriever/traversal.py

@@ -485,13 +434,14 @@ def visit_nodes(self, nodes: Iterable[Node]) -> set[Edge]:

        new_outgoing_edge_set = set(new_outgoing_edges.keys())
        self._visited_edges.update(new_outgoing_edge_set)
+        self._node_tracker.visited_ids.update([n.id for n in nodes])


It might be cleaner to have a self._node_tracker.visited(nodes) or some such. The rationale being that then it is clear something is modifying the visited ids -- right now it looked pseudo unused.

bjchambers · 2025-02-27T05:47:41Z

packages/graph-retriever/src/graph_retriever/traversal.py

        if remaining <= 0:
            return None

-        next_nodes = self.strategy.select_nodes(limit=remaining)
+        next_nodes = self._node_tracker.to_traverse.values()


In some cases (scored) it shouldn't select all of the to_traverse. I can see a few ways of thinking about it:

The strategy should "buffer" nodes, and only add the nodes to to_traverse that it actually wishes to traverse (somewhat similar to MMR).

The strategy should add all nodes it has decided to traverse, and then it should be given a chance to remove the nodes it actually wants. This would allow the to_traverse to be used as the queue, and require a method on the strategy for extracting the nodes to traverse from that (default implementation would return all).

The traversal could implement this logic of choosing some number of nodes from the node trakcer.

Of these, it feels like we're currently doing (1). I wonder if (2) would be reasonable. It would allow for things like scored to just use the existing to_traverse and implement the limit on how many to traverse at each iteration.

I believe it would also allow implement the max_traverse since it could observe how many nodes are selected for traversal there.

going with 1.

bjchambers · 2025-02-27T05:49:12Z

packages/graph-retriever/src/graph_retriever/traversal.py

        if not next_nodes:
            return None

-        next_nodes = [n for n in next_nodes if n.id not in self._selected_nodes]
-        if len(next_nodes) == 0:
+        filtered_nodes = [


I think the tracker should possibly have some more logic around visited IDs, especially for implementing things like max_traverse / traverse_k and the "how many items to select at each iteration". Without knowing whether a given ID is a noop (already visited) or new, it becomes really difficult to implement that. For instance, if it has 100 items queued, and picks 10 of them, but those are already visited, we'll filter them all out and stop iteration, even if the other 90 haven't been visited.

I'm torn on this. It seems like managing the already visited nodes isn't something we want the strategies to bother with.

If NodeTracker is the primary interface that a strategy can use to traverse or select a node, it seems like the interface should be limited to that.

For now I'm going to move the visited_ids back to traversal.py and remove it from the tracker.

I actually think we need the strategy/tracker to be aware of visited ids in order to get the correct behavior, for the reason described above.

For example, the scored strategy let's you say "pick the top N nodes each iteration". If that picks N visited nodes that get filtered out in traversal, then the traversal will interpret that as no new nodes to traverse, and exit early. We could distinguish between None (terminate) and [] (everything was filtered out), but that still means that a strategy that is trying to "traverse N nodes" will actually "traverse up to N nodes" based on which ones are actually new. Given that traversing only affects new nodes, it seems important for the strategy to actually know which nodes are new, and thus which will be traversed.

epinzur · 2025-02-27T22:18:02Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

    start_k :
        Number of documents to fetch via similarity for starting the traversal.
        Added to any initial roots provided to the traversal.
    adjacent_k :
        Number of documents to fetch for each outgoing edge.
+    traverse_k :
+        Maximum number of nodes to traverse outgoing edges from before returning.
    max_depth :
        Maximum traversal depth. If `None`, there is no limit.
    """

    k: int = 5


potentially change to select_k

epinzur · 2025-02-27T22:33:37Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

@@ -10,6 +10,28 @@
 from graph_retriever.types import Node


+class NodeTracker:


maybe add a property to compute the remaining unselected.

bjchambers · 2025-02-28T00:25:07Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

+    def remaining(self):
+        """The remaining number of nodes to be selected"""
+        return max(self._select_k - len(self.selected), 0)
+
    def select(self, nodes: dict[str, Node]) -> None:
        """Select nodes to be included in the result set."""
        self.selected.update(nodes)


Should select return the number of new nodes (or the number of total nodes)?

I guess most use cases would be better off with len(tracker.selected). Although, we could have a @property def num_selected(self) -> int?

bjchambers · 2025-02-28T00:25:37Z

packages/graph-retriever/src/graph_retriever/strategies/base.py

        self.to_traverse: dict[str, Node] = {}
        self.selected: dict[str, Node] = {}

+    @property
+    def remaining(self):


Maybe num_remaining or something like that to emphasize it's a number? And maybe mentioning selected (num_unselected?)

bjchambers and others added 3 commits February 25, 2025 10:05

wip play around with some strategy ideas

db1f3ae

stub

66b852b

implemented new strategy design

4d7da52

epinzur commented Feb 26, 2025

View reviewed changes

bjchambers reviewed Feb 27, 2025

View reviewed changes

epinzur commented Feb 27, 2025

View reviewed changes

progress

2b72460

bjchambers reviewed Feb 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

updated the strategy design to separate traversal from node selection #152

updated the strategy design to separate traversal from node selection #152

epinzur commented Feb 26, 2025

epinzur Feb 26, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

coveralls commented Feb 26, 2025 •

edited

Loading

github-actions bot commented Feb 26, 2025 •

edited

Loading

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

epinzur Feb 27, 2025

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

epinzur Feb 27, 2025

bjchambers Feb 27, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

bjchambers Feb 27, 2025

epinzur Feb 27, 2025

epinzur Feb 27, 2025

bjchambers Feb 28, 2025

bjchambers Feb 28, 2025

		@@ -442,8 +392,7 @@ def add_docs(
		self.strategy.max_depth is None or node.depth <= self.strategy.max_depth

		@@ -10,6 +10,28 @@
		from graph_retriever.types import Node


		class NodeTracker:

updated the strategy design to separate traversal from node selection #152

Are you sure you want to change the base?

updated the strategy design to separate traversal from node selection #152

Conversation

epinzur commented Feb 26, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Feb 26, 2025 • edited Loading

Pull Request Test Coverage Report for Build 13553888756

Details

💛 - Coveralls

github-actions bot commented Feb 26, 2025 • edited Loading

Test Results

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Feb 26, 2025 •

edited

Loading

github-actions bot commented Feb 26, 2025 •

edited

Loading