Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updated the strategy design to separate traversal from node selection #152

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

epinzur
Copy link
Collaborator

@epinzur epinzur commented Feb 26, 2025

No description provided.

max_depth :
Maximum traversal depth. If `None`, there is no limit.
"""

k: int = 5
start_k: int = 4
adjacent_k: int = 10
traverse_k: int = 4 # max_traverse?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't currently used anywhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k or something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a limit on the total number of nodes traversed

@coveralls
Copy link

coveralls commented Feb 26, 2025

Pull Request Test Coverage Report for Build 13553888756

Details

  • 79 of 79 (100.0%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage remained the same at 100.0%

Files with Coverage Reduction New Missed Lines %
packages/graph-retriever/src/graph_retriever/traversal.py 1 95.97%
Totals Coverage Status
Change from base Build 13526256011: 0.0%
Covered Lines: 0
Relevant Lines: 0

💛 - Coveralls

Copy link

github-actions bot commented Feb 26, 2025

Test Results

  5 files   -     3    5 suites   - 3   13s ⏱️ - 2m 36s
519 tests ±    0  170 ✅  - 325  320 💤 +296   29 ❌ + 29 
837 runs   - 1 239  410 ✅  - 678  320 💤  - 668  107 ❌ +107 

For more details on these failures, see this check.

Results for commit 2b72460. ± Comparison against base commit 4981524.

This pull request skips 296 tests.
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[absent_dict]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[dict_in_list]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[dict_in_list_multiple]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[filtered_ids]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[ids_limit_k]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[many_ids]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[metadata_and_id]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested_diff_field]
packages.langchain-graph-retriever.tests.adapters.test_astra.TestAstraAdapter ‑ test_aadjacent[nested_same_field]
…

♻️ This comment has been updated with latest results.


def select_and_traverse(self, nodes: dict[str, Node]) -> None:
"""Select nodes to be included in the result set and the next traversal."""
self.select(nodes=nodes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doesn't need to be named -- just self.select(nodes).

max_depth :
Maximum traversal depth. If `None`, there is no limit.
"""

k: int = 5
start_k: int = 4
adjacent_k: int = 10
traverse_k: int = 4 # max_traverse?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k or something.

"""Helper class for tracking traversal progress."""

def __init__(self) -> None:
self.visited_ids: set[str] = set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should visited_ids be removed for now? Or should more tracking logic move into here.


def traverse(self, nodes: dict[str, Node]) -> None:
"""Select nodes to be included in the next traversal."""
self.to_traverse.update(nodes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if this doesn't track visited, then we potentially have cases where already visited nodes are added (increasing the traverse set) but not actually be traversed again. It may make the behavior of things easier to reason about if visited nodes are tracked here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have these return the number of nodes that will actually be traversed or selected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty the to_traverse queue at the end of an interation

"""
Add discovered nodes to the strategy.
Process the newly discovered nodes on each iteration.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be worth elaborating on this -- the newly discovered seems important. For instance, this means that if a strategy wants to potentially visit them in a future iteration (such as MMR) it needs to remember these nodes and select/traverse them later.

next_outgoing_edges = self.select_next_edges()
if next_outgoing_edges is None:
break
elif next_outgoing_edges:
# Find the (new) document with incoming edges from those edges.
adjacent_docs = self._fetch_adjacent(next_outgoing_edges)
self.add_docs(adjacent_docs)
adjacent_content = self._fetch_adjacent(next_outgoing_edges)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps something along new_nodes or discovered_nodes rather than adjacent_content to emphasize the newness?

@@ -442,8 +392,7 @@ def add_docs(
self.strategy.max_depth is None or node.depth <= self.strategy.max_depth
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels weird -- I wonder if max_depth should be a parameter of the traverse methods. It feels weird that the traversal inspects the max_depth parameter and implements it, rather than having it happen within the strategy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially move to node_tracker... this is another reason why a node might not be traversed. so it affects the strategy's decisions about traversal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can still move the parameter to traversal

@@ -485,13 +434,14 @@ def visit_nodes(self, nodes: Iterable[Node]) -> set[Edge]:

new_outgoing_edge_set = set(new_outgoing_edges.keys())
self._visited_edges.update(new_outgoing_edge_set)
self._node_tracker.visited_ids.update([n.id for n in nodes])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be cleaner to have a self._node_tracker.visited(nodes) or some such. The rationale being that then it is clear something is modifying the visited ids -- right now it looked pseudo unused.

if remaining <= 0:
return None

next_nodes = self.strategy.select_nodes(limit=remaining)
next_nodes = self._node_tracker.to_traverse.values()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases (scored) it shouldn't select all of the to_traverse. I can see a few ways of thinking about it:

  1. The strategy should "buffer" nodes, and only add the nodes to to_traverse that it actually wishes to traverse (somewhat similar to MMR).
  2. The strategy should add all nodes it has decided to traverse, and then it should be given a chance to remove the nodes it actually wants. This would allow the to_traverse to be used as the queue, and require a method on the strategy for extracting the nodes to traverse from that (default implementation would return all).
  3. The traversal could implement this logic of choosing some number of nodes from the node trakcer.

Of these, it feels like we're currently doing (1). I wonder if (2) would be reasonable. It would allow for things like scored to just use the existing to_traverse and implement the limit on how many to traverse at each iteration.

I believe it would also allow implement the max_traverse since it could observe how many nodes are selected for traversal there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going with 1.

if not next_nodes:
return None

next_nodes = [n for n in next_nodes if n.id not in self._selected_nodes]
if len(next_nodes) == 0:
filtered_nodes = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tracker should possibly have some more logic around visited IDs, especially for implementing things like max_traverse / traverse_k and the "how many items to select at each iteration". Without knowing whether a given ID is a noop (already visited) or new, it becomes really difficult to implement that. For instance, if it has 100 items queued, and picks 10 of them, but those are already visited, we'll filter them all out and stop iteration, even if the other 90 haven't been visited.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on this. It seems like managing the already visited nodes isn't something we want the strategies to bother with.

If NodeTracker is the primary interface that a strategy can use to traverse or select a node, it seems like the interface should be limited to that.

For now I'm going to move the visited_ids back to traversal.py and remove it from the tracker.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we need the strategy/tracker to be aware of visited ids in order to get the correct behavior, for the reason described above.

For example, the scored strategy let's you say "pick the top N nodes each iteration". If that picks N visited nodes that get filtered out in traversal, then the traversal will interpret that as no new nodes to traverse, and exit early. We could distinguish between None (terminate) and [] (everything was filtered out), but that still means that a strategy that is trying to "traverse N nodes" will actually "traverse up to N nodes" based on which ones are actually new. Given that traversing only affects new nodes, it seems important for the strategy to actually know which nodes are new, and thus which will be traversed.

start_k :
Number of documents to fetch via similarity for starting the traversal.
Added to any initial roots provided to the traversal.
adjacent_k :
Number of documents to fetch for each outgoing edge.
traverse_k :
Maximum number of nodes to traverse outgoing edges from before returning.
max_depth :
Maximum traversal depth. If `None`, there is no limit.
"""

k: int = 5
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially change to select_k

@@ -10,6 +10,28 @@
from graph_retriever.types import Node


class NodeTracker:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a property to compute the remaining unselected.

def remaining(self):
"""The remaining number of nodes to be selected"""
return max(self._select_k - len(self.selected), 0)

def select(self, nodes: dict[str, Node]) -> None:
"""Select nodes to be included in the result set."""
self.selected.update(nodes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should select return the number of new nodes (or the number of total nodes)?

I guess most use cases would be better off with len(tracker.selected). Although, we could have a @property def num_selected(self) -> int?

self.to_traverse: dict[str, Node] = {}
self.selected: dict[str, Node] = {}

@property
def remaining(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe num_remaining or something like that to emphasize it's a number? And maybe mentioning selected (num_unselected?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants