-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
updated the strategy design to separate traversal from node selection #152
base: main
Are you sure you want to change the base?
Conversation
max_depth : | ||
Maximum traversal depth. If `None`, there is no limit. | ||
""" | ||
|
||
k: int = 5 | ||
start_k: int = 4 | ||
adjacent_k: int = 10 | ||
traverse_k: int = 4 # max_traverse? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't currently used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a limit on the total number of nodes traversed
Pull Request Test Coverage Report for Build 13553888756Details
💛 - Coveralls |
Test Results 5 files - 3 5 suites - 3 13s ⏱️ - 2m 36s For more details on these failures, see this check. Results for commit 2b72460. ± Comparison against base commit 4981524. This pull request skips 296 tests.
♻️ This comment has been updated with latest results. |
|
||
def select_and_traverse(self, nodes: dict[str, Node]) -> None: | ||
"""Select nodes to be included in the result set and the next traversal.""" | ||
self.select(nodes=nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: doesn't need to be named -- just self.select(nodes)
.
max_depth : | ||
Maximum traversal depth. If `None`, there is no limit. | ||
""" | ||
|
||
k: int = 5 | ||
start_k: int = 4 | ||
adjacent_k: int = 10 | ||
traverse_k: int = 4 # max_traverse? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I believe this is a new bound which the traversal and/or strategy would need to respect and terminate traversal once the len(tracker.visited_ids) > traverse_k
or something.
"""Helper class for tracking traversal progress.""" | ||
|
||
def __init__(self) -> None: | ||
self.visited_ids: set[str] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should visited_ids
be removed for now? Or should more tracking logic move into here.
|
||
def traverse(self, nodes: dict[str, Node]) -> None: | ||
"""Select nodes to be included in the next traversal.""" | ||
self.to_traverse.update(nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if this doesn't track visited, then we potentially have cases where already visited nodes are added (increasing the traverse set) but not actually be traversed again. It may make the behavior of things easier to reason about if visited nodes are tracked here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have these return the number of nodes that will actually be traversed or selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty the to_traverse
queue at the end of an interation
""" | ||
Add discovered nodes to the strategy. | ||
Process the newly discovered nodes on each iteration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be worth elaborating on this -- the newly discovered seems important. For instance, this means that if a strategy wants to potentially visit them in a future iteration (such as MMR) it needs to remember these nodes and select/traverse them later.
next_outgoing_edges = self.select_next_edges() | ||
if next_outgoing_edges is None: | ||
break | ||
elif next_outgoing_edges: | ||
# Find the (new) document with incoming edges from those edges. | ||
adjacent_docs = self._fetch_adjacent(next_outgoing_edges) | ||
self.add_docs(adjacent_docs) | ||
adjacent_content = self._fetch_adjacent(next_outgoing_edges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps something along new_nodes
or discovered_nodes
rather than adjacent_content
to emphasize the newness?
@@ -442,8 +392,7 @@ def add_docs( | |||
self.strategy.max_depth is None or node.depth <= self.strategy.max_depth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels weird -- I wonder if max_depth
should be a parameter of the traverse methods. It feels weird that the traversal inspects the max_depth
parameter and implements it, rather than having it happen within the strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially move to node_tracker... this is another reason why a node might not be traversed. so it affects the strategy's decisions about traversal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can still move the parameter to traversal
@@ -485,13 +434,14 @@ def visit_nodes(self, nodes: Iterable[Node]) -> set[Edge]: | |||
|
|||
new_outgoing_edge_set = set(new_outgoing_edges.keys()) | |||
self._visited_edges.update(new_outgoing_edge_set) | |||
self._node_tracker.visited_ids.update([n.id for n in nodes]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be cleaner to have a self._node_tracker.visited(nodes)
or some such. The rationale being that then it is clear something is modifying the visited ids -- right now it looked pseudo unused.
if remaining <= 0: | ||
return None | ||
|
||
next_nodes = self.strategy.select_nodes(limit=remaining) | ||
next_nodes = self._node_tracker.to_traverse.values() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases (scored) it shouldn't select all of the to_traverse. I can see a few ways of thinking about it:
- The strategy should "buffer" nodes, and only add the nodes to
to_traverse
that it actually wishes to traverse (somewhat similar to MMR). - The strategy should add all nodes it has decided to traverse, and then it should be given a chance to remove the nodes it actually wants. This would allow the
to_traverse
to be used as the queue, and require a method on the strategy for extracting the nodes to traverse from that (default implementation would return all). - The traversal could implement this logic of choosing some number of nodes from the node trakcer.
Of these, it feels like we're currently doing (1). I wonder if (2) would be reasonable. It would allow for things like scored to just use the existing to_traverse
and implement the limit on how many to traverse at each iteration.
I believe it would also allow implement the max_traverse
since it could observe how many nodes are selected for traversal there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going with 1.
if not next_nodes: | ||
return None | ||
|
||
next_nodes = [n for n in next_nodes if n.id not in self._selected_nodes] | ||
if len(next_nodes) == 0: | ||
filtered_nodes = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the tracker should possibly have some more logic around visited IDs, especially for implementing things like max_traverse
/ traverse_k
and the "how many items to select at each iteration". Without knowing whether a given ID is a noop (already visited) or new, it becomes really difficult to implement that. For instance, if it has 100 items queued, and picks 10 of them, but those are already visited, we'll filter them all out and stop iteration, even if the other 90 haven't been visited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn on this. It seems like managing the already visited nodes isn't something we want the strategies to bother with.
If NodeTracker is the primary interface that a strategy can use to traverse or select a node, it seems like the interface should be limited to that.
For now I'm going to move the visited_ids
back to traversal.py
and remove it from the tracker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think we need the strategy/tracker to be aware of visited ids in order to get the correct behavior, for the reason described above.
For example, the scored strategy let's you say "pick the top N nodes each iteration". If that picks N visited nodes that get filtered out in traversal, then the traversal will interpret that as no new nodes to traverse, and exit early. We could distinguish between None
(terminate) and []
(everything was filtered out), but that still means that a strategy that is trying to "traverse N nodes" will actually "traverse up to N nodes" based on which ones are actually new. Given that traversing only affects new nodes, it seems important for the strategy to actually know which nodes are new, and thus which will be traversed.
start_k : | ||
Number of documents to fetch via similarity for starting the traversal. | ||
Added to any initial roots provided to the traversal. | ||
adjacent_k : | ||
Number of documents to fetch for each outgoing edge. | ||
traverse_k : | ||
Maximum number of nodes to traverse outgoing edges from before returning. | ||
max_depth : | ||
Maximum traversal depth. If `None`, there is no limit. | ||
""" | ||
|
||
k: int = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially change to select_k
@@ -10,6 +10,28 @@ | |||
from graph_retriever.types import Node | |||
|
|||
|
|||
class NodeTracker: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a property to compute the remaining unselected.
def remaining(self): | ||
"""The remaining number of nodes to be selected""" | ||
return max(self._select_k - len(self.selected), 0) | ||
|
||
def select(self, nodes: dict[str, Node]) -> None: | ||
"""Select nodes to be included in the result set.""" | ||
self.selected.update(nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should select
return the number of new nodes (or the number of total nodes)?
I guess most use cases would be better off with len(tracker.selected)
. Although, we could have a @property def num_selected(self) -> int
?
self.to_traverse: dict[str, Node] = {} | ||
self.selected: dict[str, Node] = {} | ||
|
||
@property | ||
def remaining(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe num_remaining
or something like that to emphasize it's a number? And maybe mentioning selected (num_unselected
?)
No description provided.