p2p/discover: improved node revalidation #687
Open

sonhv0212 wants to merge 4 commits into axieinfinity:master from sonhv0212:discovery-new-reval
Conversation
Force-pushed from b31535a to 856723c
We should copy the description from the above commit to this PR. This PR will revert the effect of this commit: f0935e6 ("p2p/discover: apply enr filtering in discovery phase, fix minor logic in doRevalidate"). So maybe @huyngopt1994 wants to take a look.
@minh-bq Yes, @sonhv0212 needs to rework this PR to adapt it to the ENR filtering implementation.
@sonhv0212 Could you rework this PR to include some commits from the private repo here?
Force-pushed from 856723c to 831f53a
Node discovery periodically revalidates the nodes in its table by sending PING, checking if they are still alive. I recently noticed some issues with the implementation of this process, which can cause strange results such as nodes dropping unexpectedly, certain nodes not getting revalidated often enough, and bad results being returned to incoming FINDNODE queries. In this change, the revalidation process is improved with the following logic:

- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list. Once validation passes, it transfers to the 'slow' list. If a request fails, or the node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation, we will not drop the node on the first failing check. We instead quickly decay the livenessChecks and give it another chance.
- Order of nodes in the bucket doesn't matter anymore.

I am also adding a debug API endpoint to dump the node table content.

Co-Authored-By: Martin HS <martin@swende.se>
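For illustration, here is a minimal Go sketch of the two-list scheme described in that commit message. The type and method names, and the decay factor, are assumptions made for this sketch, not necessarily the identifiers used in the actual change.

```go
package discover

import "time"

type revalNode struct {
	livenessChecks uint
	// endpoint and ENR fields omitted
}

type revalidationList struct {
	nodes    []*revalNode
	interval time.Duration // sampling interval; shorter for 'fast' than for 'slow'
}

type tableRevalidation struct {
	fast revalidationList // newly added nodes and nodes that recently failed a check
	slow revalidationList // nodes that have passed at least one check
}

// nodeAdded puts a freshly inserted table node on the fast schedule.
func (tr *tableRevalidation) nodeAdded(n *revalNode) {
	tr.fast.nodes = append(tr.fast.nodes, n)
}

// handleResponse moves the node between lists based on the PING outcome.
func (tr *tableRevalidation) handleResponse(n *revalNode, alive, endpointChanged bool) {
	if alive && !endpointChanged {
		n.livenessChecks++
		tr.move(n, &tr.slow) // proven-alive nodes are revalidated less often
		return
	}
	// The node is not dropped on the first failed check; its liveness score
	// decays quickly (the exact factor here is an assumption) and it gets
	// another chance on the fast schedule.
	n.livenessChecks /= 3
	tr.move(n, &tr.fast)
}

func (tr *tableRevalidation) move(n *revalNode, dst *revalidationList) {
	tr.fast.remove(n)
	tr.slow.remove(n)
	dst.nodes = append(dst.nodes, n)
}

func (l *revalidationList) remove(n *revalNode) {
	for i, m := range l.nodes {
		if m == n {
			l.nodes = append(l.nodes[:i], l.nodes[i+1:]...)
			return
		}
	}
}
```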
In #29572, I assumed the revalidation list that the node is contained in could only ever be changed by the outcome of a revalidation request. But it turns out that's not true: if the node gets removed due to a FINDNODE failure, it will also be removed from the list it is in. This causes a crash. The invariant is: while a node is in the table, it is always in exactly one of the two lists. So it seems best to store a pointer to the current list within the node itself.
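A minimal standalone sketch of that invariant; the field and function names here are illustrative, not the actual code:

```go
package discover

type nodeList struct{ nodes []*tableNode }

type tableNode struct {
	// revalList points at the one revalidation list ('fast' or 'slow') that
	// currently holds this node; it is non-nil exactly while the node is in
	// the table.
	revalList *nodeList
}

func (l *nodeList) remove(n *tableNode) {
	for i, m := range l.nodes {
		if m == n {
			l.nodes = append(l.nodes[:i], l.nodes[i+1:]...)
			return
		}
	}
}

// nodeRemoved is called whenever a node leaves the table for any reason,
// including a FINDNODE failure, so the node is always taken off its list.
func nodeRemoved(n *tableNode) {
	n.revalList.remove(n)
	n.revalList = nil
}
```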
It seems the semantic differences between addFoundNode and addInboundNode were lost in #29572: addFoundNode is for a node we have not contacted directly (and are unsure is available), whereas addInboundNode is for adding nodes that have contacted the local node and we can verify they are active. handleAddNode seems to be the consolidation of those two methods, yet it bumps the node in the bucket (updating its IP address) even if the node was not inbound. This PR fixes this. It wasn't originally caught in tests like TestTable_addSeenNode because the manipulation of the node object actually modified the node value used by the test. New logic is added to reject non-inbound updates unless the sequence number of the (signed) ENR increases. Inbound updates, which are published by the updated node itself, are always accepted. If an inbound update changes the endpoint, the node will be revalidated on an expedited schedule.

Co-Authored-By: Felix Lange <fjl@twurst.com>
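A hedged sketch of that acceptance rule, with simplified types; the names and the string endpoint representation are assumptions made for the example:

```go
package discover

// record is a simplified stand-in for a node's signed ENR.
type record struct {
	seq      uint64 // ENR sequence number
	endpoint string // "IP:port", simplified for the sketch
}

type bucketNode struct {
	rec record
	// other table bookkeeping omitted
}

// applyUpdate reports whether newRec replaces the stored record and whether
// the node should be moved to the expedited (fast) revalidation schedule.
func applyUpdate(n *bucketNode, newRec record, inbound bool) (accepted, expedite bool) {
	if !inbound && newRec.seq <= n.rec.seq {
		// A found (non-inbound) node may be stale: accept its record only if
		// the signed ENR sequence number strictly increases.
		return false, false
	}
	endpointChanged := newRec.endpoint != n.rec.endpoint
	n.rec = newRec
	// Inbound updates come from the node itself and are always accepted; if
	// they change the endpoint, the node should be revalidated soon.
	return true, inbound && endpointChanged
}
```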
…(#30239) If `nextTime` has passed but all nodes are excluded, `get` would return `nil` and `run` would therefore not invoke `schedule`. We then schedule a timer for the past, as neither `nextTime` value has been updated. This creates a busy loop, as the timer fires immediately. With this PR, revalidation is also rescheduled when all nodes are excluded.

Co-Authored-By: lightclient <lightclient@protonmail.com>
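The fix can be sketched as follows, as a standalone example with illustrative names; the exclusion criterion shown (a pending request) is just an example:

```go
package discover

import (
	"math/rand"
	"time"
)

type candidate struct{ requestPending bool }

type revalSchedule struct {
	nodes    []*candidate
	nextTime time.Time
	interval time.Duration
}

// get returns a node to revalidate, or nil if it is not yet time or no node
// is eligible. Whenever it runs past nextTime, nextTime is pushed into the
// future, so the caller's timer cannot spin.
func (s *revalSchedule) get(now time.Time) *candidate {
	if now.Before(s.nextTime) {
		return nil
	}
	eligible := make([]*candidate, 0, len(s.nodes))
	for _, n := range s.nodes {
		if !n.requestPending { // excluded nodes already have a request in flight
			eligible = append(eligible, n)
		}
	}
	// Reschedule even when every node is excluded. Before the fix, nextTime
	// was only advanced after a node had been picked, so an all-excluded list
	// left nextTime in the past and the timer fired in a tight loop.
	s.nextTime = now.Add(s.interval)
	if len(eligible) == 0 {
		return nil
	}
	return eligible[rand.Intn(len(eligible))]
}
```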
Force-pushed from 831f53a to c0ed11c
Description
Cherry-picked the following commits from go-ethereum to improve table revalidation and fix some follow-up issues:
ethereum/go-ethereum#29572
ethereum/go-ethereum#29864
ethereum/go-ethereum#29836
ethereum/go-ethereum#30239
Problem
The current node discovery revalidation process has issues that lead to inconsistent results, such as unexpected node drops (a node is dropped the first time revalidation fails) and uneven revalidation frequencies. The old approach selected a random bucket roughly every 10 s, validated its last node via PING, and replaced it if unresponsive. This was inefficient: each node was revalidated only about every 13.3 minutes, with an uneven distribution favoring nodes in less-populated buckets.
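For intuition on the ~13.3-minute figure, assuming a typical table where about 5 buckets are populated with 16 nodes each (the populated-bucket count is an assumption for this example): a bucket is chosen uniformly roughly every 10 s and only one of its nodes is pinged, so each node is revalidated about every 5 × 16 × 10 s = 800 s ≈ 13.3 min. A node sitting in a bucket with only 2 members would instead be checked about every 5 × 2 × 10 s = 100 s, which is the uneven distribution mentioned above.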
Benchmark result
Strategy: Use p2psim to simulate a network of 200 nodes, 50% of which are unhealthy.
Conclusion: