fix(l1): full sync #2061
base: main
Conversation
@@ -29,12 +27,21 @@ use crate::{
     snap::encodable_to_proof,
 };
 use tracing::info;
-pub const PEER_REPLY_TIMOUT: Duration = Duration::from_secs(45);
+pub const PEER_REPLY_TIMEOUT: Duration = Duration::from_secs(5);
I wonder if, instead of having a single constant for all requests, this should be more fine-grained. I guess some snap requests are going to be larger and need a longer timeout.
In full syncing, requests are typically fulfilled very quickly (well under 5 seconds). If a peer takes that long to respond, it's likely unstable, and we should consider removing it. For comparison, geth applies the same timeout to all requests (see here), but their limit is much higher: 30 seconds.
The problem is that we don't have a great way to filter "well-behaved" peers; we just select them randomly. When we have only a small number of peers and allow timeouts of 30 or 45 seconds (as before), syncing slows down significantly if one of them takes too long to respond. The real fix for this should come in #2073.
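For illustration, a minimal sketch of the fine-grained alternative suggested above (the RequestKind enum and the specific durations are hypothetical, not part of this PR):

```rust
use std::time::Duration;

/// Hypothetical request categories; the variants and values are illustrative only.
enum RequestKind {
    BlockHeaders,
    BlockBodies,
    SnapAccountRange,
    SnapStorageRanges,
}

impl RequestKind {
    /// Short timeouts for cheap block requests, longer ones for heavier snap requests.
    fn reply_timeout(&self) -> Duration {
        match self {
            RequestKind::BlockHeaders | RequestKind::BlockBodies => Duration::from_secs(5),
            RequestKind::SnapAccountRange | RequestKind::SnapStorageRanges => Duration::from_secs(15),
        }
    }
}
```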
An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies when the batch contains an invalid header.
crates/networking/p2p/sync.rs
Outdated
@@ -165,6 +168,8 @@ impl SyncManager {
         // This step is not parallelized
         let mut all_block_hashes = vec![];
         // Check if we have some blocks downloaded from a previous sync attempt
+        // We only check for snap syncing as full sync will start fetching headers starting from the canonical block
+        // which is update every time a new block is added which is done as we fetch block headers
The comment is a bit weird, I don't really understand it. Just to clarify: canonical blocks are not updated when you add a block, but when you apply a forkchoice update.
Currently, during full syncing, any newly added block is always set as the head of the chain. When we receive a fork choice update, we start from the canonical chain head, which, in full sync mode, corresponds to the most recently added block. So, for full syncing, we don't really need to have checkpoints as they are set when a new block is added.
Maybe we are re-requesting headers or block bodies, but IMO we shouldn't worry about that now.
I will update the comment though, I agree it is a bit confusing.
Updated comment in 017c7cc.
crates/networking/p2p/sync.rs
Outdated
        )
        .await?
    }
    // Full sync stores and executes blocks as it ask for the headers
Suggested change:
-    // Full sync stores and executes blocks as it ask for the headers
+    // Full sync stores and executes blocks as it asks for the headers
Does Geth also start downloading bodies before it finishes downloading headers?
Because by downloading only the block headers and verifying them (that the chain is correct), you can have an "optimistic head". This is probably out of scope for this PR though.
Geth runs multiple fetchers in parallel: one process fetches headers, another validates them and pushes them into a queue, and a third fetches the bodies and executes them to insert the blocks into a sidechain, which at some point is set as the head. See here.
I agree that requesting all headers and validating them first could be better. Currently, we have no way to retrieve block headers in a particular order (by block number, for example). Since our table maps BlockHash => BlockHeader, fetching them in order isn't trivial. I guess one solution might be to redesign the table so the key is a tuple of (BlockHash, BlockNumber).
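For illustration only, a minimal sketch of the composite-key idea (the types and the in-memory table are placeholders, not the actual storage layer; whether the number or the hash comes first in the key determines whether a plain range scan yields chain order):

```rust
use std::collections::BTreeMap;

type BlockNumber = u64;
type BlockHash = [u8; 32]; // placeholder for the store's real hash type

// Hypothetical composite key: with the number first, a plain range scan
// over the table yields headers in chain order.
type HeaderKey = (BlockNumber, BlockHash);

struct HeaderTable {
    // Stand-in for the real storage backend; values are encoded headers.
    inner: BTreeMap<HeaderKey, Vec<u8>>,
}

impl HeaderTable {
    /// Iterate encoded headers for a contiguous range of block numbers, in order.
    fn headers_in_range<'a>(
        &'a self,
        from: BlockNumber,
        to: BlockNumber,
    ) -> impl Iterator<Item = &'a Vec<u8>> + 'a {
        self.inner
            .range((from, [0u8; 32])..=(to, [0xffu8; 32]))
            .map(|(_, header)| header)
    }
}
```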
Fixed typo in 017c7cc.
            return Err(error.into());
    /// Requests block bodies from peers via p2p, executes and stores them
    /// Returns an error if there was a problem while executing or validating the blocks
    async fn download_and_run_blocks(
nit: what do you think of moving full-sync-specific logic to its own file?
Yes, that would be a great refactor. Right now we have a bunch of ifs checking the sync mode, which makes the code rather uncomfortable to read IMO.
    let block = Block::new(header, body);
    if let Err(error) = self.blockchain.add_block(&block) {
        warn!("Failed to add block during FullSync: {error}");
        self.invalid_ancestors.insert(hash, last_valid_hash);
Potentially you are setting H256::default() here, does that make sense?
Yes. As defined in invalid ancestors, when the last valid hash is not known, the engine API specifies returning 0x0 or null as the latest_valid_hash. See point 3 here.
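For reference, a minimal sketch of that convention (the struct is illustrative; H256::zero(), which equals H256::default(), stands in for the 0x0 latestValidHash the Engine API allows when no valid ancestor is known):

```rust
use ethereum_types::H256;
use std::collections::HashMap;

// Illustrative only: maps an invalid block hash to the latest valid
// ancestor known at the time the block was rejected.
#[derive(Default)]
struct InvalidAncestors(HashMap<H256, H256>);

impl InvalidAncestors {
    // When no valid ancestor is known, 0x0 (H256::zero()) plays the role the
    // Engine API assigns to an unknown latestValidHash for an INVALID payload.
    fn mark_invalid(&mut self, bad: H256, last_valid: Option<H256>) {
        self.0.insert(bad, last_valid.unwrap_or_else(H256::zero));
    }
}
```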
        return Err(error.into());
    }
    store.set_canonical_block(number, hash)?;
    store.update_latest_block_number(number)?;
Not saying it should be done in this PR, but ideally this would call the forkchoice_updated function (or a set_head) so that we don't repeat this logic.
Yes I agree, this should be refactored.
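For illustration, a sketch of the kind of set_head helper such a refactor could introduce (the trait and error type below are hypothetical stand-ins; the two wrapped calls are the ones visible in the diff above):

```rust
use ethereum_types::H256;

// Hypothetical error type standing in for the store's real error.
#[derive(Debug)]
struct StoreError;

// Minimal stand-in for the parts of the store used in the diff above.
trait HeadStore {
    fn set_canonical_block(&self, number: u64, hash: H256) -> Result<(), StoreError>;
    fn update_latest_block_number(&self, number: u64) -> Result<(), StoreError>;

    // The proposed helper: a single place that advances the head, so full
    // sync and forkchoice handling don't repeat the same two calls.
    fn set_head(&self, number: u64, hash: H256) -> Result<(), StoreError> {
        self.set_canonical_block(number, hash)?;
        self.update_latest_block_number(number)
    }
}
```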
Great work 🚀 🚀 🚀. Just a few minor comments.
Motivation
Holesky syncing.
Description
The following changes were applied to fix full syncing:
Finally, the peer timeout has been reduced to 5s (from 45s). The previous timeout was too high, causing nodes to get stuck when dealing with unstable peers. We should consider implementing peer penalization and scoring mechanisms instead of randomly selecting peers.
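For illustration, a minimal sketch of what such a peer scoring table might look like (everything below is hypothetical and not part of this PR):

```rust
use std::collections::HashMap;

type PeerId = [u8; 64]; // placeholder for the real peer identifier type

// Hypothetical score table: reward timely replies, penalize timeouts, and
// prefer higher-scoring peers instead of picking one at random.
#[derive(Default)]
struct PeerScores(HashMap<PeerId, i32>);

impl PeerScores {
    fn record_reply(&mut self, peer: PeerId) {
        *self.0.entry(peer).or_insert(0) += 1;
    }

    fn record_timeout(&mut self, peer: PeerId) {
        *self.0.entry(peer).or_insert(0) -= 5;
    }

    /// Pick the best-scoring known peer, if any.
    fn best_peer(&self) -> Option<PeerId> {
        self.0.iter().max_by_key(|(_, score)| **score).map(|(peer, _)| *peer)
    }
}
```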
Considerations
The current full syncing process is very sequential:
Moving forward, we could consider parallelizing the fetching process. However, it is not clear whether this would provide significant benefits, as requests are already fulfilled quickly. The real bottleneck lies in block execution and validation. Given this, I think that the potential gain in performance isn't worth the complexity of parallelizing fetching (at least for full syncing).
An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies when the batch contains an invalid header.
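As a sketch of that idea (placeholder types; the real code would run full header validation rather than just the linkage check shown here):

```rust
// Illustrative only: validate a batch of headers before requesting any of
// the corresponding block bodies from peers.
struct Header {
    number: u64,
    hash: [u8; 32],        // placeholder hash type
    parent_hash: [u8; 32],
}

/// Returns the hashes whose bodies are worth requesting: the prefix of the
/// batch up to the first header that fails validation.
fn bodies_to_request(headers: &[Header]) -> Vec<[u8; 32]> {
    let mut hashes = Vec::new();
    for (i, header) in headers.iter().enumerate() {
        if i > 0 {
            let prev = &headers[i - 1];
            // Placeholder check: real validation would cover much more than
            // parent-hash / number linkage.
            if header.parent_hash != prev.hash || header.number != prev.number + 1 {
                break; // stop at the first invalid header; don't fetch bodies past it
            }
        }
        hashes.push(header.hash);
    }
    hashes
}
```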
Closes None