fix(l1): full sync #2061

Open · wants to merge 9 commits into main

Conversation

@MarcosNicolau (Contributor) commented on Feb 24, 2025

Motivation
Holesky syncing.

Description
The following changes were applied to fix full syncing:

  1. Incremental Block Execution & Storage
    • Block bodies are now executed and stored as we fetch block headers.
    • Previously, we fetched all block headers first and then processed block bodies. This approach had a major flaw: if block processing failed at any point, we had to re-fetch all block headers from scratch.
  2. Split block body requests into chunks
    • Block body requests are now split into chunks of 128 (see the sketch after this list).
    • Previously, all block bodies were requested in a single batch, which caused the stream to crash by exceeding the maximum allowed data size.
  3. Reduced the request size limit for block headers.
    • We observed that while larger request values were supported, increasing them led to peer disconnections.
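
For illustration, a minimal, self-contained sketch of the chunking from point 2 (the function name, hash type, and GetBlockBodies framing are placeholders, not the actual ethrex peer-handler API; only the 128-per-request limit comes from this PR):

```rust
/// Illustrative stand-in; in ethrex the real hash type and request plumbing
/// live in the p2p crates.
type BlockHash = [u8; 32];

const MAX_BLOCK_BODIES_PER_REQUEST: usize = 128;

/// Splits the hashes so that each GetBlockBodies request carries at most 128
/// entries, keeping every single response under the peer's size limit.
fn split_into_body_requests(block_hashes: &[BlockHash]) -> Vec<Vec<BlockHash>> {
    block_hashes
        .chunks(MAX_BLOCK_BODIES_PER_REQUEST)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let hashes = vec![[0u8; 32]; 300];
    let requests = split_into_body_requests(&hashes);
    assert_eq!(requests.len(), 3); // 128 + 128 + 44
    println!("{} body requests", requests.len());
}
```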

Finally, the peer timeout has been reduced to 5s (from 45s). The previous timeout was too high, causing nodes to get stuck when dealing with unstable peers. We should consider implementing peer penalization and scoring mechanisms instead of randomly selecting peers.
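
As a rough sketch of how such a reply timeout can be enforced (assuming tokio; the receiver function is made up, and only the 5-second constant mirrors the change in this PR):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Mirrors the constant reduced from 45s to 5s in peer_handler.rs.
const PEER_REPLY_TIMEOUT: Duration = Duration::from_secs(5);

// Stand-in for however the reply actually arrives (e.g. an mpsc receiver);
// only the timeout wrapping is the point here.
async fn receive_reply() -> Option<Vec<u8>> {
    None
}

async fn request_with_timeout() -> Option<Vec<u8>> {
    match timeout(PEER_REPLY_TIMEOUT, receive_reply()).await {
        Ok(reply) => reply,
        // A peer that takes longer than 5s is treated as unresponsive;
        // the caller can then retry against another peer.
        Err(_elapsed) => None,
    }
}

#[tokio::main]
async fn main() {
    println!("got reply: {}", request_with_timeout().await.is_some());
}
```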

Considerations

The current full syncing process is very sequential:

fetch block headers -> fetch block bodies -> for each block: validate block header + body -> store block -> set it as the head of the chain.

Moving forward, we could consider parallelizing the fetching process. However, it is not clear whether this would provide significant benefits, as requests are already fulfilled quickly. The real bottleneck lies in block execution and validation. Given this, I think that the potential gain in performance isn't worth the complexity of parallelizing fetching (at least for full syncing).

An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies at all once we encounter an invalid header.
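
To make the flow and the suggested header check concrete, a stub-based sketch (every name is illustrative; the real logic lives in sync.rs and the blockchain crate):

```rust
// Illustrative stubs; the real types live in ethrex's storage and blockchain crates.
struct Header { number: u64, hash: [u8; 32] }
struct Body;
struct Block { header: Header, body: Body }

fn validate_header(_h: &Header) -> bool { true }                   // stub
fn fetch_body(_h: &Header) -> Body { Body }                        // stub: p2p request
fn execute_and_store(_b: &Block) -> Result<(), String> { Ok(()) }  // stub
fn set_canonical_head(_number: u64, _hash: [u8; 32]) {}            // stub

// The sequential full-sync loop described above, with the proposed tweak of
// validating each header before its body is ever requested.
fn full_sync_step(headers: Vec<Header>) -> Result<(), String> {
    for header in headers {
        // Proposed improvement: bail out before asking peers for the body
        // of a header we already know is invalid.
        if !validate_header(&header) {
            return Err(format!("invalid header {}", header.number));
        }
        let body = fetch_body(&header);
        let (number, hash) = (header.number, header.hash);
        let block = Block { header, body };
        execute_and_store(&block)?;
        // Full sync sets every successfully added block as the chain head.
        set_canonical_head(number, hash);
    }
    Ok(())
}

fn main() {
    full_sync_step(vec![Header { number: 1, hash: [0; 32] }]).unwrap();
}
```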

Closes None

@MarcosNicolau requested a review from a team as a code owner on February 24, 2025 19:26

github-actions bot commented Feb 24, 2025

| File                                                                  | Lines | Diff |
+-----------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/peer_handler.rs | 516   | +1   |
+-----------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/sync.rs         | 466   | +33  |
+-----------------------------------------------------------------------+-------+------+

Total lines added: +34
Total lines removed: 0
Total lines changed: 34

@@ -29,12 +27,21 @@ use crate::{
     snap::encodable_to_proof,
 };
 use tracing::info;
-pub const PEER_REPLY_TIMOUT: Duration = Duration::from_secs(45);
+pub const PEER_REPLY_TIMEOUT: Duration = Duration::from_secs(5);

Collaborator

wonder if instead of having a constant for all requests, it should be more fine-grained. I guess there are going to be some snap requests that are larger and need a larger timeout.
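
One possible shape for this, sketched with an invented request enum (the variants and the 15s value are placeholders, not the actual ethrex message types or tuned numbers):

```rust
use std::time::Duration;

// Hypothetical request kinds; the real message enum in ethrex differs.
enum RequestKind {
    BlockHeaders,
    BlockBodies,
    SnapAccountRange,
    SnapStorageRanges,
}

// Cheap eth requests keep the tight 5s timeout, while heavier snap requests
// are granted more time, as suggested above.
fn reply_timeout(kind: &RequestKind) -> Duration {
    match kind {
        RequestKind::BlockHeaders | RequestKind::BlockBodies => Duration::from_secs(5),
        RequestKind::SnapAccountRange | RequestKind::SnapStorageRanges => Duration::from_secs(15),
    }
}

fn main() {
    assert_eq!(reply_timeout(&RequestKind::BlockBodies), Duration::from_secs(5));
    println!("{:?}", reply_timeout(&RequestKind::SnapStorageRanges));
}
```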

Contributor Author

In full syncing, requests are typically fulfilled very quickly (well under 5 seconds). If a peer takes that long to respond, it's likely unstable, and we should consider removing it. For comparison, geth applies the same timeout to all requests (see here), but their limit is much higher: 30 seconds.

The problem is that we don’t have a great way to filter "well-behaved" peers, we just select them randomly. When we have only a small number of peers and allow timeouts of 30 or 45 seconds (as before), it significantly slows down syncing if one of them is taking too much time to respond. The real fix for this should come in #2073.

Contributor Author

An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies at all once we encounter an invalid header.

@@ -165,6 +168,8 @@ impl SyncManager {
         // This step is not parallelized
         let mut all_block_hashes = vec![];
         // Check if we have some blocks downloaded from a previous sync attempt
+        // We only check for snap syncing, as full sync will start fetching headers from the canonical block,
+        // which is updated every time a new block is added, which is done as we fetch block headers

Collaborator

comment is a bit weird, I don't really understand it. Just to clarify, canonical blocks are not updated when you add a block, but when you forkchoice update

Contributor Author

Currently, during full syncing, any newly added block is always set as the head of the chain. When we receive a fork choice update, we start from the canonical chain head, which, in full sync mode, corresponds to the most recently added block. So, for full syncing, we don't really need checkpoints, since the head is effectively updated every time a new block is added.

Maybe we are re-requesting headers or block bodies, but IMO we shouldn't worry about that for now.

I will update the comment though, I agree it is a bit confusing.

Contributor Author

Updated comment in 017c7cc.

)
.await?
}
// Full sync stores and executes blocks as it ask for the headers

Collaborator

Suggested change
-// Full sync stores and executes blocks as it ask for the headers
+// Full sync stores and executes blocks as it asks for the headers

Does Geth also start downloading bodies before it finishes downloading headers?

Collaborator

Because by only downloading block headers and verifying them (that the chain is correct) you can have an "optimistic head". This is probably out of scope of this PR though.

Contributor Author

Geth runs multiple fetchers in parallel: one process fetches headers, another validates them and pushes them into a queue, and a third fetches the bodies and executes them, inserting the blocks into a sidechain that is eventually set as the head. See here.

I agree that requesting all headers and validating them first could be better. Currently, we have no way to retrieve block headers in a particular order (by block number, for example). Since our table maps BlockHash => BlockHeader, fetching them in order isn't trivial. I guess one solution might be to redesign the table so the key is a tuple of (BlockHash, BlockNumber).
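
A rough, in-memory illustration of that idea, with the block number placed first in the key (flipping the tuple mentioned above) so that an ordered scan yields headers by block number; the actual table redesign is the open question here:

```rust
use std::collections::BTreeMap;

type BlockHash = [u8; 32];

#[derive(Debug)]
struct BlockHeader {
    number: u64,
    // ...remaining header fields elided...
}

fn main() {
    // Keying by (number, hash) instead of just the hash turns "give me the
    // headers in block-number order" into a plain ordered iteration.
    let mut headers: BTreeMap<(u64, BlockHash), BlockHeader> = BTreeMap::new();
    headers.insert((2, [2; 32]), BlockHeader { number: 2 });
    headers.insert((1, [1; 32]), BlockHeader { number: 1 });

    for ((number, _hash), header) in &headers {
        println!("header #{number}: {header:?}");
    }
}
```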

Contributor Author

Fixed the typo in 017c7cc.

return Err(error.into());
/// Requests block bodies from peers via p2p, executes and stores them
/// Returns an error if there was a problem while executing or validating the blocks
async fn download_and_run_blocks(

Collaborator

nit, what do you think of moving full-sync-specific logic to its own file?

Contributor Author

Yes, that would be a great refactor; right now we have a bunch of ifs checking the sync mode, which makes the code rather uncomfortable to read IMO.

let block = Block::new(header, body);
if let Err(error) = self.blockchain.add_block(&block) {
    warn!("Failed to add block during FullSync: {error}");
    self.invalid_ancestors.insert(hash, last_valid_hash);

Collaborator

potentially you are setting H256::default() here, does that make sense?

Contributor Author

Yes, as defined for invalid_ancestors: when the last valid hash is not known, the engine API specifies returning 0x0 or null as the latest_valid_hash. See point 3 here.

    return Err(error.into());
}
store.set_canonical_block(number, hash)?;
store.update_latest_block_number(number)?;

Collaborator

Not saying it should be done in this PR, but ideally this would call the forkchoice_updated function (or a set_head) so that we don't repeat this logic

Contributor Author

Yes I agree, this should be refactored.
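
A tiny sketch of the kind of shared helper the comment hints at (the Store here is a made-up stand-in; a real refactor would route through the existing forkchoice-update path):

```rust
use std::collections::HashMap;

// Made-up stand-in for ethrex's Store, just to show the shape of the helper.
struct Store {
    canonical: HashMap<u64, [u8; 32]>,
    latest_block_number: u64,
}

impl Store {
    // One place that advances the head, so full sync and the engine's
    // forkchoice-updated handler don't duplicate this pair of updates.
    fn set_head(&mut self, number: u64, hash: [u8; 32]) {
        self.canonical.insert(number, hash);
        self.latest_block_number = number;
    }
}

fn main() {
    let mut store = Store { canonical: HashMap::new(), latest_block_number: 0 };
    store.set_head(42, [7; 32]);
    assert_eq!(store.latest_block_number, 42);
}
```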

@mpaulucci (Collaborator)

great work 🚀 🚀 🚀 . Just a few minor comments.
