fix(l1): full sync #2061

Open · wants to merge 9 commits into main

Conversation

@MarcosNicolau (Contributor) commented on Feb 24, 2025

Motivation
Holesky syncing.

Description
The following changes were applied to fix full syncing:

  1. Incremental Block Execution & Storage
    • Block bodies are now executed and stored as we fetch block headers.
    • Previously, we fetched all block headers first and then processed block bodies. This approach had a major flaw: if block processing failed at any point, we had to re-fetch all block headers from scratch.
  2. Split block body requests into chunks
    • Block body requests are now split into chunks of 128 (see the sketch after this list).
    • Previously, all block bodies were requested in a single batch, which caused the stream to crash by exceeding the maximum allowed data size.
  3. Reduced the request size limit for block headers.
    • We observed that while larger request values were supported, increasing them led to peer disconnections.
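
For illustration, a minimal, self-contained sketch of the chunking from point 2 (the function name, hash type, and GetBlockBodies framing are placeholders, not the actual ethrex peer-handler API; only the 128-per-request limit comes from this PR):

```rust
/// Illustrative stand-in; in ethrex the real hash type and request plumbing
/// live in the p2p crates.
type BlockHash = [u8; 32];

const MAX_BLOCK_BODIES_PER_REQUEST: usize = 128;

/// Splits the hashes so that each GetBlockBodies request carries at most 128
/// entries, keeping every single response under the peer's size limit.
fn split_into_body_requests(block_hashes: &[BlockHash]) -> Vec<Vec<BlockHash>> {
    block_hashes
        .chunks(MAX_BLOCK_BODIES_PER_REQUEST)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let hashes = vec![[0u8; 32]; 300];
    let requests = split_into_body_requests(&hashes);
    assert_eq!(requests.len(), 3); // 128 + 128 + 44
    println!("{} body requests", requests.len());
}
```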

Finally, the peer timeout has been reduced to 5s (from 45s). The previous timeout was too high, causing nodes to get stuck when dealing with unstable peers. We should consider implementing peer penalization and scoring mechanisms instead of randomly selecting peers.
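
As a rough sketch of how such a reply timeout can be enforced (assuming tokio; the receiver function is made up, and only the 5-second constant mirrors the change in this PR):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Mirrors the constant reduced from 45s to 5s in peer_handler.rs.
const PEER_REPLY_TIMEOUT: Duration = Duration::from_secs(5);

// Stand-in for however the reply actually arrives (e.g. an mpsc receiver);
// only the timeout wrapping is the point here.
async fn receive_reply() -> Option<Vec<u8>> {
    None
}

async fn request_with_timeout() -> Option<Vec<u8>> {
    match timeout(PEER_REPLY_TIMEOUT, receive_reply()).await {
        Ok(reply) => reply,
        // A peer that takes longer than 5s is treated as unresponsive;
        // the caller can then retry against another peer.
        Err(_elapsed) => None,
    }
}

#[tokio::main]
async fn main() {
    println!("got reply: {}", request_with_timeout().await.is_some());
}
```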

Considerations

The current full syncing process is very sequential:

fetch block headers -> fetch block bodies -> for each block: validate block header + body -> store block -> set it as the head of the chain.

Moving forward, we could consider parallelizing the fetching process. However, it is not clear whether this would provide significant benefits, as requests are already fulfilled quickly. The real bottleneck lies in block execution and validation. Given this, I think that the potential gain in performance isn't worth the complexity of parallelizing fetching (at least for full syncing).

An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies at all once we encounter an invalid header.
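
To make the flow and the suggested header check concrete, a stub-based sketch (every name is illustrative; the real logic lives in sync.rs and the blockchain crate):

```rust
// Illustrative stubs; the real types live in ethrex's storage and blockchain crates.
struct Header { number: u64, hash: [u8; 32] }
struct Body;
struct Block { header: Header, body: Body }

fn validate_header(_h: &Header) -> bool { true }                   // stub
fn fetch_body(_h: &Header) -> Body { Body }                        // stub: p2p request
fn execute_and_store(_b: &Block) -> Result<(), String> { Ok(()) }  // stub
fn set_canonical_head(_number: u64, _hash: [u8; 32]) {}            // stub

// The sequential full-sync loop described above, with the proposed tweak of
// validating each header before its body is ever requested.
fn full_sync_step(headers: Vec<Header>) -> Result<(), String> {
    for header in headers {
        // Proposed improvement: bail out before asking peers for the body
        // of a header we already know is invalid.
        if !validate_header(&header) {
            return Err(format!("invalid header {}", header.number));
        }
        let body = fetch_body(&header);
        let (number, hash) = (header.number, header.hash);
        let block = Block { header, body };
        execute_and_store(&block)?;
        // Full sync sets every successfully added block as the chain head.
        set_canonical_head(number, hash);
    }
    Ok(())
}

fn main() {
    full_sync_step(vec![Header { number: 1, hash: [0; 32] }]).unwrap();
}
```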

Closes None

@MarcosNicolau requested a review from a team as a code owner on February 24, 2025 19:26

github-actions bot commented Feb 24, 2025

| File                                                                  | Lines | Diff |
+-----------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/peer_handler.rs | 516   | +1   |
+-----------------------------------------------------------------------+-------+------+
| /home/runner/work/ethrex/ethrex/crates/networking/p2p/sync.rs         | 466   | +33  |
+-----------------------------------------------------------------------+-------+------+

Total lines added: +34
Total lines removed: 0
Total lines changed: 34

@@ -29,12 +27,21 @@ use crate::{
     snap::encodable_to_proof,
 };
 use tracing::info;
-pub const PEER_REPLY_TIMOUT: Duration = Duration::from_secs(45);
+pub const PEER_REPLY_TIMEOUT: Duration = Duration::from_secs(5);

Collaborator

wonder if instead of having a constant for all requests, it should be more fine-grained. I guess there are going to be some snap requests that are larger and need a larger timeout.
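
One possible shape for this, sketched with an invented request enum (the variants and the 15s value are placeholders, not the actual ethrex message types or tuned numbers):

```rust
use std::time::Duration;

// Hypothetical request kinds; the real message enum in ethrex differs.
enum RequestKind {
    BlockHeaders,
    BlockBodies,
    SnapAccountRange,
    SnapStorageRanges,
}

// Cheap eth requests keep the tight 5s timeout, while heavier snap requests
// are granted more time, as suggested above.
fn reply_timeout(kind: &RequestKind) -> Duration {
    match kind {
        RequestKind::BlockHeaders | RequestKind::BlockBodies => Duration::from_secs(5),
        RequestKind::SnapAccountRange | RequestKind::SnapStorageRanges => Duration::from_secs(15),
    }
}

fn main() {
    assert_eq!(reply_timeout(&RequestKind::BlockBodies), Duration::from_secs(5));
    println!("{:?}", reply_timeout(&RequestKind::SnapStorageRanges));
}
```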

Contributor Author

In full syncing, requests are typically fulfilled very quickly (well under 5 seconds). If a peer takes that long to respond, it's likely unstable, and we should consider removing it. For comparison, geth applies the same timeout to all requests (see here), but their limit is much higher: 30 seconds.

The problem is that we don’t have a great way to filter "well-behaved" peers, we just select them randomly. When we have only a small number of peers and allow timeouts of 30 or 45 seconds (as before), it significantly slows down syncing if one of them is taking too much time to respond. The real fix for this should come in #2073.

Contributor Author

An immediate improvement would be to verify the headers before fetching the block bodies, so we avoid requesting bodies at all once we encounter an invalid header.

@@ -165,6 +168,8 @@ impl SyncManager {
         // This step is not parallelized
         let mut all_block_hashes = vec![];
         // Check if we have some blocks downloaded from a previous sync attempt
+        // We only check for snap syncing, as full sync will start fetching headers from the canonical block,
+        // which is updated every time a new block is added, which is done as we fetch block headers

Collaborator

comment is a bit weird, I don't really understand it. Just to clarify, canonical blocks are not updated when you add a block, but when you forkchoice update

Contributor Author

Currently, during full syncing, any newly added block is always set as the head of the chain. When we receive a fork choice update, we start from the canonical chain head, which, in full sync mode, corresponds to the most recently added block. So, for full syncing, we don't really need checkpoints, since the head is effectively updated every time a new block is added.

Maybe we are re-requesting headers or block bodies, but IMO we shouldn't worry about that for now.

I will update the comment though, I agree it is a bit confusing.

Contributor Author

Updated comment in 017c7cc.

)
.await?
}
// Full sync stores and executes blocks as it ask for the headers

Collaborator

Suggested change
-// Full sync stores and executes blocks as it ask for the headers
+// Full sync stores and executes blocks as it asks for the headers

Does Geth also start downloading bodies before it finishes downloading headers?

Collaborator

Because by only downloading block headers and verifying them (that the chain is correct) you can have an "optimistic head". This is probably out of scope of this PR though.

Contributor Author

Geth runs multiple fetchers in parallel: one process fetches headers, another validates them and pushes them into a queue, and a third fetches the bodies and executes them, inserting the blocks into a sidechain that is eventually set as the head. See here.

I agree that requesting all headers and validating them first could be better. Currently, we have no way to retrieve block headers in a particular order (by block number, for example). Since our table maps BlockHash => BlockHeader, fetching them in order isn't trivial. I guess one solution might be to redesign the table so the key is a tuple of (BlockHash, BlockNumber).
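
A rough, in-memory illustration of that idea, with the block number placed first in the key (flipping the tuple mentioned above) so that an ordered scan yields headers by block number; the actual table redesign is the open question here:

```rust
use std::collections::BTreeMap;

type BlockHash = [u8; 32];

#[derive(Debug)]
struct BlockHeader {
    number: u64,
    // ...remaining header fields elided...
}

fn main() {
    // Keying by (number, hash) instead of just the hash turns "give me the
    // headers in block-number order" into a plain ordered iteration.
    let mut headers: BTreeMap<(u64, BlockHash), BlockHeader> = BTreeMap::new();
    headers.insert((2, [2; 32]), BlockHeader { number: 2 });
    headers.insert((1, [1; 32]), BlockHeader { number: 1 });

    for ((number, _hash), header) in &headers {
        println!("header #{number}: {header:?}");
    }
}
```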

Contributor Author

Fixed the typo in 017c7cc.

return Err(error.into());
/// Requests block bodies from peers via p2p, executes and stores them
/// Returns an error if there was a problem while executing or validating the blocks
async fn download_and_run_blocks(

Collaborator

nit, what do you think of moving full-sync-specific logic to its own file?

Contributor Author

Yes, that would be a great refactor; right now we have a bunch of ifs checking the sync mode, which makes the code rather uncomfortable to read IMO.

let block = Block::new(header, body);
if let Err(error) = self.blockchain.add_block(&block) {
    warn!("Failed to add block during FullSync: {error}");
    self.invalid_ancestors.insert(hash, last_valid_hash);

Collaborator

potentially you are setting H256::default() here, does that make sense?

Contributor Author

Yes, as defined for invalid_ancestors: when the last valid hash is not known, the engine API specifies returning 0x0 or null as the latest_valid_hash. See point 3 here.

    return Err(error.into());
}
store.set_canonical_block(number, hash)?;
store.update_latest_block_number(number)?;

Collaborator

Not saying it should be done in this PR, but ideally this would call the forkchoice_updated function (or a set_head) so that we don't repeat this logic

Contributor Author

Yes I agree, this should be refactored.
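
A tiny sketch of the kind of shared helper the comment hints at (the Store here is a made-up stand-in; a real refactor would route through the existing forkchoice-update path):

```rust
use std::collections::HashMap;

// Made-up stand-in for ethrex's Store, just to show the shape of the helper.
struct Store {
    canonical: HashMap<u64, [u8; 32]>,
    latest_block_number: u64,
}

impl Store {
    // One place that advances the head, so full sync and the engine's
    // forkchoice-updated handler don't duplicate this pair of updates.
    fn set_head(&mut self, number: u64, hash: [u8; 32]) {
        self.canonical.insert(number, hash);
        self.latest_block_number = number;
    }
}

fn main() {
    let mut store = Store { canonical: HashMap::new(), latest_block_number: 0 };
    store.set_head(42, [7; 32]);
    assert_eq!(store.latest_block_number, 42);
}
```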

@mpaulucci (Collaborator)

great work 🚀 🚀 🚀 . Just a few minor comments.
