
[Collection] Control disk usage spikes #7058

Open
zhangchiqing opened this issue Feb 18, 2025 · 0 comments
Problem Definition

Devnet collection nodes have been experiencing significant disk usage spikes. Last week, disk usage surged from 64% to 94% approximately every 2.5 hours. The root cause of these spikes is BadgerDB compaction. To mitigate the risk of exceeding disk limits and causing downtime, we had to increase disk capacity.

[Image: disk usage metrics]

[Image: disk usage metrics]

During the investigation, I noticed that only Devnet collection nodes had such extreme spikes. Mainnet nodes and other node types also had spikes, but theirs were typically under 20%, even with similar or smaller disk sizes.

I suspect the extreme disk spikes on Devnet collection nodes are due to the cluster running two consensus algorithms at high speed:
• The Devnet collection cluster processes 3 blocks/second, while the consensus cluster builds 2 blocks/second.
• In contrast, the Mainnet collection cluster's consensus processes only 1.2 blocks/second.

Additionally, there is a strong correlation between CPU, memory, and disk usage. The metrics below show that memory usage drops to ~12 GB (~40%) after each disk usage spike, suggesting that memory consumption decreases once compaction completes. Before the next compaction, memory usage gradually climbs back, likely because new data is written to disk and cached in memory.

[Image: CPU, memory, and disk usage metrics]

Solution

We could consider pruning collection (cluster) blocks from past epochs. This may not immediately reduce overall disk usage, since BadgerDB only reclaims space for deleted keys during compaction, but it should shrink the working data set and thereby mitigate the compaction-driven disk usage spikes.

To implement pruning, we can:
1. Iterate through the key space of the stored cluster block data.
2. Extract the epoch counter from each key.
3. Use the epoch counter to decide whether to retain the entry or remove it, along with its associated cluster block data.

This pruning process can be executed via a utility command.
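
A minimal Go sketch of such a utility, assuming a hypothetical key layout in which cluster block entries start with a one-byte prefix followed by the epoch counter encoded as a big-endian uint64 (the actual prefix codes and key schema in flow-go's storage layer will differ):

```go
package pruning

import (
	"encoding/binary"
	"log"

	"github.com/dgraph-io/badger/v2"
)

// prefixClusterBlock is a hypothetical one-byte key prefix for cluster block
// data; the real prefix codes in flow-go's storage layer may differ.
const prefixClusterBlock = byte(0x20)

// PruneClusterBlocks removes all cluster block entries whose epoch counter is
// strictly below retainFromEpoch. It assumes keys are laid out as
// [prefix (1 byte) | epoch counter (8 bytes, big-endian) | ...].
func PruneClusterBlocks(db *badger.DB, retainFromEpoch uint64) error {
	prefix := []byte{prefixClusterBlock}

	// Pass 1: collect keys of prunable entries with a read-only iterator.
	var toDelete [][]byte
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // we only inspect keys, skip loading values
		it := txn.NewIterator(opts)
		defer it.Close()

		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			key := it.Item().KeyCopy(nil)
			if len(key) < 9 {
				continue // malformed key, skip
			}
			epoch := binary.BigEndian.Uint64(key[1:9])
			if epoch < retainFromEpoch {
				toDelete = append(toDelete, key)
			}
		}
		return nil
	})
	if err != nil {
		return err
	}

	// Pass 2: delete the collected keys in a write batch.
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for _, key := range toDelete {
		if err := wb.Delete(key); err != nil {
			return err
		}
	}
	if err := wb.Flush(); err != nil {
		return err
	}

	log.Printf("pruned %d cluster block entries below epoch %d", len(toDelete), retainFromEpoch)
	return nil
}
```

The two-pass approach (collect matching keys with a read-only iterator, then delete them via a WriteBatch) avoids BadgerDB's per-transaction size limit (ErrTxnTooBig) when a large number of entries needs to be removed.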

This could be part of the Protocol Data Pruning Epic.
