
[Collection] Control disk usage spikes #7058

Open
zhangchiqing opened this issue Feb 18, 2025 · 0 comments
Problem Definition

Devnet collection nodes have been experiencing significant disk usage spikes. Last week, disk usage surged from 64% to 94% approximately every 2.5 hours. The root cause of these spikes is BadgerDB compaction. To mitigate the risk of exceeding disk limits and causing downtime, we had to increase disk capacity.

[Image: disk usage metrics]

[Image: disk usage metrics]

During the investigation, I noticed that only Devnet collection nodes had such extreme spikes. Mainnet nodes and other node types also had spikes, but theirs were typically under 20%, even with similar or smaller disk sizes.

I suspect the extreme disk spikes on Devnet collection nodes are due to the cluster running two consensus algorithms at high speed:
• The Devnet collection cluster processes 3 blocks/second, while the consensus cluster builds 2 blocks/second.
• In contrast, the Mainnet collection cluster's consensus processes only 1.2 blocks/second.

Additionally, there is a strong correlation between CPU, memory, and disk usage. The metrics below show that memory usage drops to ~12 GB (~40%) after each disk usage spike, suggesting that memory consumption decreases once compaction completes. Before the next compaction, memory usage gradually climbs back, likely because new data is written to disk and cached in memory.

[Image: CPU, memory, and disk usage metrics]

Solution

We could consider pruning collection (cluster) blocks from past epochs. This may not immediately reduce overall disk usage, since BadgerDB only reclaims space for deleted keys during compaction, but it should shrink the working data set and thereby mitigate the compaction-driven disk usage spikes.

To implement pruning, we can:
1. Iterate through the key space of the stored cluster block data.
2. Extract the epoch counter from each key.
3. Use the epoch counter to decide whether to retain the entry or remove it, along with its associated cluster block data.

This pruning process can be executed via a utility command.
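
A minimal Go sketch of such a utility, assuming a hypothetical key layout in which cluster block entries start with a one-byte prefix followed by the epoch counter encoded as a big-endian uint64 (the actual prefix codes and key schema in flow-go's storage layer will differ):

```go
package pruning

import (
	"encoding/binary"
	"log"

	"github.com/dgraph-io/badger/v2"
)

// prefixClusterBlock is a hypothetical one-byte key prefix for cluster block
// data; the real prefix codes in flow-go's storage layer may differ.
const prefixClusterBlock = byte(0x20)

// PruneClusterBlocks removes all cluster block entries whose epoch counter is
// strictly below retainFromEpoch. It assumes keys are laid out as
// [prefix (1 byte) | epoch counter (8 bytes, big-endian) | ...].
func PruneClusterBlocks(db *badger.DB, retainFromEpoch uint64) error {
	prefix := []byte{prefixClusterBlock}

	// Pass 1: collect keys of prunable entries with a read-only iterator.
	var toDelete [][]byte
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // we only inspect keys, skip loading values
		it := txn.NewIterator(opts)
		defer it.Close()

		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			key := it.Item().KeyCopy(nil)
			if len(key) < 9 {
				continue // malformed key, skip
			}
			epoch := binary.BigEndian.Uint64(key[1:9])
			if epoch < retainFromEpoch {
				toDelete = append(toDelete, key)
			}
		}
		return nil
	})
	if err != nil {
		return err
	}

	// Pass 2: delete the collected keys in a write batch.
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for _, key := range toDelete {
		if err := wb.Delete(key); err != nil {
			return err
		}
	}
	if err := wb.Flush(); err != nil {
		return err
	}

	log.Printf("pruned %d cluster block entries below epoch %d", len(toDelete), retainFromEpoch)
	return nil
}
```

The two-pass approach (collect matching keys with a read-only iterator, then delete them via a WriteBatch) avoids BadgerDB's per-transaction size limit (ErrTxnTooBig) when a large number of entries needs to be removed.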

This could be part of the Protocol Data Pruning Epic.
