Problem Definition
Devnet collection nodes have been experiencing significant disk usage spikes. Last week, disk usage surged from 64% to 94% approximately every 2.5 hours. The root cause of these spikes is BadgerDB compaction. To mitigate the risk of exceeding disk limits and causing downtime, we had to increase disk capacity.
During the investigation, I noticed that only Devnet collection nodes showed such extreme spikes. Mainnet nodes and other node types also had spikes, but theirs stayed under 20%, even with similar or smaller disk sizes.
I suspect the large disk spikes on Devnet collection nodes are caused by the nodes running two consensus instances at high block rates:
• The Devnet collection cluster produces 3 blocks/second, while the consensus cluster builds 2 blocks/second.
• In contrast, the Mainnet collection cluster's consensus only produces 1.2 blocks/second.
Additionally, there is a strong correlation between CPU, memory, and disk usage. The metrics below show that memory usage drops to ~12GB (~40%) after a disk usage spike, suggesting that memory consumption decreases once compaction completes. Before the next compaction, memory usage gradually increases again, likely because new data is being written to disk and cached in memory.
Solution
We could consider pruning collection (cluster) blocks from past epochs. While this may not significantly reduce baseline disk usage, it could still help mitigate the compaction-driven disk usage spikes.
To implement pruning, we can:
1. Iterate through the key space of each collection block.
2. Extract the epoch counter from the key.
3. Use the epoch counter to determine whether to retain or remove the block along with its associated cluster block data.
This pruning process can be executed via a utility command.
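To make the steps concrete, below is a minimal Go sketch of what such a utility could look like. It assumes a hypothetical key layout in which each cluster block key starts with a one-byte prefix followed by a big-endian epoch counter; the actual flow-go key encoding, prefixes, and storage helpers would need to be checked, so treat this as a rough outline rather than the final implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"

	"github.com/dgraph-io/badger/v2"
)

// Hypothetical key layout assumed for this sketch:
//   [ 1-byte prefix | 8-byte big-endian epoch counter | ... block ID ... ]
// The real flow-go storage keys may differ.
const clusterBlockPrefix = byte(0x20) // placeholder prefix, not the real code

// pruneClusterBlocks deletes cluster block entries whose epoch counter is
// below the given threshold. Keys are collected in a read-only transaction
// first, then deleted via a WriteBatch to avoid ErrTxnTooBig.
func pruneClusterBlocks(db *badger.DB, pruneBelowEpoch uint64) error {
	prefix := []byte{clusterBlockPrefix}

	// 1. Iterate through the key space of cluster blocks (keys only).
	var toDelete [][]byte
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false
		it := txn.NewIterator(opts)
		defer it.Close()

		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			key := it.Item().KeyCopy(nil)
			if len(key) < 9 {
				continue // malformed key, skip
			}
			// 2. Extract the epoch counter from the key.
			epoch := binary.BigEndian.Uint64(key[1:9])
			// 3. Decide whether to retain or remove the block.
			if epoch < pruneBelowEpoch {
				toDelete = append(toDelete, key)
			}
		}
		return nil
	})
	if err != nil {
		return fmt.Errorf("could not scan cluster block keys: %w", err)
	}

	// Delete in batches; WriteBatch splits transactions internally.
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for _, key := range toDelete {
		if err := wb.Delete(key); err != nil {
			return fmt.Errorf("could not delete key %x: %w", key, err)
		}
	}
	return wb.Flush()
}
```

In practice this would also need to remove the associated cluster block data (payloads, indexes, etc.) keyed under other prefixes, and it could be exposed as a subcommand of the existing utility commands so operators can run it offline against a node's data directory.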
This could be part of the Protocol Data Pruning Epic.