WIP: Push wip deployment best practices #379

Merged 3 commits on Jun 24, 2024

2 changes: 1 addition & 1 deletion docs/advanced/adv-docker-configs.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 10
+sidebar_position: 12
 description: Use advanced docker-compose features to have more flexibility and power to change the default configuration.
 ---

95 changes: 95 additions & 0 deletions docs/advanced/deployment-best-practices.md
@@ -0,0 +1,95 @@
---
sidebar_position: 11
description: DV deployment best practices for running an optimal Distributed Validator setup at scale.
---

# Deployment Best Practices

The following are a selection of best practices for deploying Distributed Validator Clusters at scale on mainnet.


## Hardware Specifications

The following specifications are recommended for bare metal machines for clusters intending to run a significant number of mainnet validators:

### Minimum Specs

- A CPU with 4+ cores, favouring high clock speed over core count (>3.0GHz, or a cpubenchmark [single thread](https://www.cpubenchmark.net/singleThread.html) score of >2,500)
- 16GB of RAM
- 2TB+ free SSD disk space (for mainnet)
- 10Mb/s internet bandwidth

### Recommended Specs for extremely large clusters

- A CPU with 8+ physical cores, with clock speeds >3.5GHz
- 32GB+ RAM (depending on the EL+CL clients)
- 4TB+ NVMe storage
- 25Mb/s internet bandwidth

An NVMe storage device is **highly recommended for optimal performance**, offering nearly 10x more random read/writes per second than a standard SSD.

Inadequate hardware (low-performance virtualized servers and/or slow HDD storage) has been observed to hinder performance, indicating the necessity of provisioning adequate resources. **CPU clock speed and Disk throughput+latency are the most important factors for running a performant validator.**
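
As a rough disk check, a tool like `fio` can estimate random read/write performance. A minimal sketch, assuming `fio` is installed and run from the volume that will hold chain data:

```shell
# Spot-check 4k random read/write IOPS on the data volume (Linux, libaio engine).
# An NVMe drive should report roughly an order of magnitude more IOPS than a SATA SSD.
fio --name=randrw --rw=randrw --bs=4k --size=1G --runtime=60 --time_based \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 --group_reporting
```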

Note that the Charon client itself uses less than 1GB of RAM and minimal CPU. To optimize both performance and cost-effectiveness, it is recommended to prioritize physical over virtualized setups: physical machines typically offer greater performance and avoid the overhead associated with virtualization, contributing to improved efficiency and reliability.

When constructing a DV cluster, it is important to be conscious of whether the cluster runs across cloud providers or stays within a single provider's private network. This can impact the bandwidth and latency of the connections between nodes, as well as the egress costs of the cluster (Charon's communication with its peers is relatively light, averaging tens of kb/s in large mainnet clusters). Ideally, use bare metal machines in different locations within the same continent, spread across at least two providers; this balances redundancy and performance.

## Intra-cluster Latency

It is recommended to **keep peer ping latency below 235 milliseconds for all peers in a cluster**. Charon should report a consensus duration averaging under 1 second through its Prometheus metric `core_consensus_duration_seconds_bucket` and the associated Grafana panel titled "Consensus Duration".

In cases where latencies exceed these thresholds, efforts should be made to reduce the physical distance between nodes or optimize Internet Service Provider (ISP) settings accordingly. Ensure all nodes are connecting to one another directly rather than through a relay.

For high-scale, performance-focused deployments, an inter-peer latency of <25ms is optimal, along with an average consensus duration under 100ms.
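
As a quick way to verify both thresholds, you can ping each peer's machine and query the consensus-duration histogram from Prometheus. A minimal sketch (peer hostnames are placeholders; assumes Prometheus scrapes Charon, is reachable at `localhost:9090`, and `jq` is installed):

```shell
# Round-trip latency to each peer (hostnames are hypothetical placeholders).
for host in peer1.example.com peer2.example.com peer3.example.com; do
  ping -c 5 -q "$host"
done

# p95 consensus duration over the last 5 minutes, via the Prometheus HTTP API.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(core_consensus_duration_seconds_bucket[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'
```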

## Node Locations

For optimal performance and high availability, it is recommended to provision machines or virtual machines (VMs) within the same continent. This practice helps minimize potential latency issues, ensuring efficient communication and responsiveness. Consult maps of [undersea internet cables](https://www.submarinecablemap.com/) when selecting low-latency locations on either side of an ocean.

## Peer Connections

Charon clients can establish connections with one another in two ways: through a third, publicly accessible server known as [a relay](../charon/charon-cli-reference.md#host-a-relay), or directly, if they can reach each other. The former is known as a relay connection and the latter as a direct connection.

It is important that all nodes in a cluster be directly connected to one another: this can halve the latency between them and significantly reduces bandwidth constraints. To facilitate a direct connection between clients, open Charon's p2p port (the default is `3610`) to the Internet, or configure your router's NAT gateway to permit connections to your Charon client.
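
For example, on a Linux host using `ufw`, a minimal sketch of opening the default port (adapt to your firewall; if the machine sits behind a router, add an equivalent port-forwarding rule on its NAT gateway):

```shell
# Allow inbound TCP on Charon's default p2p port (3610).
sudo ufw allow 3610/tcp

# If Charon runs in Docker, the port must also be published to the host,
# e.g. "-p 3610:3610" with docker run, or an equivalent ports mapping in docker-compose.
```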

## Instance Independence

Each node in the cluster should have its own independent beacon node (EL+CL) and validator client, as well as its own Charon client. Sharing beacon nodes between the different nodes would potentially reduce the fault tolerance of the cluster and should be avoided.

## Placement of Charon clients

If you wish to divide a Distributed Validator node across multiple physical or virtual machines, locate the Charon client on the EL/CL machine rather than the VC machine. This setup reduces latency from Charon to the consensus layer, and keeps the public-internet-facing clients separate from the clients that hold the validator private keys. Be sure to encrypt communication between your VC and the Charon client, for example through a cloud provider's private network, a self-managed network tunnel, a VPN, or a Kubernetes [CNI](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/).
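
As one illustrative option among those listed, an SSH tunnel can encrypt the VC-to-Charon traffic. A minimal sketch, assuming Charon's validator API listens on its default port `3600` on the EL/CL machine (`user@charon-host` is a placeholder):

```shell
# Run on the VC machine: forward localhost:3600 to the Charon host over SSH.
# The validator client then targets http://localhost:3600, and traffic between
# the two machines stays encrypted inside the tunnel.
ssh -N -L 3600:localhost:3600 user@charon-host
```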

## Node Configuration

Cluster sizes that allow for Byzantine Fault Tolerance are recommended, as they are safer than clusters with only Crash Fault Tolerance (see [Cluster Size and Resilience](../charon/cluster-configuration#cluster-size-and-resilience) for reference). For example, a four-node cluster with a signing threshold of three can tolerate one Byzantine node, while a three-node cluster tolerates only a single crashed node.

## MEV-Boost Relays

MEV relays are configured at the consensus layer or MEV-Boost client level. Refer to our [guide](./quickstart-builder-api.md) to ensure all necessary configuration has been applied to your clients. As with all validators, low latency during proposal opportunities is extremely important. By default, MEV-Boost waits for all configured relays to return a bid, timing out after 950ms on any relay that has not responded. This default timeout is generally too slow for a distributed cluster: think of this time as additive to the time the cluster needs to come to consensus, and both need to happen within a 2-second window for optimal proposal broadcasting. It is likely better to list only relays located geographically near your node, so that once all relays respond (e.g. in <50ms) your cluster can move forward with the proposal.
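
As an illustration, a sketch of a mev-boost invocation listing a single nearby relay with a tightened bid timeout (the relay URL is a placeholder, and flag names should be checked against your mev-boost version):

```shell
# Query only one geographically close relay, and give up on bids after 750ms
# instead of the 950ms default.
mev-boost \
  -mainnet \
  -relays "https://0xRELAY_PUBKEY@relay-near-you.example.com" \
  -request-timeout-getheader 750
```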

## Client Diversity

Clusters should consist of a combination of your preferred consensus, execution, and validator clients. It is recommended to run multiple client types so the cluster has healthy client diversity: ideally, the failure of any single client type should affect fewer nodes than the fault tolerance of the cluster, so the validators stay online and do nothing slashable.

Remote signers can be included as well, such as Web3signer or Dirk. A diversity of private key infrastructure setups further reduces the risk of total key compromise.

Tested client combinations can be found in the [release notes](https://github.com/ObolNetwork/charon/releases) for each Charon version.

## Metrics Monitoring

At Obol Labs' request, node operators can push [standard monitoring](./obol-monitoring.md) (Prometheus) and logging (Loki) data to the Obol Labs core team's cloud infrastructure, enabling in-depth analysis of performance data and assistance with any issues that arise. Operators are also recommended to independently store their nodes' health and validator performance data over the course of the validator lifecycle.
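
Pushing Prometheus metrics is typically configured with a `remote_write` section in `prometheus.yml`. A minimal sketch (the endpoint URL and token below are placeholders; see the [monitoring guide](./obol-monitoring.md) for the actual values):

```shell
# Append a remote_write section to prometheus.yml so metrics are pushed to
# Obol Labs' infrastructure (placeholder endpoint and credentials).
cat >> prometheus.yml <<'EOF'
remote_write:
  - url: https://monitoring.example.obol.tech/write
    authorization:
      credentials: YOUR_TOKEN_HERE
EOF
```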

## Obol Splits

Leveraging [Obol Splits](../sc/introducing-obol-splits.md) smart contracts allows for non-custodial fund handling and ongoing net customer payouts. Obol Splits ensure no commingling of funds across customers and maintain full non-custodial integrity. Read more about Obol Splits [here](../faq/general.md#obol-splits).

## Deposit Process

The deposit process can be handled by an automated script, run repeatedly until a DV cluster reaches the desired number of validators.
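
For illustration, a single deposit from a standard `deposit-data*.json` file could be submitted with Foundry's `cast`. This is a sketch only: the address shown is Ethereum mainnet's deposit contract, `$RPC_URL` and `$DEPOSITOR_KEY` are placeholders, and an invalid deposit can lose funds, so verify every field before sending:

```shell
# Submit the first deposit entry from a deposit-data JSON file (32 ETH each).
DEPOSIT_CONTRACT=0x00000000219ab540356cBB839Cbe05303d7705Fa  # mainnet; verify!
ENTRY=$(jq '.[0]' deposit-data.json)
cast send "$DEPOSIT_CONTRACT" \
  "deposit(bytes,bytes,bytes,bytes32)" \
  "0x$(echo "$ENTRY" | jq -r '.pubkey')" \
  "0x$(echo "$ENTRY" | jq -r '.withdrawal_credentials')" \
  "0x$(echo "$ENTRY" | jq -r '.signature')" \
  "0x$(echo "$ENTRY" | jq -r '.deposit_data_root')" \
  --value 32ether \
  --rpc-url "$RPC_URL" --private-key "$DEPOSITOR_KEY"
```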

It is important to allow time for the validators to be activated (usually <24 hours).

Consider using batching smart contracts to reduce the gas cost of such a script, but take care when integrating them not to make an invalid deposit.


2 changes: 1 addition & 1 deletion docs/advanced/quickstart-combine.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 9
+sidebar_position: 8
 description: Combine distributed validator private key shares to recover the validator private key.
 ---

2 changes: 1 addition & 1 deletion docs/advanced/quickstart-split.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 8
+sidebar_position: 7
 description: Split existing validator keys
 ---

2 changes: 1 addition & 1 deletion docs/advanced/self-relay.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 7
+sidebar_position: 9
 description: Self-host a relay
 ---

24 changes: 12 additions & 12 deletions docs/advanced/test-command.md
@@ -1,22 +1,22 @@
 ---
-sidebar_position: 1
-description: Test networking between Charon dependant services
+sidebar_position: 5
+description: Test the performance of a candidate Distributed Validator Cluster setup.
 ---

 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';

-# Network tests
+# Test a Cluster

 :::caution
-The charon test command is in alpha state and is still in development. It can not do any harm, but there is no guarantee it is stable and working as expected.
+The `charon alpha test` command is in an alpha state and is subject to change until it is made available as `charon test` in a future version.
 :::

-Charon test command evaluates network performance and effectiveness of the machine it is running on and the targeted external service - other Charon peers, beacon node(s), etc.. It prints a performance report to the standard output and a machine-readable TOML format of the report if `output-toml` flag is set.
+Charon test commands are designed to help you evaluate the performance and readiness of your candidate cluster. They allow you to test your connection to other Charon peers, the performance of your beacon node(s), and the readiness of your validator client. They print a performance report to the standard output (which can be omitted with the `--quiet` flag) and a machine-readable TOML version of the report if the `--output-toml` flag is set.

-## Peers
+## Test your connection to peers

-Run tests towards other Charon peers to evaluate the effectiveness of a potential cluster setup. The command sets up a libp2p node, similarly to what Charon originally does. This test command **has to be running simultaneously with other peers**. After the node is up it waits for other peers to get their nodes up and running, retrying connection every 3 seconds. The libp2p node connects to relays (configurable with `p2p-relays` flag) and to other libp2p nodes via TCP. Other peer nodes are discoverable by using their ENRs. Note that for a peer to be successfully discovered, it needs to be connected to the same relay. After completion of the test suite the libp2p node stays alive (duration configurable with `keep-alive` flag) for other peers to continue testing against it. The node can be forcefully stopped as well.
+Run tests towards other Charon peers to evaluate the effectiveness of a potential cluster setup. The command sets up a libp2p node, similar to what Charon normally does. This test command **has to run simultaneously with the other peers**. After the node is up, it waits for the other peers to get their nodes up and running, retrying the connection every 3 seconds. The libp2p node connects to relays (configurable with the `p2p-relays` flag) and to other libp2p nodes via TCP. Other peer nodes are discoverable by their ENRs. Note that for a peer to be successfully discovered, it needs to be connected to the same relay. After completion of the test suite, the libp2p node stays alive (for a duration configurable with the `keep-alive` flag) so other peers can continue testing against it. The node can also be stopped forcefully.

 To be able to establish a direct connection, you have to ensure:

@@ -28,9 +28,9 @@ If all points are satisfied by you and the other peers, you should be able to es

### Pre-requisites

-- [Create ENR](../charon/charon-cli-reference#creating-an-enr-for-charon).
-- Share your ENR with other peers which will test against you.
-- Obtain the ENRs of the other peers against which you will test.
+- [Create an ENR](../charon/charon-cli-reference#creating-an-enr-for-charon).
+- Share your ENR with the other peers who will test with you.
+- Obtain the ENRs of the other peers with which you will test.

### Run

@@ -46,9 +46,9 @@ docker run -v /Users/obol/charon/.charon:/opt/charon/.charon obolnetwork/charon:
--enrs="enr:-HW4QNDXi9MzdH9Af65g20jDfelAJ0kJhclitkYYgFziYHXhRFF6JyB_CnVnimB7VxKBGBSkHbmy-Tu8BJq8JQkfptiAgmlkgnY0iXNlY3AyNTZrMaEDBVt5pk6x0A2fjth25pjLOEE9DpqCG-BCYyvutY04TZs,enr:-HW4QO2vefLueTBEUGly5hkcpL7NWdMKWx7Nuy9f7z6XZInCbFAc0IZj6bsnmj-Wi4ElS6jNa0Mge5Rkc2WGTVemas2AgmlkgnY0iXNlY3AyNTZrMaECR9SmYQ_1HRgJmNxvh_ER2Sxx78HgKKgKaOkCROYwaDY"
```

-## Beacon
+## Test your beacon node

-Run tests towards beacon node(s), evaluating the effectiveness of a potential connection of a Charon node running on the same machine.
+Run tests towards your beacon node(s) to evaluate their effectiveness for a Distributed Validator cluster.

### Pre-requisites
