Prepare for v1 release (#380)
* Remove beta software cautions

* Update charon versions, linting, typos, CLI updates

* Rename charon to Charon on all necessary places

* Capitalise Prometheus and Grafana everywhere needed

* Add Gnosis Chain

Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>

* Remove Goerli

Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>

* Revert public dashboard

---------

Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>
KaloyanTanev and OisinKyne authored Jun 24, 2024
1 parent cf3e15f commit 2dc0e2b
Showing 39 changed files with 655 additions and 590 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -10,21 +10,21 @@ This website is built using [Docusaurus 2](https://docusaurus.io/), a modern sta

### Installation

```
```shell
$ yarn
```

### Local Development

```
```shell
$ yarn start
```

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

### Build

```
```shell
$ yarn build
```

@@ -46,7 +46,7 @@ Control/Command+F the `/docs/` folder for the current version, and update all re

Now you are ready to create the next version by running the following command.

```bash
```shell
yarn run version v0.5.0
```

@@ -113,7 +113,7 @@ module.exports = {

Copy the `docs/intro.md` file to the `i18n/fr` folder:

```bash
```shell
mkdir -p i18n/fr/docusaurus-plugin-content-docs/current/

cp docs/intro.md i18n/fr/docusaurus-plugin-content-docs/current/intro.md
@@ -125,7 +125,7 @@ Translate `i18n/fr/docusaurus-plugin-content-docs/current/intro.md` in French.

Start your site on the French locale:

```bash
```shell
npm run start -- --locale fr
```

@@ -167,13 +167,13 @@ The locale dropdown now appears in your navbar:

Build your site for a specific locale:

```bash
```shell
npm run build -- --locale fr
```

Or build your site to include all the locales at once:

```bash
```shell
npm run build
```

8 changes: 4 additions & 4 deletions docs/advanced/adv-docker-configs.md
@@ -6,7 +6,7 @@ description: Use advanced docker-compose features to have more flexibility and p
# Advanced Docker Configs

:::info
This section is intended for *docker power users*, i.e., for those who are familiar with working with `docker-compose` and want to have more flexibility and power to change the default configuration.
This section is intended for *docker power users*, i.e.: for those who are familiar with working with `docker compose` and want to have more flexibility and power to change the default configuration.
:::

We use the "Multiple Compose File" feature which provides a very powerful way to override any configuration in `docker-compose.yml` without needing to modify git-checked-in files since that results in conflicts when upgrading this repo.
@@ -16,15 +16,15 @@ There are some additional compose files in [this repository](https://github.com/

- `compose-debug.yml` contains some additional containers that developers can use for debugging, like `jaeger`. To achieve this, you can run:

```
```shell
docker compose -f docker-compose.yml -f compose-debug.yml up
```

- `docker-compose.override.yml.sample` is intended to override the default configuration provided in `docker-compose.yml`. This is useful when, for example, you wish to add port mappings or want to disable a container.

- To use it, just copy the sample file to `docker-compose.override.yml` and customise it to your liking. Please create this file ONLY when you want to tweak something. This is because the default override file is empty and docker errors if you provide an empty compose file.

```
```shell
cp docker-compose.override.yml.sample docker-compose.override.yml

# Tweak docker-compose.override.yml and then run docker compose up
@@ -33,6 +33,6 @@ docker compose up

- You can also run all these compose files together. This is desirable when you want to use both the features. For example, you may want to have some debugging containers AND also want to override some defaults. To achieve this, you can run:

```
```shell
docker compose -f docker-compose.yml -f docker-compose.override.yml -f compose-debug.yml up
```
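The warning above about empty override files can be handled with a small helper. This is a minimal sketch, not part of the repository: `compose_file_args` is a hypothetical function name, and it assumes you run it from the directory that holds the three compose files named in this section. It only emits a `-f` flag for files that exist and are non-empty.

```shell
# Print "-f <file>" flags for each compose file that exists and is non-empty.
# Missing or empty files are skipped, because docker errors on an empty compose file.
compose_file_args() {
  args=""
  for f in docker-compose.yml docker-compose.override.yml compose-debug.yml; do
    if [ -s "$f" ]; then
      args="$args -f $f"
    fi
  done
  # Strip the leading space before printing.
  printf '%s\n' "${args# }"
}

# Usage (left unquoted on purpose so the flags split into separate arguments):
#   docker compose $(compose_file_args) up
```

If you never want the debug containers, drop `compose-debug.yml` from the list inside the loop.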
56 changes: 32 additions & 24 deletions docs/advanced/monitoring.md
@@ -4,40 +4,48 @@ description: Add monitoring credentials to help the Obol Team monitor the health
---
# Monitoring your Node

This comprehensive guide will assist you in effectively monitoring your Charon clusters and setting up alerts by running your own Prometheus and Grafana server. If you want to use Obol’s [public dashboard](https://grafana.monitoring.gcp.obol.tech/d/d895e47a-3c2d-46b7-9b15-8f31202681af/clusters-aggregate-view?orgId=6) instead of running your servers, refer to [this section](./obol-monitoring.md) in Obol docs that teaches you how to push Prometheus metrics to Obol.
This comprehensive guide will assist you in effectively monitoring your Charon clusters and setting up alerts by running your own Prometheus and Grafana server. If you want to use Obol’s [public dashboard](https://grafana.monitoring.gcp.obol.tech/d/d895e47a-3c2d-46b7-9b15-8f31202681af/clusters-aggregate-view?orgId=6) instead of running your servers, refer to [this section](./obol-monitoring.md) in Obol docs that teaches you how to push Prometheus metrics to Obol.

To explain quickly, Prometheus generates the metrics and Grafana visualizes them. To learn more about prometheus and Grafana, visit [here](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/). If you are using **[CDVN repository](https://github.com/ObolNetwork/charon-distributed-validator-node)** or **[CDVC repository](https://github.com/ObolNetwork/charon-distributed-validator-cluster)**, then Prometheus and Grafana are part of docker compose file and will be installed when you run `docker compose up`.
To explain quickly, Prometheus collects and stores the metrics and Grafana visualizes them. To learn more about Prometheus and Grafana, visit [here](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/). If you are using the **[CDVN repository](https://github.com/ObolNetwork/charon-distributed-validator-node)** or the **[CDVC repository](https://github.com/ObolNetwork/charon-distributed-validator-cluster)**, then Prometheus and Grafana are part of the docker compose file and will be installed when you run `docker compose up`.

The local Grafana server will have a few pre-built dashboards -
The local Grafana server will have a few pre-built dashboards:

1. Charon Overview : This is the main dashboard that provides all the relavant details about the charon node, for example, peer connectivity, duty completion, health of beacon node and downstream validator etc. To open, navigate to `charon-distributed-validator-node` directory and open the following `uri` in the browser `http://localhost:3000/d/d6qujIJVk/` .
2. Single Charon Node Dashboard (deprecated) - This is an older dashboard charon node monitoring which is now deprecated. If you are still using it, we would highly recommend to move to Charon Overview for most up to date panels.
3. Charon Log Dashboard - This dashboard can be used to query the logs emitted while running your charon node . It utilizes [Grafana Loki](https://grafana.com/oss/loki/). This dashboard is not active by default and should only be used in debug mode. Refer to [advanced docker config](./adv-docker-configs) section on how to set up a debug mode.
1. Charon Overview

| Alert Name | Description | Trouble shoot |
This is the main dashboard that provides all the relevant details about the Charon node, for example: peer connectivity, duty completion, health of beacon node and downstream validator, etc. To open, navigate to the `charon-distributed-validator-node` directory and open the following URI in the browser: `http://localhost:3000/d/d6qujIJVk/`.

2. Single Charon Node Dashboard (deprecated)

This is an older dashboard for Charon node monitoring, which is now deprecated. If you are still using it, we highly recommend moving to Charon Overview for the most up-to-date panels.

3. Charon Log Dashboard

This dashboard can be used to query the logs emitted while running your Charon node. It utilises [Grafana Loki](https://grafana.com/oss/loki/). This dashboard is not active by default and should only be used in debug mode. Refer to [advanced docker config](./adv-docker-configs) section on how to set up a debug mode.

| Alert Name | Description | Troubleshoot |
| --- | --- | --- |
| ClusterBeaconNodeDown | This alert is activated when the beacon node in a the cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster | Most likely data is corrupted. Wipe data from the point you know data was corrupted and restart beacon node so it can be synced again. |
| ClusterMissedAttestations | This alert indicates that there have been missed attestations in the cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. | This alert is triggered when 3 attestation are missed in 2 minutes. Check if threshold peers are online. If correct, check beacon node api error and downstream validator errors using Loki. Lastly, debug the docker using docker compose debug. |
| ClusterInUnknownStatus |  This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. | This is most likely a bug in charon. report to us via https://discord.com/channels/849256203614945310/970759460693901362. |
| ClusterInsufficientPeers | This alert is set to activate when the number of peers for a node in the cluster is insufficient. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz equals 4. | If you are running group cluster, check with other peers to troubleshoot issue. If you are running solo cluster, look into other machines running the DVs to find the problem, |
| ClusterFailureRate | This alert is activated when the failure rate of the cluster exceeds a certain threshold, more specifically, more than 5% failures in duties in the last 6 hours. | Check the upstream and downstream dependencies, latency and hardware issues. |
| ClusterVCMissingValidators | This alert is activated if any validators in the cluster are missing. This happens when validator client cannot load validator keys in past 10 minutes. | Find if validator keys are missing and load them. |
| ClusterHighPctFailedSyncMsgDuty | This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 10%. | This may be due to limitations in beacon node performance on nodes within the cluster. In charon, this duty is the most demanding, however an increased failure rate does not impact rewards. |
| ClusterNumConnectedRelays | This alert is activated if the number of connected relays in the cluster falls to 0. | Make sure correct relay is configured. If you still get the error report to us via https://discord.com/channels/849256203614945310/970759460693901362. |
| ClusterBeaconNodeDown | This alert is activated when the beacon node in the cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. | Most likely data is corrupted. Wipe data from the point you know data was corrupted and restart the beacon node so it can sync again. |
| ClusterMissedAttestations | This alert indicates that there have been missed attestations in the cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. | This alert is triggered when 3 attestations are missed in 2 minutes. Check if the minimum threshold of peers is online. If so, check for beacon node API errors and downstream validator errors using Loki. Lastly, debug from Docker using `docker compose debug`. |
| ClusterInUnknownStatus | This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0. | This is most likely a bug in Charon. Report to us via [Discord](https://discord.com/channels/849256203614945310/970759460693901362). |
| ClusterInsufficientPeers | This alert is set to activate when the number of peers for a node in the cluster is insufficient. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` equals 4. | If you are running group cluster, check with other peers to troubleshoot the issue. If you are running solo cluster, look into other machines running the DVs to find the problem. |
| ClusterFailureRate | This alert is activated when the failure rate of the cluster exceeds a certain threshold, more specifically - more than 5% failures in duties in the last 6 hours. | Check the upstream and downstream dependencies, latency and hardware issues. |
| ClusterVCMissingValidators | This alert is activated if any validators in the cluster are missing. This happens when validator client cannot load validator keys in the past 10 minutes. | Find if validator keys are missing and load them. |
| ClusterHighPctFailedSyncMsgDuty | This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 10%. | This may be due to limitations in beacon node performance on nodes within the cluster. In Charon, this duty is the most demanding; however, an increased failure rate does not impact rewards. |
| ClusterNumConnectedRelays | This alert is activated if the number of connected relays in the cluster falls to 0. | Make sure the correct relay is configured. If you still get the error, report to us via [Discord](https://discord.com/channels/849256203614945310/970759460693901362). |
| PeerPingLatency | This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 400ms within 2 minutes. | Make sure to set up stable and high speed internet connection. If you have geographically distributed nodes, make sure latency does not go over 250 ms. |
| ClusterBeaconNodeZeroPeers | This alert is activated when beacon node cannot find peers. | Go to docs of beacon node client to trouble shoot. Make sure if there is no port overlap and p2p discovery is open. |
| ClusterBeaconNodeZeroPeers | This alert is activated when beacon node cannot find peers. | Go to docs of beacon node client to troubleshoot. Make sure there is no port overlap and p2p discovery is open. |

## Setting Up a Contact Point

When alerts are triggered, they are routed to contact points according notification policies. For this, contact points must be added. Grafana supports several kind of contact points like email, pager duty, discord, slack, telegram etc. This document will teach how to add discord channel as contact point.
When alerts are triggered, they are routed to contact points according to notification policies. For this, contact points must be added. Grafana supports several kinds of contact points, such as email, PagerDuty, Discord, Slack, and Telegram. This document shows how to add a Discord channel as a contact point.

1. On left nav bar in grafana console, under `Alerts` section, click on contact points.
1. On left nav bar in Grafana console, under `Alerts` section, click on contact points.
2. Click on `+ Add contact point`. It will show following page. Choose Discord in the `Integration` drop down.

![AlertsContactPoint](../../static/img/AlertsContactPoint.png)

3. Give a descriptive name to the alert. Create a channel in Discord and copy its `webhook url`. Once done, click `Save contact point` to finish.
4. When the alerts are fired, it will send without filling in the variables for cluster detail. For example, cluster_hash variable is missing here `cluster_hash = {{.cluster_hash}}`. This is done to save disk space. To find the details, use `docker compose -f docker-compose.yml -f compose-debug.yml up`. More description [here](https://docs.obol.tech/docs/advanced/adv-docker-configs).
4. When alerts fire, they are sent without the cluster detail variables filled in. For example, the `cluster_hash` variable is missing here: `cluster_hash = {{.cluster_hash}}`. This is done to save disk space. To find the details, use `docker compose -f docker-compose.yml -f compose-debug.yml up`. More description [here](https://docs.obol.tech/docs/advanced/adv-docker-configs).

## Best Practices for Monitoring Charon Nodes & Cluster

@@ -77,6 +85,6 @@ When alerts are triggered, they are routed to contact points according notificat

It is also important to check:

- NTP clock skew
- Process restarts and failures (eg. through `node_systemd`)
- alert on high error and panic log counts.
- NTP clock skew;
- Process restarts and failures (eg. through `node_systemd`);
- Alert on high error and panic log counts.
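For the last item, a rough way to spot-check error and panic counts by hand is to count matching log lines. This is a minimal sketch under assumptions: `count_error_lines` is a hypothetical helper, and the pattern assumes log lines carry either an `ERRO`/`error` level marker or the word `panic`, which may not match your log format exactly.

```shell
# Count lines on stdin that look like errors or panics (case-insensitive).
# Assumption: error lines contain "erro" (e.g. "ERRO" or "error") or "panic".
count_error_lines() {
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -ciE 'erro|panic' || true
}

# Usage (the service name "charon" is an assumption about your compose file):
#   docker compose logs --since 1h charon | count_error_lines
```

A count that keeps growing between checks is a signal to look at the logs in detail; for continuous alerting, a rule in Grafana is the better fit.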
8 changes: 5 additions & 3 deletions docs/advanced/obol-monitoring.md
@@ -9,14 +9,16 @@ description: Add monitoring credentials to help the Obol Team monitor the health
This is **optional** and does not confer any special privileges within the Obol Network.
:::

You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance.
You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central Prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance.

The provided credentials need to be added in `prometheus/prometheus.yml`, replacing `$PROM_REMOTE_WRITE_TOKEN`, and will look like:
```

```shell
obol20tnt8UC...
```

The updated `prometheus/prometheus.yml` file should look like:

```yaml
global:
scrape_interval: 30s # Set the scrape interval to every 30 seconds.
@@ -44,4 +46,4 @@ scrape_configs:
- job_name: "lodestar"
static_configs:
- targets: [ "lodestar:5064" ]
```
```