Pass on errors from send-remote-write relation #261

pengwyn · 2025-02-25T01:41:44Z

Enhancement Proposal

Sometimes a bug appears in the relation data that grafana-agent sends to the upstream prometheus over the send-remote-write relation. The bug could originate in grafana-agent itself or in another charm related to it over the cos-agent relation.

When such a bug makes the alert rules invalid, the upstream prometheus validates the rules and bails out. Prometheus then applies only a subset of the rules: those that appear in the list before the first error. Prometheus also places the error message back on the relation databag, so grafana-agent has an opportunity to read it. The error looks something like:

  - relation-id: 164
    endpoint: send-remote-write
    cross-model: true
    related-endpoint: receive-remote-write
    application-data:
      event: '{"errors": "error validating /tmp/tmpum0cmrcz/validate_rule.yaml: [875:11:
        group \"openstack_b958335a_grafana_agent_host_HostDisk_alerts\", rule 3, \"HostDiskSpace\":
        could not parse expression: 1:25: parse error: unknown escape sequence U+002E
        ''.'']"}'

I'd like to propose that grafana-agent go into a blocked state when it notices that prometheus has replied with an error on the send-remote-write relation. I argue that "blocked" is the right state because it is unknown how many of the alert rules were applied upstream (it could be none). This would let operators identify the problem more quickly and would also cause CI checks to fail.
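
To make the proposal concrete, here is a minimal sketch of the check grafana-agent could run against the remote application databag. It assumes the error arrives as a JSON string under the `event` key with an `errors` field, as in the output above; the function names `remote_write_error` and `desired_status` are hypothetical and not taken from the charm's code.

```python
import json
from typing import Mapping, Optional, Tuple


def remote_write_error(app_data: Mapping[str, str]) -> Optional[str]:
    """Return the error reported in the remote app databag, or None.

    Assumption: the upstream writes a JSON object under the "event" key
    with an "errors" field, as seen in the databag dump in this issue.
    """
    raw = app_data.get("event")
    if not raw:
        return None
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        # An unparseable payload is surfaced rather than silently ignored.
        return f"unparseable event payload: {raw[:80]}"
    if not isinstance(event, dict):
        return None
    errors = event.get("errors")
    return str(errors) if errors else None


def desired_status(app_data: Mapping[str, str]) -> Tuple[str, str]:
    """Map the databag contents to a (status name, message) pair."""
    error = remote_write_error(app_data)
    if error:
        # Truncate so the message stays readable in juju status.
        return ("blocked", f"upstream rejected alert rules: {error[:120]}")
    return ("active", "")
```

In a real charm the returned pair would presumably be translated into the framework's status objects (for example `ops.BlockedStatus`) when the charm evaluates its unit status; that wiring is left out here to keep the sketch self-contained.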

As a practical example, the situation which spawned this idea was this:

1. We noticed that hardware-observer rules weren't being received by upstream.
2. Looking at the juju status output, everything appeared fine.
3. Looking at the relation databag, we saw that there was an error with backslashes in the HostDiskSpace rule. This rule appeared before the hardware-observer rules in the list.
4. We could also see a prometheus error in the databag.
5. Updating grafana-agent to include the fix "fix: escape in disk.rules" (#236) allowed the rules to be accepted.
6. The hardware-observer alert rules then appeared again.

Related to this issue, I believe the error message that prometheus returned in the databag was never cleared after prometheus received a valid set of rules. This might mean that the prometheus charm needs to clear the error once it successfully parses its list of rules.
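
One way the clearing could work on the prometheus side is to overwrite the key after every validation, not only after a failed one. The sketch below assumes the same `event`/`errors` layout seen in the databag above; `publish_validation_result` is a hypothetical helper, not the prometheus charm's actual code.

```python
import json
from typing import MutableMapping, Optional


def publish_validation_result(
    app_databag: MutableMapping[str, str], error: Optional[str]
) -> None:
    """Write the latest validation outcome to the relation databag.

    The key is overwritten on every call, so an error from an earlier
    run cannot outlive a later successful validation.
    """
    app_databag["event"] = json.dumps({"errors": error or ""})
```

With this shape, a consumer that treats an empty `errors` value as success would see the stale message disappear as soon as a valid rule set is accepted.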

lucabello · 2025-02-25T08:23:12Z

Thanks for the issue! Side note, we're currently mitigating this by linting alert rules in CI.

pengwyn added the "Status: Triage" and "Type: Enhancement" labels on Feb 25, 2025
