Pass on errors from send-remote-write relation #261

Open
pengwyn opened this issue Feb 25, 2025 · 1 comment

Enhancement Proposal

Sometimes a bug appears in the relation data that grafana-agent sends to the upstream prometheus over the send-remote-write relation. The bug may originate in grafana-agent itself or in another charm related to it over the cos-agent relation.

When such a bug makes the alert rules invalid, the upstream prometheus validates the rules and bails out. Prometheus then applies only a subset of the rules: those that come before the first error in the list. Prometheus also places the error message back on the relation databag, so grafana-agent has an opportunity to read it. The error looks something like:

  - relation-id: 164
    endpoint: send-remote-write
    cross-model: true
    related-endpoint: receive-remote-write
    application-data:
      event: '{"errors": "error validating /tmp/tmpum0cmrcz/validate_rule.yaml: [875:11:
        group \"openstack_b958335a_grafana_agent_host_HostDisk_alerts\", rule 3, \"HostDiskSpace\":
        could not parse expression: 1:25: parse error: unknown escape sequence U+002E
        ''.'']"}'

I'd like to propose that grafana-agent go into a blocked state when it notices that prometheus has replied with an error on the send-remote-write relation. I argue that "blocked" is the right state because it is unknown how many of the alert rules were applied upstream (it could be none). This would let operators identify the problem more quickly and would also cause CI checks to fail.
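To make the proposal concrete, here is a minimal sketch of the check grafana-agent could run against the application data it receives. It assumes only what the databag above shows: an `event` key holding a JSON string with an `errors` field. The function names and the status strings are hypothetical, and a real charm would map the result onto its framework's blocked/active status objects rather than return a tuple.

```python
import json
from typing import Mapping, Optional, Tuple


def remote_write_error(app_data: Mapping[str, str]) -> Optional[str]:
    """Return the error text found under the `event` key, or None.

    Hypothetical helper: the key name and payload shape are taken from
    the databag dump in this issue, not from any documented interface.
    """
    raw = app_data.get("event")
    if not raw:
        return None
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        # An unreadable payload is reported rather than silently ignored.
        return "unparseable event payload: " + raw[:80]
    if not isinstance(event, dict):
        return None
    errors = event.get("errors")
    return str(errors) if errors else None


def desired_status(app_data: Mapping[str, str]) -> Tuple[str, str]:
    """Map the databag contents to a (status name, message) pair."""
    error = remote_write_error(app_data)
    if error:
        # Truncate so the message stays readable in `juju status`.
        return ("blocked", "prometheus rejected alert rules: " + error[:100])
    return ("active", "")
```

With this shape, a CI job only needs to assert that the unit never reports "blocked" to catch an invalid rule set.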

As a practical example, the situation which spawned this idea was this:

  • We noticed that hardware-observer rules weren't being received by upstream
  • Looking at the juju status output, everything appeared fine
  • Looking at the relation databag we saw that there was an error with backslashes in the HostDiskSpace rule. This rule appeared before the hardware-observer rules in the list.
  • We could also see a prometheus error in the databag.
  • Updating grafana-agent to include the fix "fix: escape in disk.rules" (#236) allowed the rules to be accepted.
  • The hardware-observer alert rules then appeared again.

Related to this issue, I believe that the error message prometheus placed in the databag was never cleared after prometheus received a valid set of rules. The prometheus charm may need to clear the message once it successfully parses its list of rules.
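A rough sketch of what clearing could look like on the prometheus side, assuming the charm writes its validation result under the same `event` key seen in the databag above. I don't know how the prometheus charm actually writes this key, so the function name is hypothetical and a plain dict stands in for the relation databag.

```python
import json
from typing import MutableMapping, Optional


def publish_validation_result(
    databag: MutableMapping[str, str], error: Optional[str]
) -> None:
    """Write the latest validation outcome, dropping any stale error.

    Hypothetical helper: `databag` stands in for the application
    databag on the receive-remote-write side.
    """
    if error:
        databag["event"] = json.dumps({"errors": error})
    else:
        # Validation succeeded: remove the previous error so the
        # remote side no longer sees an outdated failure.
        databag.pop("event", None)
```

Removing the key is one option; writing an empty `errors` value would work as well, as long as the reader treats both as "no error".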

@lucabello
Contributor

Thanks for the issue! Side note, we're currently mitigating this by linting alert rules in CI.
