Enhancement Proposal
Sometimes a bug can appear in the relation data that grafana-agent sends to the upstream prometheus via the send-remote-write relation. The bug may originate in grafana-agent itself or in another charm related to it over the cos-agent relation.
When such a bug makes the alert rules invalid, the upstream prometheus validates the rules and bails out at the first error. It applies only a subset of the rules: those that precede the first error in the list. Prometheus also writes the error message back to the relation databag, so grafana-agent has an opportunity to read it. This error looks something like:
I'd like to propose that grafana-agent go into a blocked state when it notices that prometheus has replied with an error on the send-remote-write relation. I argue that "blocked" is the right state because it is unknown how many of the alert rules were applied upstream (it could be none). This would let operators identify the problem more quickly and would cause CI checks to fail.
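The proposed check could look roughly like the sketch below. It is only an illustration under stated assumptions: the databag key `ERROR_KEY` and the function names are hypothetical (the issue does not say under which key prometheus stores the error), and the databags are modelled as plain dictionaries rather than real relation data.

```python
from typing import Iterable, List, Mapping, Tuple

# Hypothetical key name: the real key prometheus uses for the error
# on the send-remote-write databag is not stated in this issue.
ERROR_KEY = "event"


def find_upstream_errors(
    databags: Iterable[Mapping[str, str]], key: str = ERROR_KEY
) -> List[str]:
    """Collect non-empty error messages from the remote side's databags."""
    errors = []
    for data in databags:
        message = data.get(key, "").strip()
        if message:
            errors.append(message)
    return errors


def desired_status(databags: Iterable[Mapping[str, str]]) -> Tuple[str, str]:
    """Return (status name, message) for the unit.

    Any error reported upstream maps to "blocked", because it is unknown
    how many alert rules were actually applied.
    """
    errors = find_upstream_errors(databags)
    if errors:
        return ("blocked", f"prometheus rejected alert rules: {errors[0]}")
    return ("active", "")
```

In a real charm the returned pair would presumably be translated into the framework's status objects (for example `ops.BlockedStatus`) during status evaluation. A blocked status would then be visible in the juju status output and to CI.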
As a practical example, the situation which spawned this idea was this:
1. We noticed that hardware-observer rules weren't being received by upstream.
2. Looking at the juju status output, everything appeared fine.
3. Looking at the relation databag, we saw that there was an error with backslashes in the HostDiskSpace rule. This rule appeared before the hardware-observer rules in the list.
4. We could also see a prometheus error in the databag.
5. The hardware-observer alert rules then appeared again.
Related to this issue, I believe that the error message returned by prometheus in the databag was never cleared after prometheus received a valid set of rules. This might mean that the prometheus charm needs to clear the error whenever it successfully parses its list of rules.
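A minimal sketch of the clearing behaviour suggested above, from the prometheus side. The key name `ERROR_KEY` and the function `publish_validation_result` are hypothetical, and the databag is modelled as an ordinary mutable mapping. How the prometheus charm actually stores the error is not described in this issue.

```python
from typing import MutableMapping, Optional

# Hypothetical key name, chosen for illustration only.
ERROR_KEY = "event"


def publish_validation_result(
    app_databag: MutableMapping[str, str],
    error: Optional[str],
    key: str = ERROR_KEY,
) -> None:
    """Write the latest validation outcome to the databag.

    On failure the error message is stored; on success any stale error
    from a previous, invalid set of rules is removed.
    """
    if error:
        app_databag[key] = error
    else:
        # Remove the key if present; do nothing if it was never set.
        app_databag.pop(key, None)
```

Clearing the key on every successful validation means the consumer never sees an error that refers to an older set of rules.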