Using Metrics-Based Custom Thresholds in Prometheus Alerting Rules

April 04, 2024 by Julius Volz

Often you will want to have different alerting thresholds for different label combinations. For example, you may want to have individual thresholds for the error rates of each HTTP path and method combination, rather than a single hardcoded threshold value for all of them. Luckily, Prometheus allows you to write generic alerting rules that you can parametrize based on a joined-in metric with compatible labels. Let's have a look at how this works!

The problem: Single, fixed thresholds

Imagine that you have an http_requests_total metric that counts the total number of HTTP requests your service handles, broken out by path, method, and status labels:

http_requests_total{path="/foo",method="GET",status="200"} 77443
http_requests_total{path="/foo",method="GET",status="500"} 65934
http_requests_total{path="/foo",method="POST",status="200"} 123
http_requests_total{path="/foo",method="POST",status="500"} 0
http_requests_total{path="/bar",method="GET",status="200"} 123
http_requests_total{path="/bar",method="GET",status="500"} 0
http_requests_total{path="/bar",method="POST",status="200"} 34
http_requests_total{path="/bar",method="POST",status="500"} 28923

NOTE: In reality, your time series will likely also be split up by target labels such as job and instance, but for simplicity we'll pretend that those don't exist here.

Now you want to set up an alert that fires when the rate of requests with a response status code of 500 exceeds a certain threshold.

In the simplest case, you could write an alerting rule with a fixed (hard-coded) threshold that determines when the alert should fire. For example, the following expression would produce alerts for any label combination where the 500 error rate exceeds 5 requests per second:

rate(http_requests_total{status="500"}[5m]) > 5
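Wrapped into a Prometheus rule file, such an expression might look like this (the alert name and the "for" duration here are illustrative, not prescriptive):

groups:
- name: HTTP errors
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 5
    for: 5m
    annotations:
      summary: "High 500 rate for {{ $labels.method }} {{ $labels.path }}"

Since the expression can return multiple series, Prometheus creates one alert per matching label combination.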

This works great as long as "5 per second" is a reasonable threshold for each path and method combination. However, in reality you might want to set different thresholds for each label combination, depending on the individual importance and characteristics of each path and method in question.

You could write a separate alerting rule for each combination, but that would be cumbersome and hard to maintain.
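For illustration, the per-combination approach would mean hard-coding one expression per path and method, something like:

rate(http_requests_total{path="/foo",method="GET",status="500"}[5m]) > 10
rate(http_requests_total{path="/foo",method="POST",status="500"}[5m]) > 2
rate(http_requests_total{path="/bar",method="GET",status="500"}[5m]) > 3
rate(http_requests_total{path="/bar",method="POST",status="500"}[5m]) > 1

Every new path or method would require editing your rule files again.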

The solution: Custom thresholds based on time series

Instead, you can write a single alerting rule and parametrize it for each path and method combination. For that, you will first need to have another metric (in this example, http_error_rate_threshold) that represents an individual custom threshold for each label combination:

http_error_rate_threshold{path="/foo",method="GET"} 10  # 10 per second.
http_error_rate_threshold{path="/foo",method="POST"} 2  # 2 per second.
http_error_rate_threshold{path="/bar",method="GET"} 3   # 3 per second.
http_error_rate_threshold{path="/bar",method="POST"} 1  # 1 per second.

Then you can write a single generic alerting rule that joins in this metric to determine when the alert should fire:

  rate(http_requests_total{status="500"}[5m])
> on(path, method)
  http_error_rate_threshold

This expression compares the individual error rates to the matching custom threshold for each label combination. If a given rate exceeds its threshold, the alert will fire for that label set.

NOTE: The on(path, method) modifier ensures that the matching only happens based on the path and method labels, and not on the additional status label that is only present on the error rates and not on the threshold series. Depending on your data and alerting expression, you will have to choose the right set of labels to match on. If multiple series from the first operand can match a single series from your threshold metric (many-to-one matching), you will also need to add a group_left() modifier after the on(...).
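For example, if your request series also carried job and instance labels (as in the earlier note), each threshold series would match several rate series, and you would write the comparison with an empty group_left() to allow that many-to-one match:

  rate(http_requests_total{status="500"}[5m])
> on(path, method) group_left()
  http_error_rate_threshold

The group_left() tells Prometheus that the left-hand side is the "many" side of the match.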

Adding a default fallback threshold for missing label combinations

In some scenarios you may actually want to fall back to a hardcoded default threshold if no custom value is defined for a specific label combination in the http_error_rate_threshold metric. You can achieve this by using the or operator to join in a default threshold value for missing label combinations:

  rate(http_requests_total{status="500"}[5m])
> on(path, method)
  (
      http_error_rate_threshold
    or on(path, method) # Join in default threshold.
      http_requests_total{status="500"} * 0 + 5
  )

As the second argument to the or operator, we choose a set of time series that will always contain all desired label combinations (in this case, http_requests_total{status="500"} does the job), and then we force the value of each series to the default threshold of 5 by multiplying it by 0 and adding 5.
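To see why this works, note that the subexpression http_requests_total{status="500"} * 0 + 5 keeps the labels of each input series (only the metric name is dropped by the arithmetic) while pinning every sample value to 5:

{path="/foo",method="GET",status="500"} 5
{path="/foo",method="POST",status="500"} 5
{path="/bar",method="GET",status="500"} 5
{path="/bar",method="POST",status="500"} 5

The or operator then fills in one of these series for every path and method combination that has no explicit threshold. The leftover status label is harmless, since the outer comparison only matches on path and method.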

Generating the custom threshold time series

One question remains: How do you actually get the http_error_rate_threshold metric with your custom threshold combinations? There is no magic involved here, and it depends on your needs. You generally have a few options:

  • You could expose the custom thresholds from the same process that exposes the main metric you are alerting on.
  • You could write a separate exporter that exposes the custom thresholds (potentially even by consulting a database and changing the thresholds over time).
  • You could use a set of recording rules to record your custom thresholds for each desired label combination.

The choice depends on your specific use case and how you want to manage the thresholds.

As an example, here is how you could generate the http_error_rate_threshold metric using a set of recording rules:

groups:
- name: HTTP error rate thresholds
  rules:
  - record: http_error_rate_threshold
    labels:
      path: "/foo"
      method: "GET"
    expr: 10
  - record: http_error_rate_threshold
    labels:
      path: "/foo"
      method: "POST"
    expr: 2
  - record: http_error_rate_threshold
    labels:
      path: "/bar"
      method: "GET"
    expr: 3

Note that the above rule set does not define a threshold for the {path="/bar", method="POST"} label combination. This will cause the alerting rule to fall back to the default threshold of 5 for this combination.

Conclusion

While hard-coded thresholds may be sufficient for many alerting rules, having the ability to parametrize a single generic alerting rule for many different label combinations can be very useful. You now know how to use a correlated metric to dynamically set individual thresholds for different label combinations, and how you can fall back to a default threshold if no custom threshold is defined for a given label combination.



Tags: alerting, thresholds, promql