Errors, Successes, Totals: Which Metrics Should I Expose to Prometheus?

September 19, 2023 by Julius Volz

If your service is handling operations that can either succeed or fail, you may be wondering how to best expose them as Prometheus metrics. Should you expose errors and successes or errors and totals? Should you have a single metric split up by a label or multiple separate metrics? Let's have a look at these options and when you would use which.

First: Consider the use case!

To help answer the introductory question, let's first consider how these metrics would be used. For service monitoring, you will usually want to know the following:

  • The total rate of operations.
  • The error rate.
  • The error rate ratio (the error rate divided by the total rate of operations).

Note that querying success rates individually is not actually a common use case!

For each of these use cases, how (in)convenient would your PromQL queries become for the different metric designs?

Let's see what happens when we expose errors and successes. Consider two counters:

  • errors_total: The total number of errors.
  • successes_total: The total number of successes.

Looking at the desired use cases above, the PromQL queries would become cumbersome. For example, to get the total rate of operations, you would have to manually add the errors to the successes:

  rate(errors_total[5m])
+
  rate(successes_total[5m])

And to get the error rate ratio, you would have to write:

  rate(errors_total[5m])
/
  (rate(errors_total[5m]) + rate(successes_total[5m]))

Only querying the error rate is straightforward:

rate(errors_total[5m])

The PromQL queries become much simpler if we expose errors and totals instead:

  • errors_total: The total number of errors.
  • operations_total: The total number of operations.

Now, the total rate of operations is simply:

rate(operations_total[5m])

The error rate is:

rate(errors_total[5m])

And the error rate ratio is:

  rate(errors_total[5m])
/
  rate(operations_total[5m])
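
To make the arithmetic behind these three queries concrete, here is a small Python sketch that computes them from two samples of each counter. The sample values and the simple_rate() helper are made up for illustration, and the helper is a deliberate simplification: PromQL's rate() also handles counter resets and extrapolates to the window boundaries, which this sketch does not.

```python
def simple_rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase between two counter samples.

    Simplification: unlike PromQL's rate(), this ignores counter resets
    and does no extrapolation to the window boundaries.
    """
    return (last - first) / window_seconds


# Made-up counter values at the start and end of a 5-minute (300 s) window.
WINDOW = 300.0
operations_samples = (1000.0, 1600.0)  # operations_total
errors_samples = (10.0, 40.0)          # errors_total

operation_rate = simple_rate(*operations_samples, WINDOW)  # operations per second
error_rate = simple_rate(*errors_samples, WINDOW)          # errors per second
error_ratio = error_rate / operation_rate                  # fraction of operations that failed

print(operation_rate, error_rate, error_ratio)
```

Each result comes from one counter or from a single division of two counters, which is what keeps the corresponding PromQL queries so short.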

Thus if you only have binary outcomes (successes and errors without further differentiation), exposing errors and totals is usually the best choice.
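
On the instrumentation side, the errors-and-totals design could look roughly like the following Python sketch. The Counter class here is a hypothetical stand-in written with the standard library only, not the API of any real Prometheus client library, and instrumented(), ok(), and broken() are illustrative names. The point is just the pattern: count every operation, and additionally count the failures.

```python
import threading


class Counter:
    """Hypothetical stand-in for a client library counter (not a real API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value += amount

    @property
    def value(self) -> float:
        with self._lock:
            return self._value


operations_total = Counter("operations_total")
errors_total = Counter("errors_total")


def instrumented(operation):
    """Run operation(), counting every attempt and, separately, every failure."""
    operations_total.inc()  # incremented regardless of the outcome
    try:
        return operation()
    except Exception:
        errors_total.inc()  # incremented only when the operation fails
        raise


def ok():
    return "done"


def broken():
    raise RuntimeError("boom")


for op in (ok, broken, ok):
    try:
        instrumented(op)
    except RuntimeError:
        pass  # the failure has already been counted above

print(operations_total.value, errors_total.value)
```

Because both counters exist from startup, both series are present even before the first error happens.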

Exposing a single metric with a label

But what about exposing a single metric with an outcome (or similar) label? This depends on whether you only have clear binary results ("success" and "error") or a wide variety of possible outcomes, like HTTP response status codes.

For binary outcomes, a single metric with a label might look like this:

  • operations_total{outcome="error"}: The total number of errors.
  • operations_total{outcome="success"}: The total number of successes.

Calculating the total rate of operations now requires aggregating over the outcome label:

sum without(outcome) (rate(operations_total[5m]))

Calculating the error rate remains straightforward:

rate(operations_total{outcome="error"}[5m])

But calculating the error rate ratio becomes a bit more involved. You have to aggregate over the outcome label on both sides of the operation to make the label matching work (alternatively, you could use an ignoring(outcome) modifier on the binary operator):

  sum without(outcome) (rate(operations_total{outcome="error"}[5m]))
/
  sum without(outcome) (rate(operations_total[5m]))

This is clearly more cumbersome than the same query with two separate metric names. Metrics with labels also introduce the danger of missing time series, which we discussed in our previous blog post. In this example, the outcome="error" series may not exist until the first error has been counted, so the query above would return an empty result instead of a 0-valued series if no errors have happened yet.
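
The label matching and the missing-series problem can both be illustrated with a toy Python model of instant vectors. This is only a sketch under simplifying assumptions: it models one-to-one matching on identical label sets and nothing else, the job="api" label and all values are made up, and the helper names (labels, select, sum_without, divide) are hypothetical rather than part of any Prometheus library.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]  # one sample per label set, like an instant vector


def labels(**kv: str) -> Labels:
    return frozenset(kv.items())


def select(vector: Vector, **want: str) -> Vector:
    """Keep only samples whose labels include every wanted key=value pair."""
    wanted = labels(**want)
    return {lset: v for lset, v in vector.items() if wanted <= lset}


def sum_without(vector: Vector, label: str) -> Vector:
    """Drop `label` and add up the samples that become identical."""
    out: Vector = {}
    for lset, value in vector.items():
        key = frozenset((k, v) for k, v in lset if k != label)
        out[key] = out.get(key, 0.0) + value
    return out


def divide(left: Vector, right: Vector) -> Vector:
    """One-to-one matching: only samples with identical label sets pair up."""
    return {lset: v / right[lset] for lset, v in left.items() if lset in right}


# Made-up per-second rates, as if rate(operations_total[5m]) had been applied.
rates = {
    labels(job="api", outcome="success"): 1.5,
    labels(job="api", outcome="error"): 0.5,
}
totals = sum_without(rates, "outcome")
errors = select(rates, outcome="error")

naive = divide(errors, totals)                          # outcome label only on the left
ratio = divide(sum_without(errors, "outcome"), totals)  # label sets now agree

# No errors yet: the outcome="error" series does not exist at all.
no_errors = {labels(job="api", outcome="success"): 2.0}
missing = divide(
    sum_without(select(no_errors, outcome="error"), "outcome"),
    sum_without(no_errors, "outcome"),
)

# Same situation, but with the error series already present at 0.
initialized = dict(no_errors)
initialized[labels(job="api", outcome="error")] = 0.0
present = divide(
    sum_without(select(initialized, outcome="error"), "outcome"),
    sum_without(initialized, "outcome"),
)

print(naive, ratio, missing, present)
```

In this model, the unaggregated division finds no matching pair, the aggregated one yields a ratio, and the "no errors yet" case only produces a 0-valued result when the error series already exists.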

However, a single metric becomes the best choice when you don't just have a binary success vs. error outcome. For example, HTTP requests can have a wide range of response status codes, and you wouldn't want to have a separate metric for each of them:

  • http_requests_200_total: The total number of HTTP requests with a 200 response status code.
  • http_requests_404_total: The total number of HTTP requests with a 404 response status code.
  • http_requests_503_total: The total number of HTTP requests with a 503 response status code.
  • ...and so on.

Not only would this create a lot of different metric names that you would have to know about and query individually, but it would also make it really hard to query for the total rate:

  rate(http_requests_200_total[5m])
+
  rate(http_requests_404_total[5m])
+
  rate(http_requests_503_total[5m])
+
  # ...all the other possible status codes

This is clearly not a good idea. Instead, you would want to expose a single metric with a status label:

  • http_requests_total{status="200"}: The total number of HTTP requests with a 200 response status code.
  • http_requests_total{status="404"}: The total number of HTTP requests with a 404 response status code.
  • http_requests_total{status="503"}: The total number of HTTP requests with a 503 response status code.
  • ...and so on.

Then you can query for the total rate of HTTP requests like this:

sum without(status) (rate(http_requests_total[5m]))

And the 5xx error rate (one series per 5xx status code) like this:

rate(http_requests_total{status=~"5.."}[5m])

And the total 5xx error rate ratio like this:

  sum without(status) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum without(status) (rate(http_requests_total[5m]))

So when you have a wide variety of possible outcomes, exposing a single metric with a label is usually the best choice.
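
As a rough Python illustration of what the status label buys you, the sketch below selects all 5xx status codes with a single regular expression and computes their share of all requests. The counts are made up, and matching() is a hypothetical helper; it uses re.fullmatch() because PromQL's =~ matcher is anchored at both ends of the label value.

```python
import re
from typing import Dict

# Made-up request counts per status code over some window, standing in for
# increases of http_requests_total{status="..."}.
requests_by_status: Dict[str, int] = {"200": 90, "404": 5, "500": 2, "503": 3}


def matching(counts: Dict[str, int], pattern: str) -> Dict[str, int]:
    """Select label values the way =~ does: the regex must match the whole value."""
    return {s: n for s, n in counts.items() if re.fullmatch(pattern, s)}


total = sum(requests_by_status.values())
server_errors = matching(requests_by_status, "5..")
error_ratio = sum(server_errors.values()) / total

print(total, server_errors, error_ratio)
```

A status code that was never anticipated is picked up by the same pattern without any new metric name, which is the main advantage over one metric per code.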

Conclusion

In most cases, you should expose errors and totals as separate metrics. This makes it easy to query for the total rate of operations, the error rate, and the error rate ratio. But if you have more possible outcomes than just "success" and "error" (like for HTTP requests), you probably want to expose a single metric with a label instead.




Tags: prometheus, metrics, failures, errors, successes, totals, promql