Metric Types in Prometheus and PromQL

Prometheus has the concept of different metric types: counters, gauges, histograms, and summaries. If you've ever wondered what these terms were about, this blog post is for you! We'll look at the meaning of each metric type, how to use it when instrumenting application code, how the type is exposed to Prometheus over HTTP, and what to watch out for when using metrics of different types in PromQL.

Overview

Prometheus differentiates between four types of metrics that are used to track aspects of a system. These types are:

  • Gauge: A gauge is for tracking current tallies, or things that can naturally go up or down, like memory usage, queue lengths, in-flight requests, or current CPU usage.
  • Counter: A counter is for tracking cumulative totals over a number of events or quantities like the total number of HTTP requests or the total number of seconds spent handling requests. Counters only decrease in value when the process that exposes them restarts, in which case they get reset to 0.
  • Histogram: A histogram is used to track the distribution of a set of observed values (like request latencies) across a set of buckets. It also tracks the total number of observed values, as well as the cumulative sum of the observed values.
  • Summary: A summary is used to track the distribution of a set of observed values (like request latencies) as a set of quantiles / percentiles. It also tracks the total number of observed values, as well as the cumulative sum of the observed values.

Below, we will look at how these metric types manifest themselves across the entire Prometheus pipeline, from instrumentation through exposition and processing to PromQL:

[Figure: Metric type handling across the instrumentation, exposition, processing, and querying stages]

Metric types in instrumentation

Metric types are most visible when you are instrumenting a service using one of the Prometheus client libraries, as each metric type's API object offers you methods specific to that type. Using the Prometheus Go client library as an example, let's take a look at how these types differ in usage. We will omit steps here that work the same across metric types, such as registering the metrics to be served via HTTP. We will also focus on metrics without any additional explicit label dimensions (like splitting requests up by endpoint, method, etc.).

Gauge metrics

Creating a gauge metric is as simple as giving it a name and documentation string:

queueLength := prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "queue_length",
	Help: "The number of items in the queue.",
})

Gauge metrics are allowed to go up or down, and they can also be set to explicit values, so the gauge metric type exposes methods for that:

// Use Set() when you know the absolute value from some other source.
queueLength.Set(0)

// Use these methods when your code directly observes the increase or decrease of something, such as adding an item to a queue.
queueLength.Inc() // Increment by 1.
queueLength.Dec() // Decrement by 1.
queueLength.Add(23)
queueLength.Sub(42)

Since gauges are frequently used to expose Unix timestamps as sample values, there is also a convenience method to set a gauge to the current timestamp:

demoTimestamp.SetToCurrentTime()

Counter metrics

Creating a counter metric is similar to a gauge:

totalRequests := prometheus.NewCounter(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "The total number of handled HTTP requests.",
})

But unlike gauges, counters may only increase over time, so you cannot set them to an arbitrary absolute value or decrease them (nor should you ever want to):

totalRequests.Inc()
totalRequests.Add(23)

Counters do reset to 0 when the service process restarts, but this is fine, as functions like rate() know how to handle this.

Histogram metrics

Creating a histogram metric is a bit more involved: you need to configure how many buckets you want to categorize observations into, along with the upper boundary of each bucket. Prometheus histograms are cumulative, meaning that each bucket counts all observations less than or equal to its upper bound, so it also includes the counts of all lower buckets. Put another way, every bucket's lower boundary is effectively zero, so you only need to configure upper bounds:

requestDurations := prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "A histogram of the HTTP request durations in seconds.",
	// Bucket configuration: the first bucket counts all requests finishing
	// within 0.05 seconds, the last one all requests finishing within 10 seconds.
	Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
})

Since enumerating all buckets can be tedious, there are helper functions like prometheus.LinearBuckets() and prometheus.ExponentialBuckets() that can help you generate linear or exponential bucketing schemes.

A histogram automatically categorizes and counts the observed values for you, so it only exposes an Observe() method, which you call whenever your code handles something worth tracking. For example, if you just handled an HTTP request that took 0.42 seconds, you would do:

requestDurations.Observe(0.42)

The Go client library provides helper functions for timing durations and then automatically observing them into a histogram:

timer := prometheus.NewTimer(requestDurations)
// [...Handle the request...]
timer.ObserveDuration()

Summary metrics

Creating and using a summary is similar to a histogram, except that you will need to specify which quantiles you want to track and don't have to deal with buckets. For example, if you wanted to track the 50th, 90th, and 99th percentiles across HTTP request latencies, you would create a summary like this:

requestDurations := prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "http_request_duration_seconds",
	Help: "A summary of the HTTP request durations in seconds.",
	Objectives: map[float64]float64{
		0.5: 0.05,   // 50th percentile with a max. absolute error of 0.05.
		0.9: 0.01,   // 90th percentile with a max. absolute error of 0.01.
		0.99: 0.001, // 99th percentile with a max. absolute error of 0.001.
	},
})

After creation, tracking durations works exactly like for histograms:

requestDurations.Observe(0.42)

A note on histograms vs. summaries

It is important to note that while histogram buckets can be aggregated across dimensions (such as endpoint, HTTP method, etc.), this is not statistically valid for summary quantiles! See also the best practices documentation about histogram and summary metrics to learn more about the tradeoffs of histograms and summaries, and how to choose histogram bucket boundaries correctly.
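For example, to compute an aggregated 90th percentile over all label dimensions of a histogram, you can sum the per-dimension bucket rates by their le label before estimating the quantile (the 5-minute range window here is just an illustrative choice):

```promql
histogram_quantile(0.9, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
```

Trying to do the equivalent with a summary, e.g. averaging its pre-computed quantile series with avg(), produces a statistically meaningless number.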

Exposition

Now let's see how each of our examples above would be serialized into the HTTP exposition format, which Prometheus sees when it scrapes a target.

Our gauge (with no explicit user-specified labels) gets rendered out as a single time series:

# HELP queue_length The number of items in the queue.
# TYPE queue_length gauge
queue_length 42

Our HTTP request counter also gets sent as a single series:

# HELP http_requests_total The total number of handled HTTP requests.
# TYPE http_requests_total counter
http_requests_total 7734

As you can see, the documentation string and metric type also make it into the exposition format.

However, histograms and summaries track more complex information that can't be expressed as a single series. Since Prometheus only understands individual time series when scraping, these two metric types need to be mapped into a set of time series that together encode all relevant information.

Our histogram example would end up being rendered like this:

# HELP http_request_duration_seconds A histogram of the HTTP request durations in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 4599
http_request_duration_seconds_bucket{le="0.1"} 24128
http_request_duration_seconds_bucket{le="0.25"} 45311
http_request_duration_seconds_bucket{le="0.5"} 59983
http_request_duration_seconds_bucket{le="1"} 60345
http_request_duration_seconds_bucket{le="2.5"} 114003
http_request_duration_seconds_bucket{le="5"} 201325
http_request_duration_seconds_bucket{le="10"} 219493
http_request_duration_seconds_bucket{le="+Inf"} 227420
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 227420

Each configured bucket ends up as one counter time series with a _bucket suffix, indicating the upper bound of that bucket using an le ("less-than-or-equal") label. An implicit bucket with the upper bound +Inf is also exposed to catch requests that took longer than the largest configured bucket boundary. Lastly, the exposition format includes the cumulative sum and count over all observations using the suffixes _sum and _count. Each of these time series is conceptually a counter (individual values that can only go up), although they were created as part of a histogram in the instrumentation library.

Our summary is exposed much like the histogram, except that a quantile label is used to indicate per-quantile series, and those series don't have a suffix that extends the metric name:

# HELP http_request_duration_seconds A summary of the HTTP request durations in seconds.
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.9"} 0.564
http_request_duration_seconds{quantile="0.99"} 2.372
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 227420

How the Prometheus server processes metric types

For the longest time, the Prometheus server simply threw away any metric type (and documentation string) metadata when it scraped time series from a target. Starting with Prometheus 2.4.0, the server now stores this metadata in memory for each scrape target, and external users can query the metadata using an API endpoint. This is useful for user interfaces such as PromLens or Grafana that want to show additional help to users while building queries.

Other than tracking and exposing the metadata in this way, the Prometheus server itself does not do any further processing of this metadata yet. There is a proposal for propagating metric types to remote storage systems via Prometheus's remote_write protocol, but this has not been implemented yet.

Metric types in PromQL

As mentioned in our previous post on The Anatomy of a PromQL Query, PromQL surprisingly has no direct concept or knowledge of the four different metric types at all. All it knows on a fundamental level are flat time series with metric names, label sets, and samples (timestamp / value pairs) attached to them. However, PromQL includes some functions that expect the input series to look and behave like a specific metric type.

The most prominent examples of these metric-type-aware functions are:

  • rate() / irate() / increase() / resets(): These functions only work properly for counter metrics, as they interpret any decrease in value over time as a counter reset. In the case of the rate()-style functions, these resets will be approximated away and only positive rates are computed, leading to totally incorrect results when using these functions on gauge metrics.
  • delta() / idelta() / deriv() / predict_linear(): These functions only work properly for gauge metrics, as they treat increases and decreases in input metrics the same, and don't interpret decreases as counter resets.
  • histogram_quantile(): This function only works for histogram metrics: each input series needs a valid le label indicating its bucket's upper boundary, and the cumulative counts across buckets need to be consistent with each other.

There are no functions that are specific to summary metrics (as there is not much you can compute over a quantile value), but you can conceptually see the quantile values exported from a summary metric as a set of gauges.

PromQL functions usually cannot detect and complain automatically when you pass in the wrong metric type, so we recommend following the Prometheus metric naming best practices to at least make it easy for humans to see which type of metric they are dealing with. For example, counters should end in _total, whereas gauges should not carry such a suffix.

Conclusion

As we have seen, metric types are conceptually most visible when instrumenting services with a client library, become less pronounced when Prometheus ingests metrics, and then become implicitly relevant again when building queries in PromQL. Metric type metadata might be used and propagated more strongly in future Prometheus versions, but for now it is important to understand the differences between types as a user to build correct instrumentation and queries.


September 25, 2020 by Julius Volz

Tags: promql, data model, metric types, instrumentation