Dealing with Missing Time Series in Prometheus

September 13, 2023 by Julius Volz

Due to Prometheus' dimensional data model and the way it tracks and collects metrics, some of the metrics referenced in your PromQL expressions may sometimes be missing. Let's look at what causes missing series, how they can cause trouble for your dashboards and alerting rules, and what you can do about them.

Why are time series sometimes missing?

There are two major causes for missing time series:

When nothing has happened yet for a given label value

Consider a labeled counter metric that tracks a number of operations, partitioned by the type of operation:

operations_total{optype="<type>"}

Since the instrumentation client library cannot guess the possible values for the optype label without you telling it, the series for each operation type will only start existing once the first operation of that particular type has occurred. And when the service starts up, it will initially not be able to report any series at all for the operations_total metric until at least some operation happens.

This applies to any metric with custom instrumentation labels. Metrics without labels do not have this issue, since the client library can initialize them to 0 or NaN (depending on the metric type) and expose them immediately after startup.
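To make the mechanism concrete, here is a minimal sketch in Go of a labeled counter that creates its series lazily. The counterVec type and its methods are hypothetical stand-ins written for this illustration. They are not the real client library API, which differs in detail but behaves the same way in this respect:

```go
package main

import (
	"fmt"
	"sort"
)

// counterVec is a toy stand-in for a labeled counter (hypothetical type,
// not the real client library): one series per label value, created lazily.
type counterVec struct {
	series map[string]float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: map[string]float64{}}
}

// Inc increments the series for the given optype label value,
// creating that series on first use.
func (c *counterVec) Inc(optype string) {
	c.series[optype]++
}

// Exposed returns the label values a scrape would currently see, sorted.
func (c *counterVec) Exposed() []string {
	out := make([]string, 0, len(c.series))
	for optype := range c.series {
		out = append(out, optype)
	}
	sort.Strings(out)
	return out
}

func main() {
	ops := newCounterVec()
	// Nothing has happened yet, so there is no series at all.
	fmt.Println(len(ops.Exposed()))

	ops.Inc("read")
	// Only "read" exists now; "create" is still missing.
	fmt.Println(ops.Exposed())
}
```

Until the first operation of a given type occurs, there is simply no entry for it, so a query selecting that label value has nothing to return.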

When a target is down or absent

Metrics can also be missing because a target is down, or because Prometheus is not even trying to scrape the target at all for some reason (for example, when the service discovery integration is broken). We will look at this in a separate blog post and only focus on the case of missing label dimensions here.

Problems caused by missing time series

Imagine a PromQL query for the total operations rate:

sum without(optype) (rate(operations_total{job="my-job"}[5m]))

If no operation has happened at all yet, the expression will return an empty result instead of a rate with the value of 0, as you may have expected.

Or if some operations have happened already, but you are trying to query for a specific operation type that has not happened yet (create in this example), you can also run into trouble:

rate(operations_total{job="my-job",optype="create"}[5m])

This will also give you an empty result where you may have expected a 0-valued rate output for the create operation type.

In a Grafana dashboard, this means that you will see a "No data" warning message instead of a rate with a value of 0. If you use this expression in an alerting rule, the alert may even silently fail to fire, since Prometheus interprets an empty alerting rule expression output as "everything is fine"! This is less of a problem when you alert on too-high rates of something (since the label value in question will definitely exist if there is a high rate for it), but it can bite you if you are trying to catch too-low rates.
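The silent failure can be sketched as a toy model in Go. The sample type and the firingBelow function are hypothetical names invented for this illustration, and the model simplifies rule evaluation to a plain filter over an instant vector. It only assumes what is stated above: each output element of the rule expression becomes an alert, and an empty output means no alerts.

```go
package main

import "fmt"

// sample is one element of an instant vector: a label set (flattened to a
// string here for simplicity) and a value.
type sample struct {
	labels string
	value  float64
}

// firingBelow models an alerting expression of the form `<expr> < threshold`:
// it keeps only the samples below the threshold. Each surviving sample
// would become one firing alert.
func firingBelow(vector []sample, threshold float64) []sample {
	var firing []sample
	for _, s := range vector {
		if s.value < threshold {
			firing = append(firing, s)
		}
	}
	return firing
}

func main() {
	// The "create" series exists with a rate of 0: the too-low alert fires.
	present := []sample{{labels: `{optype="create"}`, value: 0}}
	fmt.Println(len(firingBelow(present, 1)))

	// The "create" series is missing: there is nothing to compare against
	// the threshold, so nothing fires, although the real rate is just as low.
	var missing []sample
	fmt.Println(len(firingBelow(missing, 1)))
}
```

The comparison never sees a missing series, so a too-low alert cannot distinguish "no series" from "everything is fine".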

Dealing with missing time series

Depending on the situation, there are a few ways to deal with missing time series:

When feasible: pre-initialize series!

If you only have a small number of possible values for a label that you know beforehand, the best course of action is to pre-initialize each value to 0 right after program startup. For example, if we only had the create, read, update, and delete operation types, we could initialize series for each of them like this in Go:

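// Reference each known label value once so that its series is
// exposed with an initial value of 0.
for _, val := range []string{"create", "read", "update", "delete"} {
    operationsTotal.WithLabelValues(val)
}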

Note that we only reference each labeled series within the metric here and do not call its .Inc() method. Referencing a series is enough to create it with an initial value of 0 for each operation type, without incrementing it.

Use or to join in a default value

Sometimes you may not know all possible label values beforehand, or there may be too many of them to pre-initialize them all. For example, consider a metric that tracks the number of HTTP requests by status code, and you want to calculate the rate for a specific error code, like 503:

sum by(job, instance) (rate(http_requests_total{job="my-job",status="503"}[5m]))

If no request with a 503 response status has happened yet, this will return an empty result. But it probably doesn't make sense to proactively create series for all possible status codes, since that would create a lot of time series that will usually not be needed.

In cases like this, you have three options: you can accept that your dashboards will sometimes show "No data" warnings, you can avoid writing alerting rules that silently fail when a series is missing, or you can use the or set operator to join in a default value for the missing series, using the up metric for the same targets as the series you are querying for:

  sum by(job, instance) (rate(http_requests_total{job="my-job",status="503"}[5m]))
or
  up{job="my-job"} * 0

The up metric is convenient for this purpose since it is always present for targets that are currently being scraped, and it has the advantage that it has compatible target labels. This makes label matching easy and gives you the expected result labels. The example above will now either return the actual request rate from the http_requests_total metric (if any 503 requests have already happened), or 0 (since we force the up metric to a value of 0 via multiplication).
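The way or fills in the default can be sketched as a toy model in Go. The vector type, the orVectors function, and the host names are hypothetical and exist only for this illustration. The sketch assumes both sides carry exactly the same job and instance labels, flattened into a single string key:

```go
package main

import "fmt"

// vector maps a flattened label set (here "job/instance") to a sample value.
type vector map[string]float64

// orVectors models PromQL's `or` set operator under the assumption above:
// the result holds every element of left, plus those elements of right
// whose label set has no match in left.
func orVectors(left, right vector) vector {
	out := vector{}
	for labels, v := range left {
		out[labels] = v
	}
	for labels, v := range right {
		if _, matched := left[labels]; !matched {
			out[labels] = v
		}
	}
	return out
}

func main() {
	// Per-instance 503 rates: only host-a has served a 503 so far.
	rates := vector{"my-job/host-a:8080": 0.25}
	// up * 0: one zero-valued sample per scraped target.
	defaults := vector{"my-job/host-a:8080": 0, "my-job/host-b:8080": 0}

	result := orVectors(rates, defaults)
	// host-a keeps its real rate, host-b falls back to the default of 0.
	fmt.Println(result["my-job/host-a:8080"], result["my-job/host-b:8080"])
}
```

The left-hand side always wins where it has data, so the zero-valued default only shows up for targets that have no 503 series yet.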

Conclusion

In this blog post, we looked at how missing time series can cause trouble in your dashboards and alerting rules, and what you can do about it. Hopefully the tips in this article will help you avoid some of the pitfalls of missing series in your instrumentation code and in your PromQL expressions. To learn more Prometheus fundamentals from the ground up, take a look at our Prometheus training courses!



Tags: prometheus, time series, promql, instrumentation