PromQL Queries for Exploring Your Metrics

December 17, 2020 by Julius Volz

When building PromQL queries, you often need to get a sense of the data you are working with and to explore the space of available metrics and their labels. While the UI tooling for this is still evolving (PromLens and Grafana both have some data exploration features), you can also map out your data using PromQL queries.

Getting all series in the TSDB

The simplest (but not usually practical) possibility is to just list all series that have recent data. There are multiple ways to do this. For example, you could write a selector for any series that has a non-empty metric name (which is true for all series in Prometheus). Internally, the metric name is stored in a special label with the name __name__, so you could just query for:

{__name__!=""}

Caution: This returns all active series, so on a large Prometheus this may overload either your server or your UI.

This can make sense on small Prometheus servers, where it's feasible to digest and manually explore all series at once. Most of the time though, you will want to filter or aggregate a bit further before taking a look at the output data.

Getting all series for a metric name

This is the most common use case when starting out with a query. If you already know which metric name you are working with (e.g. demo_cpu_usage_seconds_total) and just want to see all associated series and their label pairs, you could query for:

demo_cpu_usage_seconds_total

This should work well most of the time, unless there is a huge number of series attached to the metric name.

Getting all metric names

You may also want to get a list of all metric names first. If your UI does not already show you this list in one way or another (it really should!), you can query for it by first selecting all series, then grouping by the special metric name label __name__:

group by(__name__) ({__name__!=""})

Note: A more efficient, non-PromQL way to get the same information is to use Prometheus's label values metadata endpoint to get all values for the __name__ label, e.g.: https://demo.promlabs.com/api/v1/label/__name__/values. This is the method that UIs typically use to present you with a list of metric names. While the PromQL query above gives you the metric names of all series with recent (<5m old) data, this metadata API endpoint will by default give you all metric names that are known to the TSDB at any time.
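As a rough sketch of how a script might use this endpoint, the Python below builds the label values URL and pulls the values out of a response body. The helper names and the sample payload are made up for illustration; the only parts taken from Prometheus are the /api/v1/label/<label>/values path and the JSON envelope with "status" and "data" fields.

```python
import json
from urllib.parse import quote


def label_values_url(base_url, label):
    """Build the label values URL for one label name (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/label/{quote(label, safe='')}/values"


def parse_label_values(body):
    """Return the sorted label values from a JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"request failed: {payload.get('error', 'unknown error')}")
    return sorted(payload["data"])


# Made-up sample response, shaped like the API's JSON envelope.
sample_body = json.dumps(
    {"status": "success", "data": ["up", "demo_cpu_usage_seconds_total"]}
)

url = label_values_url("https://demo.promlabs.com", "__name__")
names = parse_label_values(sample_body)
print(url)
print(names)
```

To run this against a real server, you would fetch the URL (for example with urllib.request.urlopen) and pass the decoded body to parse_label_values instead of the sample string.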

Getting all values for a specific label

Sometimes you are already working with a particular label name on a given metric and you may just be wondering what all of its values are. Let's say you wanted to list all possible values for the mode label on the metric name demo_cpu_usage_seconds_total. Then you could query for:

group by(mode) (demo_cpu_usage_seconds_total)

You could even do this across all metric names / all series:

group by(mode) ({__name__!=""})

Note: In this latter case, you may again use the label values API endpoint (but this time for the mode label) instead of a PromQL query.
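To make the grouping behavior concrete, here is a toy Python model of what group by(mode) does to a set of series. It is not how Prometheus implements aggregation, and the label sets are invented for the example. It does show one detail worth knowing when you group across all series: series that lack the mode label collapse into a single group with an empty label set.

```python
def group_by(series, labels):
    """Toy model of PromQL's `group by(...)`.

    Produces one output group per distinct combination of the requested
    labels. Labels a series does not have are simply absent from its key,
    so series with none of the requested labels share the empty group {}.
    """
    groups = set()
    for label_set in series:
        key = tuple(sorted((k, v) for k, v in label_set.items() if k in labels))
        groups.add(key)
    return [dict(key) for key in sorted(groups)]


# Made-up label sets standing in for the selected series.
series = [
    {"__name__": "demo_cpu_usage_seconds_total", "mode": "idle", "instance": "a"},
    {"__name__": "demo_cpu_usage_seconds_total", "mode": "user", "instance": "a"},
    {"__name__": "demo_cpu_usage_seconds_total", "mode": "idle", "instance": "b"},
    {"__name__": "up", "instance": "a"},
]

modes = group_by(series, {"mode"})
print(modes)
```

In this model the up series contributes the empty group, which mirrors the extra label-less result you can expect from the all-series query when only some metrics carry the label.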

Breaking up series cost by job, instance, metric name, ...

For a Prometheus server administrator, the total number of time series that a server needs to keep track of is one of the main memory and scaling bottlenecks to watch out for. To get a feeling for where "all those series" are coming from, you can write PromQL queries to count how many series there are for a given job, instance (target), metric name, or other type of dimensional grouping. Here are just a few examples of these types of debug queries, sorted by largest count first:

Number of series per metric name:

sort_desc(count by(__name__) ({__name__!=""}))

Number of series per target:

sort_desc(count by(instance) ({__name__!=""}))

Number of series per job and metric name combination:

sort_desc(count by(job, __name__) ({__name__!=""}))

You may already suspect that a particular label name has many different values. To count how many different values the le histogram bucket label has in a given histogram metric (e.g. demo_api_request_duration_seconds_bucket), you could query for:

count(group by(le) (demo_api_request_duration_seconds_bucket))

...and you can take this further to count how many series there are for any dimensional combination, such as the le label per job in the same histogram:

count(group by(le, job) (demo_api_request_duration_seconds_bucket))

This way, you can get a good idea of how the cost of your metrics breaks down.
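If you want to collect such counts from a script rather than a UI, a minimal sketch could look like the following. It assumes the standard instant query endpoint (/api/v1/query) and its vector result format, where each sample value is a [timestamp, "string"] pair. The function names and the sample response are hypothetical, and the ranking is done client-side so the query itself does not need sort_desc.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, query):
    """Build an instant query URL for a PromQL expression (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def series_counts(body, label):
    """Turn a `count by(<label>) (...)` vector result into ranked (value, count) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    counts = [
        # Sample values arrive as strings, so convert before comparing.
        (sample["metric"].get(label, ""), int(float(sample["value"][1])))
        for sample in data["result"]
    ]
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Made-up sample response for: count by(job) ({__name__!=""})
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "demo"}, "value": [1608200000.0, "120"]},
            {"metric": {"job": "prometheus"}, "value": [1608200000.0, "950"]},
        ],
    },
})

url = instant_query_url("https://demo.promlabs.com", 'count by(job) ({__name__!=""})')
ranked = series_counts(sample_body, "job")
print(url)
print(ranked)
```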

Getting the meaning and type of a metric name

For understanding the semantic meaning of your metrics, their documentation string and metric type information (counter, gauge, histogram, summary) are very useful. This information is not directly accessible via PromQL, but the Prometheus server keeps it in memory for each target and it can be queried via another metadata API endpoint. For example: https://demo.promlabs.com/api/v1/metadata.
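As an illustration, the sketch below turns a response from this metadata endpoint into one readable line per metric. It assumes the response shape used by /api/v1/metadata, where "data" maps each metric name to a list of entries with "type", "help", and "unit" fields (a list, because different targets may report different metadata for the same name). The function name and the help strings in the sample are invented.

```python
import json


def describe_metrics(body):
    """Return 'name (type): help' lines from a metadata response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"request failed: {payload.get('error', 'unknown error')}")
    lines = []
    for name, entries in sorted(payload["data"].items()):
        for entry in entries:
            lines.append(f"{name} ({entry['type']}): {entry['help']}")
    return lines


# Made-up sample response; the help texts are placeholders.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "up": [
            {"type": "gauge", "help": "Placeholder help text.", "unit": ""},
        ],
        "demo_cpu_usage_seconds_total": [
            {"type": "counter", "help": "Placeholder CPU help text.", "unit": ""},
        ],
    },
})

descriptions = describe_metrics(sample_body)
for line in descriptions:
    print(line)
```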

To make this data more accessible, PromLens offers a built-in metrics explorer that allows fuzzy-searching across all metric names and that shows their metadata.

This metrics explorer is still pretty simple, but we have a lot of ideas for evolving it into a more powerful metrics exploration toolkit!

Conclusion

While there is still ongoing work on metrics exploration UIs in Prometheus, there are a lot of PromQL queries you can run in the meantime to get a better overview of your metrics and their cost. We hope that this article will be useful the next time you are looking at the data available in a Prometheus server!



Tags: promql, exploration, metrics, promlens