Comparing PromQL Correctness Across Vendors

August 06, 2020 by Julius Volz

UPDATE: See our follow-up blog post from December 1 for the latest round of PromQL vendor test results.

TL;DR: See the details of all PromQL compliance test results on our PromQL Compliance Test Results page or in the table below:
Test target       Passing                       Cross-cutting issues   More details
Cortex            521 / 523 cases (99.62%)      1                      Detailed results
Thanos            523 / 523 cases (100.00%)     1                      Detailed results
TimescaleDB       523 / 523 cases (100.00%)     None                   Detailed results
VictoriaMetrics   356 / 523 cases (68.07%)      1                      Detailed results

Prometheus' query language PromQL is one of the most important interfaces in the Prometheus ecosystem. Organizations rely on it to build their dashboards, critical alerting, and automation. It enables flexible data processing, but it also includes subtle semantics that are important to get right. With Prometheus becoming more popular, we are seeing more and more PromQL implementations outside of Prometheus itself, both in open-source long-term storage projects and in hosted monitoring providers. Some of these PromQL implementations reuse parts of the Prometheus code base, while others opt for a complete reimplementation from scratch.

Seeing PromQL being adopted in external products and services is great! At the same time, we believe it is in the interest of the Prometheus community for implementations claiming to be PromQL-compatible to behave exactly like native PromQL. The reasons for this are:

  • Avoiding user confusion, where a user might expect two PromQL implementations to behave the same when they don't. In the worst case, this can lead to broken alerting and undetected outages.
  • Avoiding ecosystem fragmentation, similar to how there are now many diverging SQL dialects despite the existence of an SQL standard.
  • Avoiding vendor lock-in, where a user is unable to switch between long-term storage providers, because they started to rely on specific query language quirks of that provider.

In this blog post, we want to start shining a light onto the correctness of different PromQL implementations, with two goals in mind: creating transparency for users (enabling them to make better choices), and making it easier for projects or vendors to spot and fix bugs in their PromQL implementations. For this purpose, we started building an open-source PromQL compliance testing tool that checks the correctness of a given PromQL implementation and generates a test report. We'll dive into the details of this tool in this post, as well as some first comparison results.

What is "correct" in the absence of a specification?

While PromQL is documented on the Prometheus website, the documentation is not specific enough to serve as a full specification of all subtle behaviors. In the absence of a spec, how do we judge correctness? Since we care about portability and compatibility between implementations, we treat the actual exhibited behavior of the native PromQL engine as the specification (while taking cues from its code to consider edge cases to test). So to determine whether a given PromQL implementation executes a specific query correctly, we can run the query against both the native PromQL engine and the vendor implementation and compare the results. Given a comprehensive-enough set of test queries, this should at least unveil many common (or less common) misbehaviors.

The PromQL Compliance Tester

The PromQL Compliance Tester is an open-source tool which runs a set of configured range queries against a reference PromQL HTTP API endpoint (usually this would be a Prometheus server) and a test endpoint (the vendor implementation):

[Image: PromQL Compliance Tester architecture]

The tester expects you to pre-load both storage implementations with the same data, so that query results become comparable. It then runs a deep comparison of the returned query results for each test case, and records any differences. Finally, it outputs a text or HTML report, including a summary of the number of tests that have passed and failed, and the exact differences for failed cases.
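In essence (this is only a minimal sketch in Go, not the tester's actual code; the endpoint URLs, query, and timestamps are made-up examples), the comparison boils down to sending the same /api/v1/query_range request to both endpoints and deep-comparing the decoded results:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"reflect"
)

// queryRange sends a range query to a PromQL-compatible HTTP API endpoint
// and returns the decoded "data" part of the JSON response.
func queryRange(baseURL, query, start, end, step string) (interface{}, error) {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", start)
	params.Set("end", end)
	params.Set("step", step)

	resp, err := http.Get(baseURL + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Status string      `json:"status"`
		Data   interface{} `json:"data"`
		Error  string      `json:"error"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	if body.Status != "success" {
		return nil, fmt.Errorf("query failed: %s", body.Error)
	}
	return body.Data, nil
}

func main() {
	// Hypothetical endpoints: a reference Prometheus server and the
	// implementation under test.
	const referenceURL = "http://localhost:9090"
	const testURL = "http://localhost:19090"

	query := `rate(demo_cpu_usage_seconds_total[5m])`
	start, end, step := "2020-08-05T15:16:32Z", "2020-08-05T15:26:32Z", "10s"

	refData, err := queryRange(referenceURL, query, start, end, step)
	if err != nil {
		panic(err)
	}
	testData, err := queryRange(testURL, query, start, end, step)
	if err != nil {
		panic(err)
	}

	// The real tester records the exact differences; here we only report
	// whether the decoded results match.
	if reflect.DeepEqual(refData, testData) {
		fmt.Println("PASS:", query)
	} else {
		fmt.Println("FAIL:", query)
	}
}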

An example configuration file in the repository contains a set of PromQL queries that aim to test many prominent features of the language. The configuration file supports templating of queries to automatically introduce combinatorial variations of function names, operator types, duration ranges, and so on. This makes it easier to create test cases, without having to manually write out every possible query variant for a pattern.
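As an illustration of this combinatorial expansion (a hypothetical sketch using Go's text/template package; the pattern and variant lists here are made up and do not reflect the tester's actual configuration format), a single pattern can be expanded into one concrete query per combination like this:

package main

import (
	"bytes"
	"fmt"
	"text/template"
)

func main() {
	// A made-up query pattern with placeholders for the function name and
	// the range duration.
	pattern := `{{.fn}}(demo_api_request_duration_seconds_count[{{.rng}}])`

	fns := []string{"rate", "irate", "increase"}
	ranges := []string{"1m", "5m", "1h"}

	tmpl := template.Must(template.New("query").Parse(pattern))

	// Expand the pattern into one concrete test query per combination.
	for _, fn := range fns {
		for _, rng := range ranges {
			var buf bytes.Buffer
			if err := tmpl.Execute(&buf, map[string]string{"fn": fn, "rng": rng}); err != nil {
				panic(err)
			}
			fmt.Println(buf.String())
		}
	}
}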

Handling different types of behavior deviations

Not all differences in behavior are created equal. Depending on your use case, some differences may even be acceptable, while others are problematic violations. Also, some behavioral differences are specific to a single PromQL function, while others affect the overall structure of all queries. The latter may require applying general query parameter restrictions or output transformations, in order to still be able to compare results further. How to judge such differences is ultimately up to the user, but it's important to make those differences transparent in the comparison output report.

We will detail some of these cases and how we handled them further down.

First comparisons

While the PromQL Compliance Tester is still in an early stage, it is already useful for getting an overall idea of the correctness of different implementations. For this iteration, we ran a Prometheus server configured to scrape a set of demo metrics targets:

global:
  scrape_interval: 5s

scrape_configs:
- job_name: 'demo'
  static_configs:
    - targets:
      - 'demo.promlabs.com:10000'
      - 'demo.promlabs.com:10001'
      - 'demo.promlabs.com:10002'

With the data collected by this configuration, we ran the query patterns provided in the example tester configuration file.

For now, we tested Cortex, Thanos, VictoriaMetrics, and TimescaleDB. To replicate the same data into each of these systems, we either added a remote_write configuration to the above Prometheus configuration file, or, in the case of Thanos, simply pointed the compliance tester at the Thanos querier. This does not exercise Thanos' long-term storage integration, but we did not differentiate between storage tiers in the other systems yet either.

Cortex

Cortex reuses the PromQL library from Prometheus, while swapping out the underlying storage implementation. This makes it a likely candidate for good compliance. We tested Cortex version 1.2.0. Let's see how it fared in practice.

First, the tester uncovered that Cortex was affected by a small query start/end timestamp parsing rounding bug that had already been resolved in Prometheus a while ago. The bug has just been fixed in Cortex, so the next Cortex release should no longer have this issue. To make query results comparable for now, we enabled a tester option that chooses only integer start/end timestamps for queries, so that both engines parse them identically.

After accounting for this issue, we saw 99.62% of test cases passing against Cortex.

While no test cases failed in the result comparison stage, two test queries failed to execute at all:

QUERY: {mode="idle", instance!="demo.promlabs.com:10000"}
START: 2020-08-05 17:16:32 +0200 CEST, STOP: 2020-08-05 17:26:32 +0200 CEST, STEP: 10s
RESULT: FAILED: Query failed unexpectedly: execution: expanding series: query must contain metric name

And:

QUERY: histogram_quantile(0.9, {__name__=~"demo_api_request_duration_seconds_.+"})
START: 2020-08-05 17:16:32 +0200 CEST, STOP: 2020-08-05 17:26:32 +0200 CEST, STEP: 10s
RESULT: FAILED: Query failed unexpectedly: execution: expanding series: query must contain metric name

The reason for these failures is that Cortex does not support vector selectors without an exact metric name, such as queries that use regular expressions against the metric name. This is a known issue due to the way that Cortex indexes metrics and might get addressed at some point. In practice, this limitation will mostly affect debug-style queries, such as counting how many time series exist per job, per target, or per metric name (e.g. sort_desc(count by(__name__, job) ({__name__=~".+"}))). It is less likely to affect regular dashboard or alerting queries, and the failure will be explicit.

See the full results for Cortex.

Thanos

Thanos also reuses the upstream PromQL implementation. This makes it another likely candidate for near-100% compliance. Let's see if this ends up being true. We tested Thanos v0.14.0 by running a Thanos Sidecar against our reference Prometheus server and pointing a Thanos Query component at it.

Thanos ended up having the same query start/end timestamp parsing rounding bug as Cortex, which we accounted for in the same way as before. We also had to apply another test result transformation tweak to remove the external labels (which have to be configured for Thanos) from query results before comparing them.
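The kind of transformation this involves looks roughly like the following sketch (the replica label is just a hypothetical example of an external label; the tester's actual code differs):

package main

import "fmt"

// dropLabels returns a copy of a result series' label set with the given
// label names removed, so that externally added labels do not make
// otherwise identical series appear different during comparison.
func dropLabels(metric map[string]string, names ...string) map[string]string {
	out := make(map[string]string, len(metric))
	for k, v := range metric {
		out[k] = v
	}
	for _, n := range names {
		delete(out, n)
	}
	return out
}

func main() {
	series := map[string]string{
		"__name__": "demo_cpu_usage_seconds_total",
		"instance": "demo.promlabs.com:10000",
		"job":      "demo",
		"replica":  "prometheus-1", // hypothetical external label added for Thanos
	}
	fmt.Println(dropLabels(series, "replica"))
}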

After accounting for these issues, we saw 100% of test cases passing against Thanos.

See the full results for Thanos.

VictoriaMetrics

VictoriaMetrics is the only contender in this group that does not base its PromQL support on Prometheus' upstream implementation. Instead, VictoriaMetrics implements its own MetricsQL language that offers features beyond standard PromQL, but claims to be "backwards compatible with PromQL" as well. A few differences to native PromQL are already called out in their MetricsQL documentation. Let's see how these appear in the tests, and what other differences we may discover.

First, it turns out that VictoriaMetrics does not support arbitrary, millisecond-precision query start/end timestamps as native PromQL does, but rather aligns all input timestamps to multiples of the range query's resolution step. To be more precise, it apparently only does so under certain conditions. To compensate for this cross-cutting issue and to be able to compare results further, we configured the tester to only query for already-aligned timestamps in the first place.
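The effect of that setting is roughly the following (a simplified sketch; the timestamp and step values are arbitrary examples):

package main

import (
	"fmt"
	"time"
)

// alignToStep rounds a timestamp down to the nearest multiple of the query
// step (relative to the Unix epoch), so that both engines are only asked
// for already-aligned start/end values.
func alignToStep(t time.Time, step time.Duration) time.Time {
	ns := t.UnixNano()
	return time.Unix(0, ns-ns%int64(step)).UTC()
}

func main() {
	step := 10 * time.Second
	start := time.Date(2020, 8, 5, 15, 16, 32, 123000000, time.UTC)

	fmt.Println("original start:", start)
	fmt.Println("aligned start: ", alignToStep(start, step))
}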

After applying the general timestamp tweak across all queries, we still saw only 68% of test cases passing against VictoriaMetrics.

The differences in VictoriaMetrics query results stem from a number of issues:

  • As documented, range windows like [5m] behave differently in VictoriaMetrics and always select a sample before the window as well. This even causes some over-time aggregations (like sum_over_time(demo_cpu_usage_seconds_total[1s])) to return results where Prometheus returns none. It also causes generally different numeric results.
  • Hexadecimal scalar literals like 0x3d cannot be parsed.
  • Any time series with a value of NaN is removed from results, instead of being included with a NaN sample value. NaN series can still be very relevant in expressions, so this could pose a real problem. Apparently this is known or intended behavior.
  • Some queries succeed in VictoriaMetrics that fail in Prometheus, for example series selectors without at least one non-empty matcher ({__name__=~".*"}) or negative offsets (demo_cpu_usage_seconds_total offset -1m). Since MetricsQL advertises itself as a superset of PromQL, some of these differences are likely intentional and expected. However, label_replace() also succeeds in VictoriaMetrics when supplying an invalid target label name, which might lead to problems. We opened an issue for this.
  • Some queries don't handle illegal parameter values the same as native PromQL, such as calculating quantiles with an invalid quantile parameter (quantile(-0.5, demo_cpu_usage_seconds_total)) returning a normal number instead of NaN.
  • Binary filter operators between a scalar and a vector (e.g. 0.12345 < demo_cpu_usage_seconds_total) should always return the sample value from the vector side of the operation, no matter which side of the operator the vector is on. VictoriaMetrics instead always returns the sample value from the left-hand side, even if that is the scalar. We filed an issue for this as well.
  • Some functions like avg_over_time(), round(), etc. incorrectly keep the metric name of the input series where Prometheus deletes it (to signal that the time series no longer has the original metric name's meaning). We filed another issue.
  • Subqueries yield different numeric values, possibly for the same reason as the range selector differences described above.
  • The precedence of the unary minus and the power operator seems to be inverted, so that -1^2 yields -1 in Prometheus (where the power operator binds more tightly), but 1 in VictoriaMetrics. We filed another issue.

Some of these issues are minor and are unlikely to cause trouble in day-to-day queries, while others (like NaN series being filtered out or wrong sample values being calculated) may cause significant problems. We hope that VictoriaMetrics will improve on these points, or otherwise adjust their claims around PromQL compatibility.

See the full results for VictoriaMetrics.

TimescaleDB

TimescaleDB (via its Prometheus adapter) also reuses the upstream PromQL implementation, which again puts it in a good position to easily achieve close compliance.

When we started running tests against TimescaleDB's PromQL implementation, there was no official release of it yet. In the process of testing, we discovered some query errors that we found hard to explain (like metric names missing on some selected time series). Since we were already coincidentally communicating with the TimescaleDB developers at the time, they managed to find and fix the underlying bugs quickly (including a caching issue) and in the process also cut a first beta release (0.1.0-beta.1) that includes the PromQL support.

With this first beta release, we saw 100% of test cases passing against TimescaleDB, without needing any further cross-cutting query corrections.

See the full results for TimescaleDB.

Future work

It's still early days for the PromQL Compliance Tester. In particular, we would love to add and improve the following points:

  • Test instant queries in addition to range queries.
  • Add more variation and configurability to input timestamps.
  • Flesh out a more comprehensive (and less overlapping) set of input test queries.
  • Automate and integrate data loading into different systems.
  • Test more vendor implementations of PromQL (for example, Sysdig and New Relic).
  • Version test results and make pretty output presentations easier.

Note: Many people will be interested in benchmarking performance differences between PromQL implementations. While this is important as well, the PromQL Compliance Tester focuses solely on correctness testing.

If you would like to help flesh out the tester, please file issues or pull requests against the PromQL Compliance Tester repository!

Conclusion

We hope that these early results are already useful for creating a more correct and transparent PromQL implementation landscape. Already in these early tests, the tool has managed to find a number of bugs in all tested implementations. We also hope to publish regular updates on vendor implementation correctness.

To view all test results, head to our PromQL Compliance Tests page, available from our "Resources" menu item at the top starting today.



Tags: promql, compliance, testing, vendors, cortex, thanos, victoriametrics, timescaledb