An Update on PromQL Compatibility Across Vendors

December 01, 2020 by Julius Volz

TL;DR: See the details of all PromQL compliance test results on our PromQL Compliance Test Results page or in the table below:

Test target	Passing	Cross-cutting issues	More details
Chronosphere (hosted)	525 / 525 cases (100.00%)	None	Detailed results
Cortex (v1.5.0 - chunks)	523 / 525 cases (99.62%)	None	Detailed results
Cortex (v1.5.0 - blocks)	525 / 525 cases (100.00%)	None	Detailed results
Grafana Cloud (hosted)	525 / 525 cases (100.00%)	1	Detailed results
M3 (v1.0.0)	525 / 525 cases (100.00%)	None	Detailed results
MetricFire (hosted)	523 / 525 cases (99.62%)	1	Detailed results
New Relic (hosted)	163 / 525 cases (31.05%)	2	Detailed results
Promscale (0.1.2)	525 / 525 cases (100.00%)	None	Detailed results
Thanos (v0.17.0)	525 / 525 cases (100.00%)	None	Detailed results
VictoriaMetrics (v1.47.0)	312 / 525 cases (59.43%)	1	Detailed results

Prometheus's query language PromQL is one of the cornerstones of the Prometheus ecosystem. Back in our August blog post, Comparing PromQL Correctness Across Vendors, we looked at several external projects and monitoring vendors that claimed to offer PromQL-compatible APIs. Using an open-source compliance checker tool, we evaluated each implementation by running a set of test queries against both the native Prometheus server and the vendor implementation. As a result, we found multiple bugs in the tested projects, and mapped out in detail how each deviated from the upstream implementation.
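To make the comparison step more concrete, here is a minimal Python sketch of how two range query results could be checked against each other. It assumes both systems return the standard Prometheus "matrix" result format (a list of series, each with a `metric` label set and `values` given as `[timestamp, "value"]` pairs). The names `results_match` and `_index` and the relative tolerance are illustrative assumptions, not the actual logic of the compliance checker tool.

```python
import math


def _index(result):
    """Map each series' label set to its {timestamp: float value} samples."""
    return {
        frozenset(series["metric"].items()): {
            float(ts): float(val) for ts, val in series["values"]
        }
        for series in result
    }


def results_match(reference, candidate, rel_tol=1e-9):
    """Return True if both matrix results hold the same series and samples.

    The relative tolerance is an assumption made for this sketch; it absorbs
    tiny floating-point differences between implementations.
    """
    ref, cand = _index(reference), _index(candidate)
    if ref.keys() != cand.keys():
        return False  # a series is missing or extra
    for labels, ref_samples in ref.items():
        cand_samples = cand[labels]
        if ref_samples.keys() != cand_samples.keys():
            return False  # samples exist at different timestamps
        for ts, ref_val in ref_samples.items():
            cand_val = cand_samples[ts]
            if math.isnan(ref_val) and math.isnan(cand_val):
                continue  # NaN never equals NaN, so handle it explicitly
            if not math.isclose(ref_val, cand_val, rel_tol=rel_tol):
                return False
    return True
```

A check like this treats any missing series, missing sample, or differing value as a failure of the whole test case, which is one reason why a single underlying bug can fail many cases at once.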

Meanwhile, more vendors have entered the PromQL space, while existing projects have worked on improving their PromQL support as a result of our last blog post. Today, we are presenting an updated set of PromQL test results both for existing projects and a couple of new ones. We tested each of these implementations against the latest Prometheus version, v2.23.0.

Before we start: a note on interpreting test scores

When looking at individual test results, please note that the numeric scores alone paint a limited picture. They don't necessarily tell you how impactful implementation errors are, nor how many distinct behavioral differences there are. Since PromQL expressions naturally compose and build on top of each other, it is infeasible to isolate all possible query errors into fully independent test cases. Many of the test queries contain the same constructs as other queries, in a nested way. For example, if an implementation selects data incorrectly, this breaks not only a simple data selection test query, but also all other test cases that try to run a function, operator, or other transformation based on the selected data. The test cases also include many variations of the same type of query, but with different ranges, quantile values, or other parameters. So if, for example, a function is broken in a general way, it may cause multiple test case variants to fail at the same time.

Another matter that requires more subtle interpretation is vendor extensions and extreme edge-case behavior. For example, a vendor may choose not to raise an error when executing a function with a parameter value that is illegal in native PromQL, but legal and sensible in the extended language. Or a vendor may return a different value for an unusual edge case, such as running histogram_quantile() with a quantile value larger than 1.

We still provide the numeric test scores as a quick way to map out the vendor landscape and to motivate the ecosystem towards achieving a 100% score. But if you need to understand behavioral differences in more depth, please take a look at the full test results and read on for the details.

Criteria for including a vendor in tests

We are always on the lookout for new PromQL implementations to include in these tests. To include a project or vendor, we look for two criteria:

Someone advertises full or partial PromQL compatibility for their system, or otherwise positions their solution as a drop-in replacement for an existing PromQL implementation.
For practical purposes, we need to be able to ingest test data into the system and query it back via a PromQL-style HTTP API. Otherwise, we can't run any tests.

Unfortunately, some vendors failed the second criterion for now, so we couldn't include them:

Sysdig advertises PromQL support, but has no remote_write protocol support to ingest arbitrary test data. According to contacts at Sysdig, remote_write may be supported at some point in the future, but we couldn't get a recent update on its availability timeline.
Wavefront advertises partial PromQL support, but they lack an HTTP endpoint against which to run test queries. Their PromQL support is only available within their own user interface so far, which makes it difficult to test against. However, judging by the various compatibility caveats listed in their documentation, we do not expect that they would reach a high test score if we could test them.

We hope to be able to test even more vendors in the future.

UPDATE: After publishing, we received questions about PromQL support in two more vendors:

InfluxDB once started implementing PromQL support (coincidentally with the help of the author of this blog post), but as far as we are aware, this effort has been paused and PromQL support in InfluxDB is neither currently available nor is it being advertised.
Elastic recently advertised PromQL support in Elastic Metrics. However, the announced feature is only about Elastic Metrics being able to run queries against an existing external PromQL endpoint (such as Prometheus) to then ingest the query result as metrics into Elasticsearch. There is no way to run PromQL on the resulting data stored in Elasticsearch itself.

Minor updates to the test query set

Since the last testing round in August, we made minor changes to the test query set:

For queries that looked at the absolute values of counter metrics (e.g. with no rate() or similar counter-related function applied), we changed the selected metrics to be gauges instead. This helps keep test cases somewhat more independent for a vendor like New Relic, which does not support storing or retrieving absolute values for counters.
We added a query selecting an intermittently present series to test whether staleness handling works.

Let's get started: Updated comparisons

In this round of tests, we tested the following projects and vendors, in alphabetical order:

Chronosphere - A hosted monitoring and observability platform.
Cortex - An open-source, horizontally scalable reimplementation of Prometheus.
Grafana Cloud - A hosted monitoring and observability service.
M3 - An open-source metrics engine and time series database.
MetricFire - A hosted monitoring and observability service.
New Relic - A hosted monitoring and observability service.
Promscale (TimescaleDB) - An open-source project to store Prometheus data in TimescaleDB.
Thanos - An open-source project to provide query aggregation for long-term storage and HA on top of Prometheus.
VictoriaMetrics - An open-source time-series database and monitoring system.

Let's look at the test results for each of these systems:

Chronosphere

Chronosphere does not have a self-service account creation system yet, but they helpfully provided us with a test account with remote_write and PromQL HTTP API support. Chronosphere is built on top of M3 (tested below), so we expected similar test results for the two. After an initial test run against an outdated test environment using an older M3 version, Chronosphere updated our test account and we achieved a test score of 100% against it.

Cortex

Last time we tested Cortex we encountered two minor issues: floating-point input timestamps were not rounded in the same way as in Prometheus, and the system did not support executing queries without a metric name. The first issue has been resolved entirely, while the second one is being addressed by a new storage mode in Cortex - the new blocks storage mode vs. the older chunks storage mode.

Let's look at version 1.5.0 of Cortex and see what scores each of these modes received:

Cortex with chunks storage

When testing Cortex with the old chunks storage, we can now remove the query tweak to align input timestamps, since that bug has been fixed. Other than that, we still get the same test score of 99.62% due to the chunks mode not supporting queries without metric names.

Cortex with blocks storage

Cortex with blocks storage mode no longer requires any query tweaks either and passes with a score of 100%.

Grafana Cloud

Grafana Cloud is a new test candidate that advertises full hosted Prometheus functionality. Under the hood, Grafana Cloud uses Cortex to offer this service. According to representatives from Grafana Labs, they are in the process of phasing out the last (shared) chunks-based Cortex cluster, while all new Cortex clusters are already using blocks as their storage mode.

To be able to cache PromQL queries, Grafana Cloud aligns incoming query start and end timestamps to the query resolution step width (10 seconds in our case). This is currently not configurable on the Grafana Cloud side, so we had to apply a cross-cutting query tweak to align input timestamps to our reference Prometheus server in the same way, to get comparable results.
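As a rough illustration of this kind of tweak, the following Python sketch aligns the start and end of a range query to the step width. It assumes Unix timestamps in seconds and rounding down to the nearest multiple of the step; the rounding direction and the function names are assumptions made for this example, not a description of Grafana Cloud's internals.

```python
def align_to_step(timestamp, step):
    """Round a Unix timestamp (in seconds) down to a multiple of the step."""
    return timestamp - (timestamp % step)


def align_query_range(start, end, step):
    """Align both ends of a range query to the step width."""
    return align_to_step(start, step), align_to_step(end, step)


# With a 10-second step, both ends land on a multiple of 10.
print(align_query_range(1606780803, 1606781407, 10))
# prints (1606780800, 1606781400)
```

Applying the same alignment to the queries sent to the reference Prometheus server means that both systems evaluate the expression at identical timestamps, so the results stay comparable.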

With this in mind, we tested a Grafana Cloud account via remote_write and their Prometheus-style HTTP query API on both the remaining legacy cluster and a new dedicated blocks-storage cluster, and we got the same 99.62% and 100% test scores as for Cortex itself. Since the chunks-mode cluster is being retired this month, we count that as an overall 100% score for Grafana Cloud.

M3

M3 is also a new candidate in this group that states, "M3 can be used as Prometheus Remote Storage and has 100% PromQL compatibility", on its homepage. Its documentation also mentions Prometheus and PromQL integrations that include remote_write support and a PromQL HTTP API endpoint. Using these with M3 version v1.0.0, we ran our test suite and received a test score of 100%.

Note: For PromQL to work 100% the same as in Prometheus, we had to ensure that our test query was only hitting raw, non-aggregated data in M3 (vs. data that has been downsampled into a lower resolution). We did this by not configuring any aggregation in our test M3 database.

MetricFire

MetricFire, a hosted service by the people behind hostedgraphite.com, is still pretty new in the Prometheus space. While MetricFire documents how to write data into the system using the remote_write protocol, there is no officially supported way to run external PromQL queries against the collected data using an HTTP API yet. However, MetricFire representatives kindly explained a way to do this. We won't detail it here, but we expect that there will be a documented public HTTP API soon. In our talks with MetricFire we learned that they also use Cortex to back their service, and we also managed to point out minor issues around their HTTP API, which were quickly resolved.

MetricFire is still using a slightly outdated version of Cortex, so we ran into similar input timestamp parsing inaccuracies as with Cortex in August, and queries without a metric name are not yet supported due to using chunks-based storage. With a cross-cutting input timestamp query tweak applied, MetricFire finally achieved a score of 99.62% in our tests. Hopefully, a Cortex upgrade will resolve the remaining timestamp and metric name issues automatically in the future.

New Relic

Disclosure: The author of this blog post has been consulting for New Relic as a freelance contractor (not currently as part of PromLabs) on working towards improved PromQL support.

New Relic is a hosted observability platform that has recently released features that allow writing Prometheus data into New Relic and querying it via a PromQL-style API. The documentation advertises PromQL support and the ability to use New Relic as a Prometheus datasource for existing Grafana dashboards. Further documentation explains differences in implementation and chooses to use the term "PromQL-style" rather than "PromQL".

An overall challenge for New Relic's PromQL support is that it is based on a transpilation approach, where any PromQL query has to be translated into New Relic's existing query language, NRQL. Not only does NRQL work very differently from PromQL, the underlying data model is also not always possible to reconcile with PromQL's needs. For example, New Relic differentiates metric types (counter, gauge, etc.) on the database level and stores counter metric samples only as deltas (relative increments to the previous sample). Prometheus's TSDB does not have a concept of metric types and stores absolute values for all samples. To determine that an incoming metric from Prometheus is a counter and subsequently transform its absolute value into a delta, New Relic currently applies a heuristic that treats metrics with suffixes such as _total, _count, _sum, or _bucket as counters, and other metrics as gauges. This is not perfect and can miscategorize counters as gauges, and vice versa. Recent changes in Prometheus to allow propagation of metric metadata via the remote_write interface will hopefully improve this situation somewhat.
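The suffix heuristic and the delta conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not New Relic's implementation: the function names, the example metric names, and the way a counter reset is handled are all assumptions made for this sketch.

```python
# Suffixes named in the text above as indicating a counter.
COUNTER_SUFFIXES = ("_total", "_count", "_sum", "_bucket")


def guess_metric_type(name):
    """Guess the metric type from the name suffix alone."""
    return "counter" if name.endswith(COUNTER_SUFFIXES) else "gauge"


def to_deltas(values):
    """Turn absolute counter samples into increments over the previous sample.

    Assumption for this sketch: a drop in value is treated as a counter
    reset, so the new value itself counts as the increment.
    """
    deltas = []
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas


print(guess_metric_type("http_requests_total"))  # prints counter
print(guess_metric_type("disk_space_total"))     # prints counter
print(to_deltas([10, 15, 15, 3, 8]))             # prints [5, 0, 3, 5]
```

The hypothetical `disk_space_total` shows the weakness of a name-based approach: if such a metric were actually a gauge, it would be stored as deltas and its absolute value would be lost.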

In our testing, we encountered a large number of behavioral differences in comparison to Prometheus. Here are some examples:

Binary operators between two vectors are not yet supported with any vector matching or grouping options.
Range vector selectors (like my_metric[5m]) ignore the time range that is passed into them and always select a fixed (short) time range. Thus functions like rate(), resets(), delta(), etc. may return completely different results relative to Prometheus. UPDATE: Initially we interpreted the slight difference in output values as a lack of range vector duration support (which was the case for New Relic once), but range vectors with user-specified durations are supported now. Thanks to New Relic for finding this mistake!
Querying for special float value literals like NaN or Inf results in a query for a corresponding metric name, rather than returning that float value.
Selectors that don't match anything (and return an empty result in native PromQL) return a single time series with the sample value 0.
Staleness handling of intermittently present metrics is not supported.
Several numeric functions or operators (e.g. max(), min(), and quantile()) return either slightly or significantly different numeric sample values.
Many query types and functions still return "HTTP 501 Not implemented" errors.
Scalar-vector binary arithmetic operators apply the operands in the incorrect order (e.g. 42 / foo results in the same value as foo / 42). UPDATE: As a contact at New Relic pointed out, this is not actually true, and we too quickly misinterpreted some difference in sample values. The actual reasons for differences in some of the binary operator cases are either different interpretations for certain operators (like % with floating-point values), precedence, or other not fully explored reasons.
Operator precedence works a bit differently (e.g. -1 ^ 2 resulting in 1 vs. -1).
...and multiple other behavioral differences.
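Two of the differences listed above come down to how an expression is interpreted rather than to missing features. The short Python sketch below shows how small such interpretation differences look and how different the results are. Python is used purely as an illustration; we make no claim here about which reading any particular vendor implements.

```python
import math

# Exponentiation binds tighter than unary minus in Python, which matches the
# native PromQL result of -1 for the query -1 ^ 2 mentioned above.
native_style = -1 ** 2   # parsed as -(1 ** 2)
unary_first = (-1) ** 2  # the other reading, which yields 1

# Two common definitions of the floating-point modulo operator.
truncated = math.fmod(-5.5, 2)  # sign follows the dividend
floored = -5.5 % 2              # sign follows the divisor

print(native_style, unary_first)  # prints -1 1
print(truncated, floored)         # prints -1.5 0.5
```

Both pairs are defensible in isolation, which is why a compatible implementation has to follow the reference behavior rather than what seems natural in the target language.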

In our testing, New Relic received an overall score of 31.05%.

We do want to thank New Relic for contributing support for custom HTTP headers, a TSV output mode, and treating unsupported queries separately in results, to the PromQL Compliance Tester.

Promscale (TimescaleDB)

Last time we tested TimescaleDB PromQL support via its timescale-prometheus adapter. To give this adapter more visibility, Timescale (the company behind TimescaleDB) spun it out into a new project named Promscale. Thus, we tested Promscale v0.1.2 this time around.

We are happy to say that Promscale also received a 100% passing score in our tests.

Thanos

For Thanos, we tested version v0.14.0 last time and the latest version v0.17.0 this time around. In August, Thanos' only issue was the same input timestamp parsing bug that Cortex had. This bug has since been fixed in Thanos as well, so it now passes 100% of all tests without any query tweaks.

VictoriaMetrics

For VictoriaMetrics, we tested version v1.39.1 in our last round, and we are looking at version v1.47.0 this time around. Their documentation site still says "MetricsQL is backwards-compatible with PromQL" before listing known deviations that qualify this statement further down in the document. In our previous testing round we filed multiple issues for discrepancies that we found between Prometheus' and VictoriaMetrics' behavior, some of which have been fixed. Other behavioral differences are intentional in VictoriaMetrics. A summary of what we know has happened since our last round:

The issue we filed about label_replace() not raising an error even for invalid target label names was closed because MetricsQL does support arbitrary characters in label names. That makes sense from the perspective of a language that aims to be a superset of another.
The issue about incorrect sample values in scalar-to-vector filter operations got fixed in VictoriaMetrics v1.39.4.
The issue about transformations that alter the meaning of a series not dropping the metric name first got fixed, but then the fix got reverted again. We do not agree with VictoriaMetrics' interpretation that the metrics have the same meaning after the listed transformations, but independent of anyone's interpretation it is a behavioral difference to native PromQL.
The issue about unary operator precedence was fixed in VictoriaMetrics v1.39.4.

Other than that, most issues remain the same as in our last round of tests. We did discover additional incorrect behavior (that didn't occur with other vendors) when running the compliance tester over a time range that was so far only partially populated with test data (because the ingester hadn't been running for long yet): boolean comparison operators between an instant vector and a scalar (like demo_memory_usage_bytes > bool 1.2345) returned 0-valued samples instead of empty results for resolution steps at which the metric demo_memory_usage_bytes didn't exist. This was also true for time-related functions such as day_of_month(), minute(), and others, when applying an offset to the input data. Given that it is a common real-world situation for query ranges to span beginnings and ends of time series (or intermittent series in general), we believe this is relevant for the overall result.
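The expected native behavior for the boolean comparison case can be sketched as follows. This is a simplified Python model, not engine code: a series is represented as a mapping from step timestamp to sample value, and the function name and sample data are assumptions made for this example.

```python
def greater_bool(series, threshold):
    """Evaluate `series > bool threshold` per resolution step.

    `series` maps step timestamps to sample values; steps at which the
    series has no sample are simply absent from the mapping.
    """
    return {ts: 1.0 if value > threshold else 0.0 for ts, value in series.items()}


# The series only starts to exist at step 20 of a query range covering 0..30.
memory_usage = {20: 2.0, 30: 1.0}
print(greater_bool(memory_usage, 1.2345))  # prints {20: 1.0, 30: 0.0}
```

The result contains 0 only where a sample existed and the comparison was false. Steps 0 and 10 produce no output at all, whereas the behavior described above would return 0-valued samples for them.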

Given this, VictoriaMetrics achieved a 59.43% overall score.

Recommendations for implementers

When studying (and working on) different implementation types of PromQL, a pattern of three levels of increasing difficulty emerges:

Implementations that reuse the native PromQL engine code on top of a compatible database system have by far the easiest and most reliable path towards good compatibility, as the interface to the underlying storage subsystem is relatively simple and easy to replicate.
Implementations that rewrite the PromQL engine from scratch are way more challenging to get right, as there are many subtle (but often important) behaviors encoded into the native implementation.
Implementations that transpile PromQL into an existing query language are the hardest to build. Most other query languages are so different from PromQL that a correct mapping of one language into the other becomes infeasible even with custom-built extensions in the target query language.

Thus, we highly recommend that organizations and developers reuse the native engine code whenever possible.

Future work

Much of the previous round's "Future work" section still applies. For repeatability, it would be especially great to put all the above-mentioned manual testing into code and automate it more. And for anyone who would like to help improve the testing tool, please file issues or pull requests against the PromQL Compliance Tester repository!

Conclusion

There is a lot of movement in the PromQL space. Existing projects and vendors are improving their PromQL implementations, while new ones are entering the space. There is a lot of variability in their quality of PromQL support, and we hope that our ongoing testing efforts help by shining a light on these differences. We believe that compatibility with the native PromQL API is important for the Prometheus ecosystem, and we thus also encourage vendors to work towards becoming fully compliant when using the term "PromQL" to advertise their systems. We hope to see you again soon for another round of testing!

To view all test results, head to our PromQL Compliance Tests page, also available from our "Resources" menu item at the top of the page.

We thank Grafana Labs for supporting our PromQL compatibility testing work with funding. We do our best to provide neutral and objective compatibility testing results, independent of any such support.

Tags: promql, compliance, testing, vendors, cortex, thanos, victoriametrics, promscale, timescaledb, metricfire, newrelic, m3, grafanacloud, chronosphere