Test target | Passing | Cross-cutting issues | More details |
---|---|---|---|
Amazon Managed Service for Prometheus (hosted) | 596 / 596 cases (100.00%) | 1 | Detailed results |
Chronosphere (hosted) | 596 / 596 cases (100.00%) | None | Detailed results |
Cortex (v1.10.0) | 596 / 596 cases (100.00%) | None | Detailed results |
Google Cloud Managed Service for Prometheus (hosted preview) | 517 / 548 cases (94.34%) | None | Detailed results |
Grafana Cloud (hosted) | 548 / 548 cases (100.00%) | None | Detailed results |
M3 (v1.3.0) | 596 / 596 cases (100.00%) | None | Detailed results |
New Relic (hosted) | 162 / 596 cases (27.18%) | 2 | Detailed results |
Promscale (0.6.2) | 596 / 596 cases (100.00%) | None | Detailed results |
Sysdig Monitor (hosted) | 594 / 596 cases (99.66%) | 1 | Detailed results |
Thanos (v0.23.1) | 596 / 596 cases (100.00%) | None | Detailed results |
VictoriaMetrics (v1.67.0) | 442 / 596 cases (74.16%) | 1 | Detailed results |
Time for round three of our PromQL compatibility tests! In two previous blog posts, we compared PromQL correctness across vendors and then provided an update a few months later. Using an open-source compliance checker tool, we evaluated each implementation by running a set of test queries against both the native Prometheus server and the vendor implementation. As a result, we found multiple bugs in the tested projects and mapped out in detail how each deviated from the upstream implementation.
By now, even more vendors have entered the PromQL space and existing projects have evolved. Today, we are presenting an updated set of PromQL test results both for existing projects and a couple of new players. To allow for some leeway, we only tested for features that were available at least two Prometheus minor versions ago (in Prometheus v2.28.0). This is in line with what the official Prometheus Conformance Program requires of vendors to be certified as PromQL-compliant. To be clear, the results in this blog post do not represent official Prometheus Conformance Program results, which will be discussed by Richard "RichiH" Hartmann in his KubeCon Talk later today.
General notes on test scores and vendor inclusion
The general notes on interpreting test scores and on the inclusion of vendors from the last testing round still apply. Please note especially that the numeric test scores (while good to have) are not on their own a great indicator of the severity of test failures. Manual interpretation is needed to understand the actual impact, although full Prometheus compatibility requires a 100% test score without any cross-cutting issues in any case.
Updates to the test query set
Since the last testing round, we added some test cases to the suite, both for missing old features and for newly added PromQL features:
- Tests for the `absent()`, `last_over_time()`, `absent_over_time()`, `clamp()`, and `sgn()` functions.
- Tests for the `count_values()` aggregator.
- Tests for composite duration literals (e.g. `1h5m15s`).
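To illustrate what composite duration literals denote, here is a small sketch (an illustrative helper, not Prometheus's actual parser, and covering only a subset of the supported units) that resolves a literal like `1h5m15s` into seconds:

```python
import re

# Illustrative helper, not Prometheus's actual parser: resolve a
# composite duration literal such as "1h5m15s" into total seconds.
# Only a subset of the units Prometheus supports is handled here.
UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1, "ms": 0.001}

def duration_seconds(literal):
    total = 0.0
    # "ms" must be tried before the single-letter units.
    for value, unit in re.findall(r"(\d+)(ms|[dhms])", literal):
        total += int(value) * UNITS[unit]
    return total

print(duration_seconds("1h5m15s"))  # 3915.0
print(duration_seconds("10m15s"))   # 615.0
```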
We did not yet add any tests for experimental PromQL features that need to be enabled via a feature flag in Prometheus (like negative offsets or the `@` modifier), as we don't consider these features an official part of the language until they become stable and enabled by default.
Let's get started: Updated comparisons
In this round of tests, we took a look at the following projects and vendors, in alphabetical order:
- Amazon Managed Service for Prometheus - A hosted Prometheus-style service by AWS.
- Chronosphere - A hosted monitoring and observability platform.
- Cortex - An open-source, horizontally scalable reimplementation of Prometheus.
- Google Cloud Managed Service for Prometheus - A hosted (preview) Prometheus-style service by Google.
- Grafana Cloud - A hosted monitoring and observability service.
- M3 - An open-source metrics engine and time series database.
- New Relic - A hosted monitoring and observability service.
- Promscale (TimescaleDB) - An open-source project to store Prometheus data in TimescaleDB.
- Sysdig Monitor - A hosted monitoring and observability platform.
- Thanos - An open-source project to provide query aggregation for long-term storage and HA on top of Prometheus.
- VictoriaMetrics - An open-source time-series database and monitoring system.
- Wavefront by VMware - A hosted monitoring and observability platform.
For Wavefront, we are only publishing a preliminary look and no full test scores for now, as we didn't manage to get access to a full and up-to-date test environment in time for this blog post.
Note that we are no longer including MetricFire in our tests, as they have discontinued their Prometheus service offering for the time being.
Let's look at the test results for each of these systems:
Amazon Managed Service for Prometheus
As a new entrant to the Prometheus space, AWS launched their Cortex-based Amazon Managed Service for Prometheus (AMP) shortly after our last round of tests. While testing, we noticed that AMP modifies incoming query start and end timestamps to align them to a multiple of the resolution step to achieve better cacheability (as Grafana Cloud initially did). Unfortunately, this slightly modifies the query semantics, requiring a query tweak as a workaround for further comparisons. The AMP team has been made aware of this issue, and we hope they will deactivate the timestamp alignment in the future to achieve full compatibility.
With a query tweak applied to only send step-aligned test queries, AMP achieved a test score of 100%.
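The alignment behavior that required this tweak can be sketched in a few lines (an illustrative Python helper, not AMP's actual code, assuming Unix timestamps in seconds):

```python
# Sketch of resolution-step alignment as described above (illustrative,
# not AMP's actual code): snap a range query's start and end timestamps
# down to the nearest multiple of the resolution step.
def align_to_step(start, end, step):
    aligned_start = start - (start % step)
    aligned_end = end - (end % step)
    return aligned_start, aligned_end

# A query over [103, 187] at a 15s step becomes [90, 180] - slightly
# different evaluation timestamps, and thus slightly different results.
print(align_to_step(103, 187, 15))  # (90, 180)
```

This is why the tester's workaround is to only send queries whose timestamps are already step-aligned: for those, the alignment is a no-op and both systems evaluate at identical timestamps.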
Chronosphere
Similar to last time, Chronosphere provided us with a test account to their service and again achieved a test score of 100% without requiring any query tweaks for cross-cutting issues.
Cortex
This time around we opted to ignore the deprecated chunks storage mode in Cortex and only tested the blocks storage mode. Cortex again received a test score of 100% without any cross-cutting issues for this case.
Google Managed Service for Prometheus
Google just launched a preview of its Google Cloud Managed Service for Prometheus (GMP) a few days ago, and we already managed to get early access. According to contacts at Google, the service is internally backed by Google's Monarch monitoring system. It mostly uses Prometheus's own query engine code to compute PromQL results, but it replaces certain PromQL query subtrees entirely with native Monarch evaluations (such as `rate()` and `*_over_time()` function computations, as well as dimensional aggregation operators). To ingest predictable test data into the service from a local machine, we used Google's Prometheus fork, as GMP does not support Prometheus's native `remote_write` protocol for ingestion yet.
When running the tester against Google's service, we observed a number of issues:
- Staleness handling is not yet supported, so metrics that abruptly stop being exposed in Prometheus still get returned for a period of 5 minutes in GMP. According to Google, staleness handling support is planned for 2022.
- The calculation of the `rate()` function is entirely delegated to Monarch's own rate calculation implementation, which produces slightly different results.
- Some newer PromQL features (like the `group()` aggregator, newer functions like `sgn()`, `clamp()`, etc., and composite durations) are not supported yet. This should be fixed "automatically" as Google updates its internal PromQL engine dependency.
- The `last_over_time()` function drops the metric name in GMP, but doesn't in Prometheus.
- Initially, some queries failed randomly with the error `parse error: unknown node type: *parser.MatrixSelector`, but Google remedied this problem by taking some internal infrastructure measures.
- ...and a few other minor query differences.
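The staleness difference can be illustrated with a toy instant-query evaluator (an illustrative sketch, not either system's actual code): without staleness markers, a query simply returns the most recent sample within the 5-minute lookback window, so a series that stops being scraped still appears in results for up to 5 minutes:

```python
# Toy instant-query evaluator (illustrative, not GMP's or Prometheus's
# actual code): without staleness markers, a query at time t returns
# the most recent sample within the 5-minute lookback window, so a
# series whose last sample is at t=600 still "exists" at t=880.
LOOKBACK = 300  # 5-minute lookback window, in seconds

def eval_instant(samples, t):
    """samples: list of (timestamp, value), sorted by timestamp."""
    in_window = [(ts, v) for ts, v in samples if t - LOOKBACK < ts <= t]
    return in_window[-1][1] if in_window else None

series = [(0, 1.0), (300, 2.0), (600, 3.0)]  # series stops at t=600
print(eval_instant(series, 880))  # 3.0  - stale value still returned
print(eval_instant(series, 901))  # None - outside the 5m lookback
```

A Prometheus server with staleness markers would instead drop the series from query results immediately after it disappears from the scraped target.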
Overall, Google's GMP achieved a test score of 94.34% without any query tweaks.
Grafana Cloud
Similar to last time, the Cortex-based Grafana Cloud service still initially aligned the query start and end timestamps to the resolution step to achieve better cacheability of queries and required a query tweak to work around this behavioral difference. However, Grafana Labs corrected this issue shortly before we published this blog post. After that fix, we got a test score of 100% for Grafana Cloud without any cross-cutting issues.
M3
For M3, we tested version v1.3.0 this time. Similar to last time, M3 again received a test score of 100% without any cross-cutting issues.
Note: For PromQL to work 100% the same as in Prometheus, we had to ensure that our test queries were only hitting raw, non-aggregated data in M3 (vs. data that has been downsampled into a lower resolution). We did this by not configuring any aggregation in our test M3 database.
New Relic
We tested New Relic at a score of 31.05% last time, and the same issues discussed then still appear to apply today. Additionally, a few more queries are now failing due to our expansion of the test query set:
- Composite duration strings like `10m15s` are not yet supported by New Relic.
- The `group()` aggregator, the `sgn()` and `clamp()` functions, and several other new features are not supported yet.
In our testing, New Relic received an overall score of 27.18% this time.
Promscale (TimescaleDB)
This time we tested Promscale 0.6.2. Initially the tests uncovered a bug in the metrics caching layer of Promscale 0.6.1 that was promptly fixed by the Timescale team in Promscale 0.6.2. After this fix, Promscale again received a score of 100% without any cross-cutting issues.
Sysdig Monitor
Sysdig Monitor is a cloud-based monitoring service that advertises PromQL support. Since the last round of PromQL tests, Sysdig has added support for ingesting data via the Prometheus `remote_write` protocol, which now allows us to test Sysdig's PromQL implementation with comparable reference data and include it in the official results. Note that we have been working with the Sysdig team on the evaluation and improvement of these test results, and they are sponsoring this part of the blog post as well (thanks!).
The PromQL tests uncovered a few compatibility issues with Sysdig Monitor:
Timestamp alignment
Sysdig's time series database currently only supports storing samples in a fixed 10-second grid for external agents like Prometheus, whereas Prometheus itself usually scrapes and stores samples at arbitrary unaligned timestamps with millisecond precision. To deal with this mismatch in expectations, Sysdig aligns all samples coming from Prometheus to this 10-second grid and discards any duplicate samples within an interval. To factor out the cross-cutting query differences resulting from this limitation, we patched the ingesting reference Prometheus server to align ingested samples to the same grid as Sysdig. We also added a query tweak in the tester tool to align query start and end timestamps to their 10-second resolution step. The Sysdig team is aware of this issue and is working towards providing more flexible sample timestamp support in the future.
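The grid alignment and duplicate discarding described above can be sketched as follows (an illustrative helper, not Sysdig's actual code):

```python
# Sketch of the 10-second grid alignment described above (illustrative,
# not Sysdig's actual code): snap each sample timestamp down to the
# grid and keep only the first sample that lands in each grid slot.
GRID = 10  # seconds

def align_samples(samples):
    """samples: list of (timestamp, value) pairs in scrape order."""
    seen = set()
    aligned = []
    for ts, value in samples:
        slot = int(ts // GRID) * GRID
        if slot not in seen:  # discard duplicates within one interval
            seen.add(slot)
            aligned.append((slot, value))
    return aligned

# Two scrapes ~5s apart fall into the same slot; the second is dropped.
print(align_samples([(103.2, 1.0), (107.9, 2.0), (113.4, 3.0)]))
# [(100, 1.0), (110, 3.0)]
```

Applying the same transformation to the reference Prometheus server's ingested samples (via the patch mentioned above) factors this difference out of the comparison.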
Fixed bug around inequality matching
The tests originally found a bug in the handling of not-equal label matchers (`!=`) that caused missing output series, which the Sysdig team promptly fixed.
Fixed bug around count_values() aggregation
The tests originally found a bug in the `count_values()` aggregator implementation, which caused the aggregator to act as if all input series had the same sample value of `1` (vs. many different ones). This was quickly identified and fixed by the Sysdig team.
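For reference, `count_values()` counts how many input series carry each distinct sample value. A minimal sketch of the intended semantics (illustrative only, ignoring the original series labels that the real aggregator also handles):

```python
from collections import Counter

# Minimal sketch of count_values() semantics (illustrative, not the
# actual engine code): count how many input series carry each distinct
# sample value, keyed by a new label holding that value as a string.
def count_values(label, values):
    return {f'{{{label}="{v}"}}': n for v, n in Counter(values).items()}

# Three series report version "1.6", one reports "1.7":
print(count_values("version", ["1.6", "1.6", "1.6", "1.7"]))
# {'{version="1.6"}': 3, '{version="1.7"}': 1}
```

With the bug described above, every input value would have been read as `1`, collapsing all series into a single output bucket regardless of their actual sample values.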
Slight modulo operator deviation
Lastly, and more mysteriously, a few queries that involve the modulo operator (`%`) produce minuscule floating-point differences compared to Prometheus. Neither we nor the Sysdig team have found the reason for this deviation yet, but slight differences in floating-point modulo implementations on different processors could be one possibility. Hopefully the underlying cause can be identified soon, although it is unlikely to affect real-world use cases.
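As one concrete illustration of how modulo implementations can diverge (this is not a confirmed diagnosis of the deviation above): different modulo conventions disagree for negative operands, and any convention inherits ordinary floating-point rounding error:

```python
import math

# Illustration only (not a confirmed diagnosis of the deviation above):
# two common modulo conventions disagree for negative operands.
# Prometheus's `%` follows Go's math.Mod (truncated division, like C's
# fmod), while e.g. Python's native `%` uses floored division.
print(math.fmod(-5.0, 3.0))  # -2.0 (truncated: sign of the dividend)
print(-5.0 % 3.0)            #  1.0 (floored: sign of the divisor)

# Even within one convention, results carry float rounding error:
print(math.fmod(0.7, 0.1))   # very close to 0.1, not exactly 0.0
```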
After applying the above-mentioned workarounds for timestamp handling, Sysdig Monitor achieved a test score of 99.66%.
Thanos
For Thanos, we tested version v0.23.1 this time, which again passed 100% of all tests without any query tweaks.
VictoriaMetrics
For VictoriaMetrics, we tested version v1.67.0 this time. Compared to the last round of tests, VictoriaMetrics seems to have fixed several issues in the meantime:
- Scalar literals in hexadecimal format (e.g. `0x3d`) are now parsed correctly.
- The `quantile()` aggregator now returns exactly the same values for cases where it returned slightly different values before.
- Boolean comparison operators between an instant vector and a scalar (like `demo_memory_usage_bytes > bool 1.2345`) no longer return samples for resolution steps at which the metric does not exist.
- And possibly more...
In a recent Medium post by VictoriaMetrics co-founder Roman Khavronenko, VictoriaMetrics also lists differences in their MetricsQL implementation and announced their intention to never be fully PromQL-compatible ("However, VictoriaMetrics is not 100% compatible with PromQL and never will be."). The article goes on to argue that the various behaviors in MetricsQL are preferable to PromQL's, a point that would likely not find complete agreement within the Prometheus Team. However, debating language design decisions is beside the point in this case, as we are looking solely at compatibility here - VictoriaMetrics still positions itself as a "drop-in replacement for Prometheus", and the MetricsQL description still says "MetricsQL is backwards-compatible with PromQL" before going on to list incompatibilities for the more attentive readers.
In practice, we have repeatedly encountered users who did not read the fine print and had already adopted VictoriaMetrics, only to be surprised later on when certain PromQL functionality would not work as expected. Incompatibility also means that ecosystem resources stop working for affected users: PromQL proxies that parse and validate queries, PromQL editor language support, rule validation tools, and other software that works with and depends on exact language features. We thus encourage VictoriaMetrics to either create a PromQL-compatible language dialect or remove marketing language that causes casual readers to believe that it is both Prometheus- and PromQL-compatible. The same statement holds true for other vendors - VictoriaMetrics has just taken the most aggressive stance on intentional incompatibility so far, while at the same time capitalizing on the success of the Prometheus project, using the project's channels for (sometimes misleading) marketing, and brushing off any potential concerns relating to language features that work differently in MetricsQL.
Despite the above thoughts, VictoriaMetrics achieved a 74.16% overall score in our tests this time, which is an improvement over the last round.
Wavefront by VMware
VMware's Wavefront service advertises support for PromQL: "Wavefront supports both PromQL and WQL (Wavefront Query Language) queries". According to contacts at VMware, this PromQL support is being built on top of Wavefront's existing query language, so it falls into the transpilation category (similar to New Relic's approach). At the time of this writing, we were only able to get access to an internal test setup running at VMware that ingests reference data both into a Prometheus server and into Wavefront. However, the reference Prometheus server was still using Prometheus version v2.12.0 from 2019, so we were unable to run the current set of tests against it for comparisons. For this reason, we only ran an improvised set of old test cases (similar to the test cases from our previous blog posts) against the Wavefront setup and got a preliminary score of 65.22%, with various queries behaving differently (some in minor, others in major ways). We're in contact with engineers at VMware and may publish more detailed test results for Wavefront in the future.
Future work
We single-handedly tested 12 different projects and vendors in this round, which frequently required a lot of back and forth to clarify and fix issues around test setups and compatibility quirks. While the results are valuable, we do not believe that this is a sustainable approach to assessing vendor PromQL compatibility in the future. The Prometheus project has recently announced a Prometheus Conformance Program that will allow projects and vendors to self-certify their solutions according to a set of guidelines. Within the Prometheus Team, we intend to allow (and require) automated testing where possible over time. We hope that the economic incentives around certified compatibility lead to more vendors assigning resources to improving the test suite as well. At PromLabs we plan to support this work and hope to make it as easy as possible for vendors to self-certify soon.
Conclusion
A tweet by Jaana Dogan summarizes the current state of PromQL adoption across the industry well:
"PostgreSQL for relational. PromQL for monitoring. Two big alignments across the industry."
The monitoring world is largely converging on PromQL as the standard for time-series based monitoring. Even more vendors are offering PromQL-style services now than last time, with even two of the major cloud providers (AWS and Google) throwing their hat into the ring. The more PromQL becomes the "lingua franca" of monitoring, the more important interoperability becomes, and users will come to expect interchangeable services and portability. As we have seen, there is still a lot of variability in the quality of PromQL support between vendors, and often unfortunately a big gap between marketing claims and reality. We hope that our ongoing testing efforts help by shining a light on these differences. We believe that compatibility with the native PromQL API is important for the Prometheus ecosystem, and we thus also encourage vendors to work towards becoming fully compliant when using the term "PromQL" to advertise their systems.
To view all test results, head to our PromQL Compliance Tests page, also available from our "Resources" menu item at the top of the page.
At this point, we would like to thank our ongoing sponsors Timescale and Schwarz IT for supporting PromLabs' general Prometheus community and open-source work. While Timescale (via Promscale) is also a vendor that is being evaluated in the tests above, we do our best to provide neutral and objective compatibility testing results, independent of any such support.