The Meaning of "Prometheus" - A Tale of Implementations and Interfaces

October 13, 2020 by Julius Volz

What people mean when they say that they "use Prometheus" or "support Prometheus" is evolving over time. Initially, the Prometheus landscape was very focused on a specific set of component implementations, whereas nowadays the focus is shifting more towards component interfaces and interoperability. Let's have a closer look at what this means, why this is happening, and in which ways this may be good or bad.

Prometheus as an implementation

When the Prometheus Team initially announced Prometheus to the world in 2015, we focused a lot on specific implementations of servers and libraries that together formed a Prometheus deployment. You could draw an architecture diagram of a fairly complete setup in this way:

Prometheus component implementations

Initially all boxes in this diagram would be using binaries or at least client libraries provided by the official Prometheus project:

Service targets were using one of the official Prometheus client libraries to track and expose metrics.
The official Prometheus server was collecting and processing all data.
Querying via PromQL would always happen via the Prometheus server as well.
The official Alertmanager server was processing all alerts and sending notifications.
There was no remote storage integration yet, but it is outlined above because we will be speaking about it later.

In general, people focused a lot on the Prometheus server's architecture, its TSDB implementation and efficiency, the operational simplicity and Go-based implementation, service-discovery integration to enable monitoring of dynamic environments, how to achieve highly available deployment setups with the vanilla Prometheus server, and so on. The same was true for the Alertmanager and other core components.

Prometheus as a set of interfaces

Fast forward to 2020. When people say that they "do Prometheus", it is harder to determine exactly what they mean, since the focus of the term has shifted away from the boxes in the diagram and towards the set of interfaces connecting them:

Prometheus interfaces

It is clear that any box in the diagram can be replaced by a different implementation, as long as the boxes still speak the same language. Let's look at what this means for the various components and interfaces:

Exposition

For exposing metrics to a Prometheus server, the only thing that really matters is to implement an HTTP endpoint that serves the Prometheus metrics exposition format:

Prometheus exposition interface

While the official Prometheus client libraries are still the most popular way to output Prometheus-compatible metrics from a process, there are now many third-party libraries as well, or even people writing ad-hoc custom serialization code, since the format is so simple (and simplicity of generation has always been the intention!). A slight evolution of this metrics format is also being standardized as part of the OpenMetrics project, which the Prometheus server can already scrape, and which OpenTelemetry will support as an output format.
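To illustrate how little is needed for ad-hoc serialization, here is a minimal sketch in Python that renders one metric family in the text exposition format. The function names (`render_metric`, `escape_label_value`) are hypothetical, and the sketch assumes that the help text contains no backslashes or newlines. For anything beyond a quick experiment, an official client library is the safer choice.

```python
def escape_label_value(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    # Backslashes go first so that later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, metric_type: str, help_text: str, samples) -> str:
    """Render one metric family as exposition text.

    `samples` is a list of (labels_dict, value) pairs.
    Assumption: `help_text` needs no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line-based; end the output with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_metric(
            "http_requests_total",
            "counter",
            "Total HTTP requests.",
            [({"method": "get", "code": "200"}, 1027)],
        ),
        end="",
    )
```

Serving the returned string from an HTTP endpoint (conventionally `/metrics`) is all a target needs to do to be scrapeable.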

Metrics collection

On the other end of that same interface, the Prometheus server usually collects metrics from the scraped target:

Prometheus exposition interface

There are many third-party solutions that support collection of metrics in this way, as the format is simple and there are multiple available parsers.
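The collecting side can be sketched just as briefly. The following Python snippet is a deliberately simplified parser for sample lines, not a complete implementation of the format: it skips comment lines, ignores optional timestamps, and does not decode escape sequences in label values. The name `parse_samples` and both regular expressions are assumptions made for this illustration; production code should use one of the existing parsers.

```python
import re

# Simplified patterns, assumed for this sketch only.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, raw_labels, value, _timestamp = match.groups()
        # Label values are kept raw; escape sequences are not decoded here.
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    scraped = 'up 1\nhttp_requests_total{code="200",method="get"} 1027\n'
    for sample in parse_samples(scraped):
        print(sample)
```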

Believe it or not, some "Prometheus"-style setups these days don't even include a Prometheus server, but just run an agent that scrapes metrics in the Prometheus format and forwards them to a remote storage system. One prominent example is the Grafana Agent that collects and forwards metrics using the remote write protocol (see further down), without even storing any metrics locally. The same is true for agents of some other monitoring SaaS providers, like the Sysdig Agent and the Datadog Agent. Unfortunately, one benefit of the Prometheus server that these scraping alternatives don't always retain is Prometheus's excellent integration with service discovery and the detailed labeling of targets that it enables.

PromQL

The Prometheus Query Language (PromQL) is the core piece of Prometheus that enables ad-hoc diagnostics, graphing, and alerting, all using a single unified language. It is usually served on a set of HTTP endpoints in the Prometheus server:

Prometheus PromQL interface

Many open-source projects and vendors now say that they support PromQL in their custom storage systems, while the actual level of compatibility varies. We previously looked at the PromQL compatibility of different implementations in more detail in another blog post and will share a follow-up post soon. PromQL compatibility is important because PromQL is the primary way of working with data in Prometheus. Most importantly, if alerting rules break in undetected ways due to an incompatible implementation, the cost to the user may be significant.

PromQL is by far the largest and trickiest interface to reimplement from scratch. Not only does it consist of many different functions, operators, and syntactical constructs, but it also has a lot of subtle execution behaviors that were either carefully designed or got introduced by historical accident. In practice we have seen the highest level of compatibility in projects that reuse the upstream Prometheus PromQL engine code and only replace the storage integration.
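From a client's point of view, the PromQL interface is a handful of HTTP endpoints that return JSON. The sketch below builds a URL for an instant query against `/api/v1/query` and parses an instant-vector response. The helper names and the base URL `http://localhost:9090` are assumptions for illustration, and the sample response is hand-written rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, time=None) -> str:
    """Build the URL for an instant query; `time` is an optional Unix timestamp."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def parse_instant_vector(body: str):
    """Return a list of (labels_dict, float_value) from a JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample carries its value as a [timestamp, "string value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", "up == 0"))
    canned = (
        '{"status":"success","data":{"resultType":"vector","result":'
        '[{"metric":{"__name__":"up","job":"api"},"value":[1602547200,"1"]}]}}'
    )
    print(parse_instant_vector(canned))
```

A third-party system can accept this request shape and return this response shape while still evaluating the query differently, which is why the compatibility concerns above matter.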

Remote write and remote storage

The remote write protocol allows Prometheus to forward scraped data to a remote storage system:

Prometheus remote write interface

Even more intentionally than for the other interfaces, the purpose of this protocol has always been to enable a large ecosystem of third-party implementations accepting data on the remote side. Those third-party implementations may then choose different tradeoffs for the storage and processing of metrics, especially providing better durability and horizontal scalability than the vanilla Prometheus server.

By now there are many systems accepting the remote write protocol (and some also the remote read protocol, a way to read back samples via Prometheus). Cortex and Promscale are two examples, with a larger list documented on the Prometheus website.

The remote write protocol has probably been the biggest enabler for commercial SaaS providers to support Prometheus features, as it allows ingesting on-premises data into a vendor's storage cloud.
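To give a feel for what travels over the remote write interface, here is a rough Python sketch of the logical data shape only: a batch of time series, each with a sorted label set (including the metric name as the `__name__` label) and timestamped samples. To the best of our knowledge, the real protocol encodes this as a snappy-compressed protobuf message sent via HTTP POST; this sketch omits the encoding entirely. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TimeSeries:
    # (label name, label value) pairs, sorted by label name.
    labels: List[Tuple[str, str]]
    # (value, timestamp in milliseconds since the Unix epoch) pairs.
    samples: List[Tuple[float, int]]


def build_write_request(
    scraped: List[Tuple[str, Dict[str, str], float]], timestamp_ms: int
) -> List[TimeSeries]:
    """Turn scraped (name, labels, value) tuples into a batch of series."""
    batch = []
    for name, labels, value in scraped:
        all_labels = dict(labels)
        # The metric name is carried as an ordinary label.
        all_labels["__name__"] = name
        batch.append(
            TimeSeries(
                labels=sorted(all_labels.items()),
                samples=[(value, timestamp_ms)],
            )
        )
    return batch


if __name__ == "__main__":
    scraped = [("up", {"job": "api", "instance": "a:1"}, 1.0)]
    for series in build_write_request(scraped, 1602547200000):
        print(series)
```

Because the receiving side only needs to accept batches of this shape, it is free to choose its own storage engine, replication, and sharding strategy.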

Alert dispatching

When Prometheus determines that an alert should fire, it sends an ongoing stream of notifications about this to the Alertmanager:

Prometheus alerting interface

This interface is not reimplemented much (although it has been documented), and even the few replacements for Alertmanager tend not to be full reimplementations. This is because the protocol is fairly low-level and puts a large burden on the receiver implementation: alerts are sent repeatedly as long as they are still firing, and the receiver (normally Alertmanager) carries the responsibility of aggregating the raw alert stream over time and across label sets, and then routing notifications to their destinations. Usually the Prometheus Team does not encourage full reimplementations of this protocol, especially on the receiving side.
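The burden on the receiver can be illustrated with a toy example. The Python sketch below deduplicates repeatedly sent alerts by their label set and groups them by a chosen list of labels. It is an illustration of the problem, not Alertmanager's actual algorithm: it has no expiry of resolved alerts, no inhibition, no silencing, and no routing. The class name `ToyAlertReceiver` and the simplified alert dictionaries are assumptions for this example.

```python
from collections import defaultdict


def fingerprint(labels: dict) -> tuple:
    # Two alerts with the same label set are treated as the same alert.
    return tuple(sorted(labels.items()))


class ToyAlertReceiver:
    def __init__(self, group_by):
        self.group_by = list(group_by)
        self.active = {}  # fingerprint -> most recently received alert

    def receive(self, alerts):
        """Accept one batch; repeated sends overwrite the earlier copy."""
        for alert in alerts:
            self.active[fingerprint(alert["labels"])] = alert

    def groups(self):
        """Group active alerts by the values of the `group_by` labels."""
        grouped = defaultdict(list)
        for alert in self.active.values():
            key = tuple(alert["labels"].get(name, "") for name in self.group_by)
            grouped[key].append(alert)
        return dict(grouped)


if __name__ == "__main__":
    receiver = ToyAlertReceiver(group_by=["alertname"])
    a = {"labels": {"alertname": "HighLatency", "instance": "a:1"}, "annotations": {}}
    b = {"labels": {"alertname": "HighLatency", "instance": "b:1"}, "annotations": {}}
    receiver.receive([a])
    receiver.receive([a, b])  # `a` is sent again while it keeps firing
    for key, members in receiver.groups().items():
        print(key, len(members))
```

Even this toy version has to keep state across requests, which hints at why a complete receiver is substantial work.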

Other interfaces

Besides the interfaces discussed above, there are a number of other integration points to be aware of:

The Alertmanager webhook receiver allows building custom notification mechanisms that are not supported out of the box, and there are a myriad of integrations for it.
Alerting rule and recording rule files and the evaluation of the rules contained therein are reimplemented in some third-party systems like Cortex.
The file-based service discovery mechanism in Prometheus allows implementing external custom service discovery mechanisms.
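As an example of how small such an integration can be, the following Python sketch turns a made-up host inventory into the list of target groups that file-based service discovery reads from a JSON file. The inventory structure, the `env` label, and the function name are hypothetical; only the `targets` and `labels` keys of each group belong to the file format.

```python
import json


def build_target_groups(inventory):
    """Group hypothetical inventory entries by environment.

    Each entry is a dict with "host", "port", and "env" keys (assumed shape).
    """
    by_env = {}
    for entry in inventory:
        address = f"{entry['host']}:{entry['port']}"
        by_env.setdefault(entry["env"], []).append(address)
    # One target group per environment, with the environment as a label.
    return [
        {"targets": sorted(targets), "labels": {"env": env}}
        for env, targets in sorted(by_env.items())
    ]


if __name__ == "__main__":
    inventory = [
        {"host": "web-2", "port": 9100, "env": "prod"},
        {"host": "web-1", "port": 9100, "env": "prod"},
        {"host": "dev-1", "port": 9100, "env": "dev"},
    ]
    print(json.dumps(build_target_groups(inventory), indent=2))
```

Writing this output to a file that a `file_sd_configs` entry in the Prometheus configuration points at is enough for the server to pick up the targets, and rewriting the file updates them.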

We won't discuss these interfaces in detail here, but they show more flexible ways for people to openly integrate with Prometheus.

Why is this happening?

The reorientation around interfaces and their implementation in third-party systems is happening for a variety of reasons. The most obvious ones are:

People are looking for different tradeoffs than the ones in Prometheus itself. For example, Cortex aims to be a horizontally scalable clustered storage system, whereas a vanilla Prometheus server explicitly aims to be simple and does not support clustering.
People need to integrate with existing systems, and it's often not possible to replace all parts of an existing monitoring infrastructure at once.
Existing monitoring vendors want a piece of the cake, and in a world in which Prometheus is the de facto standard for open-source monitoring, offering Prometheus support is immensely valuable to companies, both from a marketing perspective and for their customers.

The above reasons also often mean that only some of the boxes in the architecture diagram get replaced, while others remain vanilla upstream Prometheus implementations. Thus the official Prometheus components are likely still the most popular overall.

What do you need to watch out for?

It is encouraging to see such broad adoption of Prometheus's interfaces across the entire monitoring world. At the same time, it does mean that the landscape has become more challenging to understand and to keep honest:

When an engineer or marketing person claims "Prometheus support" for their system, ask them which of these interfaces they actually mean, and whether you are happy with the specific tradeoffs that their compatible reimplementation is choosing.
When someone claims specific support for an interface like PromQL, how compatible are they really?
When choosing alternative interface implementations, you will generally need to do more research about what options are available to you.

Besides these words of caution, we think that it is a huge net win that people are creating a vibrant ecosystem around Prometheus's open interfaces! We should just aim to maintain as much clarity and compatibility as possible.

Conclusion

In recent years, the focus has shifted from a specific set of Prometheus component implementations to the open interfaces that connect them. This has enabled all kinds of flexible re-combinations of monitoring architectures with different tradeoffs and better integration into existing ecosystems. Hopefully this will also lead to greater longevity of those interfaces, as they are now becoming a cornerstone of the world's monitoring infrastructure. Just keep in mind that this open landscape will also require a bit more research by users as to which specific technologies with (claimed) Prometheus support they should use, and when. In our role as part of the Prometheus Team, we are the shepherds of the Prometheus ecosystem and need to give more guidance on this, both to avoid overly broad marketing claims and to support the projects and companies actively wanting to support Prometheus-style monitoring. In our role as PromLabs, we are happy to help customers navigate this space and aid vendors in building correct implementations of these interfaces.

Tags: prometheus, promql