If you use Prometheus for mission-critical monitoring and alerting, you will want to make sure that your Prometheus stack is just as reliable as the systems and services it is monitoring. Let's take a look at the main strategies for achieving high availability for Prometheus-based monitoring setups.
Making a Prometheus server highly available for monitoring and alerting is simple: just run multiple Prometheus server instances with an identical configuration. Each replica then scrapes the same metrics, computes the same alerting rules, and forwards exactly the same alerts to the Alertmanager. The Alertmanager can then deduplicate the alerts based on their identical label sets and only send you a single notification.
If you implement this strategy, just make sure that your Prometheus server instances run on separate machines, or that they are otherwise separated enough to satisfy your availability requirements. You can run as many replicas as you want, but most people are happy with two or three.
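As a minimal sketch (paths are placeholders), each replica is simply started with the exact same configuration file on a separate machine:

```
# Run on machine A and, identically, on machine B:
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
```

Since the replicas never coordinate with each other, there is no clustering to configure on the Prometheus side itself.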
If you need to distinguish the replicas of a set when talking to external systems (like a remote storage system), you can give them different external replica label values like this:
```yaml
global:
  external_labels:
    replica: A  # "B" for the second replica, and so on.
```
However, all alerts sent to the Alertmanager will then also have different replica labels. Since the Alertmanager dedupes alerts based on identical label sets, this deduplication will now break and you will get as many notifications as you have Prometheus server replicas! To avoid this, make sure that you drop the replica label on the alerting path using alert relabeling:
```yaml
alerting:
  alert_relabel_configs:
    # Drop the "replica" label.
    - action: labeldrop
      regex: replica
```
Now the same alerts from different replicas in the set will look identical again and can be deduped.
Prometheus relies on the Alertmanager for routing, grouping, and throttling alerts. To ensure high availability for the Alertmanager as well, you can run multiple identical Alertmanager replicas in a clustered mode. After configuring all your Prometheus replicas to each send alerts to all Alertmanager replicas, each Alertmanager replica uses a gossip-based protocol to replicate information about already-sent notifications to the other replicas in the cluster.
Now each replica computes on its own which notifications should be generated for the incoming alerts. However, when a replica sees that another replica has already sent out a notification for a given alert grouping, it will not send another one. This ensures that each notification is only sent once. To account for non-zero replication latency between the Alertmanager replicas, the replicas also establish an ordering (replica 1, replica 2, replica 3, ...) among each other and will wait a corresponding multiple of 15 seconds (0s, 15s, 30s, ...) before sending out a given notification. This ensures that if one replica is down or fails to communicate with the other replicas, the next replica will still send out the notification after 15 seconds.
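This waiting behavior can be sketched as a simple function of a replica's position in the cluster-wide ordering (15 seconds is the default peer timeout; the helper below is an illustrative assumption, not Alertmanager code):

```python
# Sketch of the Alertmanager HA wait: the replica at position n waits
# n * 15s before notifying, so gossip from a lower-positioned replica
# that already sent the notification can arrive and suppress a duplicate.

PEER_TIMEOUT_SECONDS = 15  # assumed default peer timeout

def notification_delay(position: int) -> int:
    """Seconds this replica waits before sending; position 0 sends immediately."""
    return position * PEER_TIMEOUT_SECONDS

for position in range(3):
    print(f"replica {position + 1} waits {notification_delay(position)}s")
```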
Note that in the worst case, all Alertmanager replicas could be up and mostly working, but not be able to reach each other anymore for some reason (e.g. when the gossip protocol is misconfigured). You can then get as many duplicate notifications as you have replicas. This is a "fail-open" tradeoff that is made to ensure that the Alertmanager can still send out notifications even if the network between the replicas is down.
You can configure Alertmanager clustering using the --cluster.peer flags. If you want to see a hands-on example of how to set up an Alertmanager cluster, check out our High Availability for Monitoring and Alerting training.
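As a sketch (the hostnames are placeholders), one replica of a three-node Alertmanager cluster could be started like this, with each replica listing the others as peers:

```shell
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2.example.com:9094 \
  --cluster.peer=alertmanager-3.example.com:9094
```

Each Prometheus replica would then list all three Alertmanager replicas as alerting targets in its configuration.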
Of course, there is much more to say about creating fully redundant and highly available Prometheus monitoring setups. Here are some other HA-related topics you may want to look into:
External / remote storage systems: Using remote storage integrations, Prometheus can send its collected metrics to a remote storage system such as Cortex, Thanos, or M3. These systems are designed for long-term storage, horizontal scalability, and high availability. By integrating Prometheus with remote storage systems, you can ensure the continuous availability of your monitoring data, even in the case of a Prometheus server outage.
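For example, shipping samples to a remote store only requires a remote_write block in the Prometheus configuration (the URL and credentials below are placeholders for your Cortex, Thanos, or M3 endpoint):

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write
    # Optional: credentials for the remote endpoint (placeholder values).
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote-write-password
```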
You can learn more about using external storage systems in our Integrating with Remote Storage Systems training.
Monitoring Prometheus itself: In addition to implementing redundancy and failover strategies, it's crucial to monitor your Prometheus infrastructure for any potential issues. By monitoring Prometheus itself, you can detect and resolve problems before they impact the availability of your monitoring system. You can achieve this by setting up alerting rules and monitoring dashboards for key metrics such as instance uptime, query performance, and resource utilization.
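As an illustrative sketch (the job name, duration, and severity are assumptions), an alerting rule that fires when a Prometheus replica disappears could look like this:

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Fires when a Prometheus instance has been unreachable for 5 minutes.
      # Assumes a meta-monitoring Prometheus scrapes a job named "prometheus".
      - alert: PrometheusInstanceDown
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
```

Ideally, such rules run on a separate meta-monitoring Prometheus, so the system that detects the outage is not the one that is failing.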
You can learn more about meta-monitoring in our Monitoring and Debugging Prometheus training.
Ensuring high availability in your Prometheus monitoring infrastructure is crucial for maintaining reliable monitoring and alerting capabilities. By implementing redundancy, failover strategies, and monitoring Prometheus itself, you can minimize the risk of downtime and data loss. This will allow your organization to react quickly to any issues and maintain the stability and performance of your infrastructure.
If you want to learn all Prometheus fundamentals from the ground up, take a look at our Prometheus training courses to get going today.