Combating alert fatigue at Cloudflare

In an in-depth blog post, Cloudflare’s Monika Singh explores the stressful environment faced by on-call staff, often illustrated by the “This is fine” meme. On-call engineers often deal with numerous alerts, leading to alert fatigue: a state of exhaustion caused by responding to unprioritized or unclear alerts. To combat this, Cloudflare teams regularly analyze their alerts to improve their accuracy and actionability. Singh’s blog post covers the importance of alert observability and Cloudflare’s methods for improving it using open source tools and best practices.

Singh explains how alert fatigue can affect on-call engineers’ sleep, social life and leisure activities, potentially leading to burnout. Regular alert analysis helps mitigate this by reducing unnecessary interruptions and making on-call work more efficient. Despite its importance, not all teams perform alert analysis. Singh stresses that analyzing alerts helps engineers write handover notes, helps managers assess burnout risk, and supports the writing of incident reports.

In the post, Singh goes over the basics of the Prometheus architecture at Cloudflare. Cloudflare relies heavily on Prometheus for monitoring, with data centers in over 310 cities and more than 1100 Prometheus servers in total. Prometheus collects metrics, evaluates rules, and fires alerts. Alertmanager centralizes and manages those alerts, and a webhook receiver stores them for analysis.

Alertmanager handles alerts by inhibiting, grouping, silencing, or routing them, depending on its configuration. However, not all alerts are optimally configured, which causes unnecessary interruptions. Cloudflare originally used alertmanager2es to store and report on alerts, but the tool was limited because it did not capture silenced or inhibited alerts. Singh highlights how Cloudflare got around this limitation by querying the Alertmanager API to capture all alert states.
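As a rough illustration of that last step, the sketch below polls the Alertmanager v2 API for active, silenced and inhibited alerts in one request and reduces each alert to a single state. It is a minimal sketch, not Cloudflare’s implementation: the function names (alerts_url, classify, fetch_alert_states) and the base URL are assumptions made for this example; only the /api/v2/alerts endpoint, its query parameters, and the status fields come from Alertmanager itself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def alerts_url(base_url):
    """Build an /api/v2/alerts URL that includes silenced and inhibited alerts."""
    query = urlencode({"active": "true", "silenced": "true", "inhibited": "true"})
    return f"{base_url.rstrip('/')}/api/v2/alerts?{query}"


def classify(alert):
    """Reduce an Alertmanager alert object to one state string.

    Alertmanager reports silenced and inhibited alerts as "suppressed",
    so the silencedBy / inhibitedBy lists are used to tell them apart.
    """
    status = alert.get("status", {})
    if status.get("silencedBy"):
        return "silenced"
    if status.get("inhibitedBy"):
        return "inhibited"
    return status.get("state", "unknown")


def fetch_alert_states(base_url):
    """Return {fingerprint: state} for every alert Alertmanager currently knows."""
    with urlopen(alerts_url(base_url), timeout=10) as resp:
        alerts = json.load(resp)
    return {alert["fingerprint"]: classify(alert) for alert in alerts}
```

Calling fetch_alert_states("http://localhost:9093") against a local Alertmanager (the address is an assumption) would return a dictionary keyed by fingerprint, which can then be joined with the data delivered by the webhook.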

Alert analysis pipeline

Cloudflare aggregates all alert states into a single data store by correlating data from the Alertmanager webhook and API, joined on each alert’s unique fingerprint field. Singh describes how the data is transformed using vector.dev (a pipeline for routing and transforming observability data) and stored in ClickHouse (an open source database for analytics) for analysis. ClickHouse makes this data efficient to work with, for example by allowing queries on specific labels and aggregations over the alert data.
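The correlation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Vector configuration Cloudflare uses: the row layout and the helper names (flatten_webhook, correlate, count_by_alertname) are hypothetical, while the payload fields (alerts, receiver, fingerprint, labels, status, startsAt) follow the Alertmanager webhook format.

```python
from collections import Counter


def flatten_webhook(payload):
    """Turn one Alertmanager webhook payload into one flat row per alert."""
    rows = []
    for alert in payload.get("alerts", []):
        rows.append({
            "fingerprint": alert["fingerprint"],
            "alertname": alert.get("labels", {}).get("alertname", ""),
            "receiver": payload.get("receiver", ""),
            "status": alert["status"],  # "firing" or "resolved"
            "starts_at": alert["startsAt"],
        })
    return rows


def correlate(rows, api_states):
    """Attach the state seen via the API to each webhook row, keyed by fingerprint."""
    return [
        {**row, "api_state": api_states.get(row["fingerprint"], "unknown")}
        for row in rows
    ]


def count_by_alertname(rows):
    """Aggregate rows per alert name, similar to a GROUP BY in the data store."""
    return Counter(row["alertname"] for row in rows)
```

In a setup like the one described, the join and aggregation would more likely happen in the pipeline and the database than in application code. The sketch only shows why a stable fingerprint is enough to link the two sources.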

Singh describes several dashboards that Cloudflare created to monitor alerts, including:

Alerts overview: General insights into the alerts received by Alertmanager.

Alertname overview: Detailed analysis of a specific alert.

Alerts overview by receiver: Insights specific to a team or receiver.

Alerts state timeline: A snapshot of alert volume and state changes over time.

Jiralerts overview: Insights into the alerts sent to the ticketing system.

Silences overview: Insights into Alertmanager silences.

Singh further explains that alerts are routed to teams, and that a team can own multiple services or components, which gives many possible combinations of alerts. A dashboard panel aggregates the number of alerts fired over time, showing which components are noisy and when. Additionally, a color-coded swimlane view per receiver shows how busy the on-call shift has been and highlights flapping alerts, whose state changes frequently. This helped the team tune the thresholds and durations in the alert rules.
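One simple way to surface flapping alerts from stored state history is to count state transitions per fingerprint. The sketch below is an assumption about how such a check could look, not the query behind Cloudflare’s dashboard: the event tuple layout, the function names, and the default threshold of three transitions are all chosen for illustration.

```python
from collections import defaultdict


def count_transitions(events):
    """Count state changes per alert.

    events: iterable of (fingerprint, timestamp, status) tuples, in any order.
    Timestamps only need to be sortable (for example ISO 8601 strings).
    """
    history = defaultdict(list)
    for fingerprint, timestamp, status in events:
        history[fingerprint].append((timestamp, status))

    transitions = {}
    for fingerprint, entries in history.items():
        entries.sort(key=lambda entry: entry[0])
        statuses = [status for _, status in entries]
        # A transition is any pair of consecutive observations that differ.
        transitions[fingerprint] = sum(
            1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur
        )
    return transitions


def flapping(events, threshold=3):
    """Return the fingerprints whose state changed at least `threshold` times."""
    counts = count_transitions(events)
    return sorted(fp for fp, n in counts.items() if n >= threshold)
```

An alert that repeatedly crosses its threshold shows up with a high transition count, which is a hint that the rule’s threshold or duration may need adjusting.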

The analysis revealed that some alerts fired without notification labels and some came from decommissioned clusters whose alerts had never been removed. Additionally, Alertmanager’s inhibitions sometimes failed, resulting in unnecessary alerts. By storing the alert data in ClickHouse, Cloudflare was able to identify and fix the configuration errors that caused these issues. Alertmanager also allows alerts to be silenced during maintenance or while they are being worked on. By analyzing the silences, Cloudflare was able to identify stale silences that were no longer relevant and ensure that alerts stayed relevant.
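Two of these checks are easy to express over data returned by the Alertmanager v2 API. The sketch below is illustrative only: it treats an active silence as stale when no current alert lists it in status.silencedBy, which is one possible definition and not necessarily the one Cloudflare uses, and the label name "notify" is a hypothetical placeholder for whatever label drives routing in a given setup.

```python
def stale_silence_ids(silences, alerts):
    """Return IDs of active silences that currently suppress no alert.

    silences: objects as returned by GET /api/v2/silences.
    alerts:   objects as returned by GET /api/v2/alerts (fetched with
              silenced=true, so silenced alerts are included).
    """
    in_use = set()
    for alert in alerts:
        in_use.update(alert.get("status", {}).get("silencedBy") or [])
    return [
        silence["id"]
        for silence in silences
        if silence.get("status", {}).get("state") == "active"
        and silence["id"] not in in_use
    ]


def alerts_missing_label(alerts, label="notify"):
    """Return fingerprints of alerts lacking the given routing label.

    "notify" is a hypothetical label name used only for this example.
    """
    return [
        alert["fingerprint"]
        for alert in alerts
        if not alert.get("labels", {}).get(label)
    ]
```

A silence flagged this way is only a candidate for review, since it may exist for an alert that is expected to fire later in a maintenance window.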

Singh provides a demo showing how to set up alert observability using Docker Compose, Prometheus, Alertmanager, Vector, ClickHouse, and Grafana. This setup allows users to explore pre-built demo dashboards and understand the alert analysis process used at Cloudflare.

Singh concludes that alert observability increases on-call efficiency and helps prevent burnout by minimizing unnecessary interruptions. Cloudflare’s approach has improved alert management and provides valuable insights for troubleshooting and tuning alert configurations. This proactive monitoring culture benefits all teams and creates a more manageable environment for on-call engineers.
