Prometheus Monitoring Guide Part 3 - Alerting
06 Jul 2021
- Part 1 - Installation and Instrumentation
- Part 2 - PromQL and Recording Rules
- Part 4 - Pushgateway and Blackbox Exporter
Alerting
Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service. Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements’ label sets.
Alerting rules are configured in Prometheus in the same way as recording rules:
# rules/alert_rules.yml
groups:
  - name: example
    rules:
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
- for - wait for a certain duration between first encountering a new expression output vector element and counting an alert as firing for this element (the alert is "pending" during this window)
- labels - a set of additional labels to be attached to the alert
- annotations - informational labels that can be used to store longer additional information such as alert descriptions or runbook links
The rule file is then referenced from the rule_files section of prometheus.yml:
rule_files:
  - "rules/alert_rules.yml"
Alerts can be monitored through the "Alerts" tab in the Prometheus dashboard, which shows whether each alert is currently inactive, pending, or firing.
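Alert state is also exposed as the built-in ALERTS time series, so the same information can be queried with PromQL, for example:
ALERTS{alertstate="firing"}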
Alertmanager
- another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions
- Prometheus is configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications: deduplicating, grouping, and routing them to the correct receiver integration such as email or PagerDuty
- provided as a single Go binary from https://prometheus.io/download/ so can be executed directly
./alertmanager
- by default runs on port 9093
- or with Docker
docker run --name alertmanager -d -p 9093:9093 quay.io/prometheus/alertmanager
- takes configuration from the alertmanager.yml file in the same directory
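When running with Docker, the configuration file can instead be mounted into the container; a sketch, assuming the image's default config path of /etc/alertmanager/alertmanager.yml:
docker run --name alertmanager -d -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  quay.io/prometheus/alertmanager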
Prometheus itself is pointed at the Alertmanager instance via the alerting section of prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"
Sending Email Notifications
The alertmanager.yml file defines a routing tree that determines how each alert should be managed. If no labels match a more specific route, the default root route is used:
route:
  receiver: admin
receivers:
  - name: admin
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: smtp.gmail.com:587
        auth_username: "[email protected]"
        auth_password: "abcdefghijklmnop"
Routing Tree
- Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
- configure Alertmanager to group alerts by their cluster and alertname so it sends a single compact notification.
- Inhibition is the concept of suppressing notifications for certain alerts if certain other alerts are already firing
- Silences are a way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree: incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence, and if they do, no notifications are sent out for that alert. Silences are configured through the Alertmanager UI (see the amtool sketch after the configuration example below). For known time-based quiet periods, add the condition to the underlying rule expression instead
- By default each alert running through the routing tree will halt after matching against the first receiver at the same level; the continue clause can be used to keep evaluating sibling routes:
route:
  receiver: admin # root fallback
  group_wait: 2m # how long to buffer alerts of the same group before sending the initial notification
  group_interval: 10s # how long to wait before sending a notification about new alerts added to an already firing group
  repeat_interval: 30m # how long to wait before sending a notification again if it has already been sent
  routes:
    - match_re:
        app_type: (linux|windows) # custom label specified in the rule definition file
      receiver: ss-admin # fallback receiver
      group_by: [severity] # group all alerts on a label to send a compact notification
      routes:
        - match:
            app_type: linux # match on more specific label
          receiver: linux-teamlead # target more specific receiver
          routes: # nested routes on different labels
            - match:
                severity: critical
              receiver: delivery-manager
              continue: true
            - match:
                severity: warning
              receiver: linux-teamlead
    - match_re:
        app_type: (python|go)
      receiver: pec-admin # fallback receiver
      routes:
        - match:
            app_type: python
          receiver: python-team-admin # fallback receiver
          routes:
            - match:
                severity: critical
              receiver: python-team-manager
            - match:
                severity: warning
              receiver: python-team-lead

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning" # mute warning alert if critical alert already raised in same app and category
    equal: ["app_type", "category"]

receivers:
  - name: linux-teamlead # name must match the receiver referenced in the routing tree
    email_configs:
      - to: "[email protected]"
  # ... remaining receivers (admin, ss-admin, delivery-manager, pec-admin, ...) are defined the same way
Checking Tree Syntax
To quickly check whether an alerting routing tree file is syntactically correct without starting the Alertmanager instance, you can use the amtool utility:
amtool check-config alertmanager.yml
Alternatively, the routing tree editor at https://prometheus.io/webtools/alerting/routing-tree-editor/ can be used to visualize a routing tree
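amtool can also evaluate which receiver a given label set would end up at; a sketch using labels from the routing tree above:
amtool config routes test --config.file=alertmanager.yml app_type=linux severity=critical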