Introduction
AlertManager is the alert-routing layer of our monitoring stack at The Remote Company. It receives alerts from Prometheus (or another alert source, e.g. VictoriaMetrics VMAlert), matches them to the appropriate receivers and delivers them to notification channels.
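For context, Prometheus is pointed at AlertManager through the alerting block in prometheus.yml. A minimal sketch (the target host and port below are placeholders, not our actual setup):

alerting:
  alertmanagers:
    - static_configs:
        # placeholder target; use the real AlertManager host:port
        - targets: ["alertmanager:9093"]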
@TODO:
Configuration
The only configuration point for AlertManager is alertmanager.yml. The main units in the configuration file are routes and receivers.
Routes
Routes are the part that decides where an alert ends up. They make routing decisions by matching severity levels and/or labels on the alerts received, and each route is mapped to a receiver configuration, which either executes the notification or lets the alert be evaluated further down the tree.
By default, receivers match all severity levels and an alert exits the routing tree on the first match. If we want a triggered alert to notify more than one channel in the routing tree, we have to set the continue keyword and let match_re decide where the notification will be sent.
An AlertManager routing configuration looks like the example below:
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: default_receiver
  routes:
    - receiver: default_receiver
      group_wait: 30s
    - receiver: infrastructure_team
      group_wait: 120s
      continue: true
      match_re:
        severity: warning|critical
    - receiver: development_team
      group_wait: 120s
      continue: true
      match_re:
        severity: app
In the above configuration, the infrastructure_team and development_team receivers match different severity levels, but an alert that matches either receiver will also notify the default_receiver, so 2 notifications will be sent (1 in the channel where the match happened, and 1 in the channel specified for default_receiver).
Receivers
Receivers have a name attribute and integration-specific configurations for different platforms (e.g. Slack, VictorOps, PagerDuty etc.). They are used to specify where the alert should land, and in what format. Templates written in Go's templating language can be used to modify the format/output of alerts.
A real-world receiver configuration looks like this:
receivers:
  - name: devops
    slack_configs:
      - api_url: <slack_webhook_url>
        channel: "#devops-alerts"
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        icon_emoji: '{{ template "slack.default.icon_emoji" . }}'
        color: '{{ template "slack.default.color" . }}'
        text: >-
          {{ range .Alerts -}}
          *Severity:* {{ .Labels.severity }}
          *Environment:* `{{ if .Labels.namespace }} {{ .Labels.namespace }} {{ else }} {{ .Labels.environment }} {{ end }}`
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ if .Annotations.dashboard_url }}
          *Dashboard URL:* {{ .Annotations.dashboard_url }} {{ end }}
          {{ if .Annotations.runbook_url }}
          *Runbook URL:* {{ .Annotations.runbook_url }} {{ end }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
          *Alert for all affected hosts:*
          {{ template "slack.default.text" . }}
        actions:
          - type: button
            text: "Dashboard :grafana:"
            url: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
          - type: button
            text: "Silence 🔕"
            url: '{{ template "__alert_silence_link" . }}'
@TODO: Add picture of how the above looks
Templating notifications
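Custom templates are defined in separate files, loaded via the top-level templates key in alertmanager.yml, and referenced by name from receivers. A minimal sketch, assuming a hypothetical template file and template name (the path and slack.custom.title are made up for illustration):

# alertmanager.yml — load custom template files (path is an example)
templates:
  - /etc/alertmanager/templates/*.tmpl

{{/* /etc/alertmanager/templates/custom.tmpl — hypothetical file */}}
{{ define "slack.custom.title" -}}
  [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{- end }}

A receiver can then reference it, e.g. title: '{{ template "slack.custom.title" . }}' instead of the default slack.default.title.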
Tools
amtool is a handy tool for interacting with AlertManager. Currently, it supports the following features/functions:
- manage silences
- query alerts
- validate config → amtool check-config <path-to-config>
- visualise routing trees → amtool config routes
- visualise routing for a specific alert/job/label → amtool config routes test 'alertname="HostHighCpuLoad"'
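Silences and alert queries can also be managed from the command line. A few illustrative invocations (the URL, label values, duration and comment below are placeholders):

# create a 2-hour silence for a specific alert
amtool silence add alertname="HostHighCpuLoad" --duration=2h --comment="planned maintenance" --author=devops

# list active silences / expire one by ID
amtool silence query
amtool silence expire <silence-id>

# query currently firing alerts against a specific AlertManager instance
amtool --alertmanager.url=http://localhost:9093 alert query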