Introduction
AlertManager is the alert-routing layer of our monitoring stack at The Remote Company. It receives alerts from Prometheus (or another alert source, e.g. VictoriaMetrics VMAlert), matches them to the appropriate receivers and delivers them to notification channels.
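For context, Prometheus is pointed at AlertManager through the alerting block in prometheus.yml. A minimal sketch (the target host and port below are placeholders, not our actual setup):

alerting:
  alertmanagers:
    - static_configs:
        # placeholder target; use the real AlertManager host:port
        - targets: ["alertmanager:9093"]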
@TODO:
Configuration
The only configuration point for AlertManager is alertmanager.yml. The main units in the configuration file are routes and receivers.
Routes
Routes are the part that decides where an alert ends up. They make routing decisions by matching severity levels and/or labels on the alerts received, and each route is mapped to a receiver configuration, which either executes the notification or lets the alert be evaluated further down the tree.
By default, receivers match all severity levels and an alert exits the routing tree on the first match. If we want a triggered alert to notify more than one channel in the routing tree, we have to set the continue keyword and let match_re decide where the notification will be sent.
An AlertManager routing configuration looks like the example below:
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: default_receiver
  routes:
    - receiver: default_receiver
      group_wait: 30s
    - receiver: infrastructure_team
      group_wait: 120s
      continue: true
      match_re:
        severity: warning|critical
    - receiver: development_team
      group_wait: 120s
      continue: true
      match_re:
        severity: app
In the above configuration, the infrastructure_team and development_team receivers match different severity levels, but an alert that matches either receiver will also notify the default_receiver, so 2 notifications will be sent (1 in the channel where the match happened, and 1 in the channel specified for default_receiver).
Receivers
Receivers have a name attribute and integration-specific configurations for different platforms (e.g. Slack, VictorOps, PagerDuty etc.). They are used to specify where the alert should land, and in what format. Templates written in Go's templating language can be used to modify the format/output of alerts.
A real-world receiver configuration looks like this:
receivers:
  - name: devops
    slack_configs:
      - api_url: <slack_webhook_url>
        channel: "#devops-alerts"
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        icon_emoji: '{{ template "slack.default.icon_emoji" . }}'
        color: '{{ template "slack.default.color" . }}'
        text: >-
          {{ range .Alerts -}}
          *Severity:* {{ .Labels.severity }}
          *Environment:* `{{ if .Labels.namespace }} {{ .Labels.namespace }} {{ else }} {{ .Labels.environment }} {{ end }}`
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ if .Annotations.dashboard_url }}
          *Dashboard URL:* {{ .Annotations.dashboard_url }} {{ end }}
          {{ if .Annotations.runbook_url }}
          *Runbook URL:* {{ .Annotations.runbook_url }} {{ end }}
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
          *Alert for all affected hosts:*
          {{ template "slack.default.text" . }}
        actions:
          - type: button
            text: "Dashboard :grafana:"
            url: "{{ (index .Alerts 0).Annotations.dashboard_url }}"
          - type: button
            text: "Silence 🔕"
            url: '{{ template "__alert_silence_link" . }}'
@TODO: Add picture of how the above looks
Templating notifications
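Custom templates are defined in separate files, loaded via the top-level templates key in alertmanager.yml, and referenced by name from receivers. A minimal sketch, assuming a hypothetical template file and template name (the path and slack.custom.title are made up for illustration):

# alertmanager.yml — load custom template files (path is an example)
templates:
  - /etc/alertmanager/templates/*.tmpl

{{/* /etc/alertmanager/templates/custom.tmpl — hypothetical file */}}
{{ define "slack.custom.title" -}}
  [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{- end }}

A receiver can then reference it, e.g. title: '{{ template "slack.custom.title" . }}' instead of the default slack.default.title.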
Tools
amtool is a handy tool for interacting with AlertManager. Currently, it supports the following features/functions:
- manage silences
- query alerts
- validate config → amtool check-config <path-to-config>
- visualise routing trees → amtool config routes
- visualise routing for a specific alert/job/label → amtool config routes test 'alertname="HostHighCpuLoad"'
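Silences and alert queries can also be managed from the command line. A few illustrative invocations (the URL, label values, duration and comment below are placeholders):

# create a 2-hour silence for a specific alert
amtool silence add alertname="HostHighCpuLoad" --duration=2h --comment="planned maintenance" --author=devops

# list active silences / expire one by ID
amtool silence query
amtool silence expire <silence-id>

# query currently firing alerts against a specific AlertManager instance
amtool --alertmanager.url=http://localhost:9093 alert query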