Operators
Node Monitoring
AlertManager

Alert Manager Setup

The following is a guide outlining the steps to setup AlertManager to send alerts when a Tangle node or DKG is being disrupted. If you do not have Tangle node setup yet, please review the Tangle Node Quickstart setup guide here.

In this guide we will configure the following modules to send alerts from a running Tangle node.

  • Alert Manager listens to Prometheus metrics and pushes an alert as soon as a threshold is crossed (CPU % usage for example).

What is Alert Manager?

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts. To learn more about Alertmanager, please visit the official docs site here (opens in a new tab).

Getting Started

Start by downloading the latest releases of the AlertManager (opens in a new tab).

This guide assumes the user has root access to the machine running the Tangle node, and following the below steps inside that machine. As well as, the user has already configured Prometheus on this machine.

1. Download Alertmanager

AMD version:

AMD
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.darwin-amd64.tar.gz

ARM version:

ARM
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.darwin-arm64.tar.gz

2. Extract the Downloaded Files:

Run the following command:

tar
tar xvf alertmanager-*.tar.gz

3. Copy the Extracted Files into /usr/local/bin:

Note: The example below makes use of the linux-amd64 installations, please update to make use of the target system you have installed.

Copy the alertmanager binary and amtool:

cp
sudo cp ./alertmanager-*.linux-amd64/alertmanager /usr/local/bin/ &&
sudo cp ./alertmanager-*.linux-amd64/amtool /usr/local/bin/

4. Create Dedicated Users:

Now we want to create dedicated users for the Alertmanager module we have installed:

useradd
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager

5. Create Directories for Alertmanager:

mkdir
sudo mkdir /etc/alertmanager &&
sudo mkdir /var/lib/alertmanager

6. Change the Ownership for all Directories:

We need to give our user permissions to access these directories:

alertManager:

chown
sudo chown alertmanager:alertmanager /etc/alertmanager/ -R &&
sudo chown alertmanager:alertmanager /var/lib/alertmanager/ -R &&
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager &&
sudo chown alertmanager:alertmanager /usr/local/bin/amtool

7. Finally, let's clean up these directories:

rm
rm -rf ./alertmanager*

Great! You have now installed and setup your environment. The next series of steps will be configuring the service.

Configuration

For implementation examples, refer to our GitHub. (opens in a new tab).

Prometheus

The first thing we need to do is add rules.yml file to our Prometheus configuration:

Let's create the rules.yml file that will give the rules for Alert manager:

nano
sudo touch /etc/prometheus/rules.yml
sudo nano /etc/prometheus/rules.yml

We are going to create 2 basic rules that will trigger an alert in case the instance is down or the CPU usage crosses 80%. You can create all kinds of rules that can triggered, refer to our full list. (opens in a new tab).

Add the following lines and save the file:

group
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance $labels.instance down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
 
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance bLd Kusama)
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

The criteria for triggering an alert are set in the expr: part. You can customize these triggers as you see fit.

Then, check the rules file:

promtool rules
promtool check rules /etc/prometheus/rules.yml

And finally, check the Prometheus config file:

promtool check
promtool check config /etc/prometheus/prometheus.yml

Gmail setup

We can use a Gmail address to send the alert emails. For that, we will need to generate an app password from our Gmail account.

Note: we recommend you here to use a dedicated email address for your alerts. Review Google's own guide for proper set-up (opens in a new tab).

Slack notifications

We can also utilize Slack notifications to send the alerts through. For that we need to a specific Slack channel to send the notifications to, and to install Incoming WebHooks Slack application.

To do so, navigate to:

  1. Administration > Manage Apps.
  2. Search for "Incoming Webhooks"
  3. Install into your Slack workspace.

Alertmanager

The Alert manager config file is used to set the external service that will be called when an alert is triggered. Here, we are going to use the Gmail and Slack notification created previously.

Let's create the file:

nano
sudo touch /etc/alertmanager/alertmanager.yml
sudo nano /etc/alertmanager/alertmanager.yml

And add the Gmail configuration to it and save the file:

Gmail config
global:
 resolve_timeout: 1m
 
route:
 receiver: 'gmail-notifications'
 
receivers:
- name: 'gmail-notifications'
  email_configs:
  - to: 'EMAIL-ADDRESS'
    from: 'EMAIL-ADDRESS'
    smarthost: 'smtp.gmail.com:587'
    auth_username: 'EMAIL-ADDRESS'
    auth_identity: 'EMAIL-ADDRESS'
    auth_password: 'EMAIL-ADDRESS'
    send_resolved: true
 
 
# ********************************************************************************************************************************************
# Alert Manager for Slack Notifications  *
# ********************************************************************************************************************************************
 
 global:
   resolve_timeout: 1m
   slack_api_url: 'INSERT SLACK API URL'
 
 route:
   receiver: 'slack-notifications'
 
 receivers:
 - name: 'slack-notifications'
   slack_configs:
   - channel: 'channel-name'
     send_resolved: true
     icon_url: https://avatars3.githubusercontent.com/u/3380462
     title: |-
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
      {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
        {{" "}}(
        {{- with .CommonLabels.Remove .GroupLabels.Names }}
          {{- range $index, $label := .SortedPairs -}}
            {{ if $index }}, {{ end }}
            {{- $label.Name }}="{{ $label.Value -}}"
          {{- end }}
        {{- end -}}
        )
      {{- end }}
     text: >-
      {{ range .Alerts -}}
      *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
      *Description:* {{ .Annotations.description }}
      *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}

Of course, you have to change the email addresses and the auth_password with the one generated from Google previously.

Service Setup

Alert manager

Create and open the Alert manager service file:

create service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << EOF
[Unit]
  Description=AlertManager Server Service
  Wants=network-online.target
  After=network-online.target
 
[Service]
  User=alertmanager
  Group=alertmanager
  Type=simple
  ExecStart=/usr/local/bin/alertmanager \
   --config.file /etc/alertmanager/alertmanager.yml \
   --storage.path /var/lib/alertmanager \
   --web.external-url=http://localhost:9093 \
   --cluster.advertise-address='0.0.0.0:9093'
 
[Install]
WantedBy=multi-user.target
EOF

Starting the Services

Launch a daemon reload to take the services into account in systemd:

daemon-reload
sudo systemctl daemon-reload

Next, we will want to start the alertManager service:

alertManager:

start service
sudo systemctl start alertmanager.service

And check that they are working fine:

alertManager::

status
sudo systemctl status alertmanager.service

If everything is working adequately, activate the services!

alertManager:

enable
sudo systemctl enable alertmanager.service

Amazing! We have now successfully added alert monitoring for our Tangle node!