Custom Alertmanager alerts based on service config

Problem:

We have a set of microservices that expose metadata such as the owning team and Slack channels. We wanted to route application alerts to the channels listed in that metadata, but Alertmanager's static routing configuration cannot look up this kind of per-service information on its own.

Solution:

Inspired by this article: https://zhimin-wen.medium.com/custom-notifications-with-alert-managers-webhook-receiver-in-kubernetes-8e1152ba2c31
we came up with a solution in which Alertmanager sends alerts to a custom webhook, and a small service looks up the service config and routes each alert to its destination, such as a Slack channel.

Preface:
Service config is the metadata that each application exposes as a JSON object, for example:

"metadata":{
"name":"sample-application",
"description":"description",
"protocol":"THRIFT",
"port":8080,
"namespace":"namespace",
"priority":1,
"team":{
"name":"Core",
"chat":"Core Platform"
},
"stage":"Live",
"monitoring":{
"group":"group",
},
"owners":[
{
"name":"",
"chat":"sample-app"
}
]
}
},
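
The webhook in Step 2 walks this structure via metadata, then owners, then chat to pick the Slack channel for a service. A trimmed-down illustration using only the fields it actually reads:

import json

# Trimmed copy of the sample service config above, keeping only the fields the webhook uses.
sample = json.loads('''
{
  "metadata": {
    "name": "sample-application",
    "owners": [{"name": "", "chat": "sample-app"}]
  }
}
''')

# Each owner's "chat" field becomes a Slack channel name.
channels = ["#" + owner["chat"] for owner in sample["metadata"]["owners"]]
print(channels)  # ['#sample-app']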

Step 1: Create a Slack bot that will send messages on behalf of your application.

  • Create a new Slack bot application and grant it the following OAuth scopes:
  — channels:read
  — chat:write
  • Generate a bot token and store it for later use; you can find it in your app's settings under Install App > Bot User OAuth Access Token. A quick check that the token works is shown below.
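
Before wiring the token into the webhook, it can be sanity-checked with a short slack_sdk snippet (the token is assumed to be exported as SLACK_BOT_TOKEN, the same variable name the webhook uses later):

import os

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
try:
    # auth.test confirms the token is valid and returns the bot's identity
    identity = client.auth_test()
    print(f"Token OK, connected as {identity['user']}")
except SlackApiError as e:
    print(f"Token check failed: {e.response['error']}")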

Step 2: A Python script that receives alerts from Alertmanager, looks up the service config, parses the relevant metadata, and routes each alert to the owning team's channel

Create a ConfigMap from the Python script below (the exact command follows the script); this lets us mount the script inside the Kubernetes pod.

## entrypoint.py Script
import json
import os
from collections import namedtuple

import requests
from flask import Flask, request
from gevent.pywsgi import WSGIServer
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def webhook():
    # Parse the Alertmanager payload into namedtuples so fields can be
    # accessed as attributes (e.g. alert.labels.container).
    prometheus_data = json.loads(
        request.data,
        object_hook=lambda d: namedtuple('X', d.keys())(*d.values()))
    config_url = os.environ['URL']
    environment = os.environ['ENV']

    # Containers that currently have a firing PodCrashingStatusCritical alert.
    service_list = [alert.labels.container
                    for alert in prometheus_data.alerts
                    if alert.status == "firing"
                    and alert.labels.alertname == "PodCrashingStatusCritical"]

    # Look up each affected service in the service-config registry and notify its owners.
    registry = requests.get(config_url).json()
    client = WebClient(token=os.environ['SLACK_BOT_TOKEN'])
    for service in registry.values():
        metadata = service.get("metadata")
        if not metadata or metadata.get("name") not in service_list:
            continue
        for owner in metadata["owners"]:
            channel = "#{}".format(owner["chat"])
            message = "{0} pods are crashing in {1} environment".format(
                metadata["name"], environment)
            try:
                client.chat_postMessage(channel=channel, text=message)
            except SlackApiError as e:
                print(f"Got an error: {e.response['error']}")
    return "200"


if __name__ == '__main__':
    WSGIServer(('0.0.0.0', 5000), app).serve_forever()
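
With the script saved as entrypoint.py, create the ConfigMap that the Deployment below mounts (the name python-webhook and the monitor namespace match the rest of the manifests in this post):

kubectl create configmap python-webhook -n monitor --from-file=entrypoint.py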

Step 3: Deploy the script

  • A Dockerfile is created to wrap the script into a Docker container:
Dockerfile
FROM python:3.9.5-buster
RUN pip3 install flask gevent requests slack_sdk
  • Create a Secret with the parameters required by the Python script:
kubectl create secret generic python-webhook-secret -n monitor --from-literal=SLACK_BOT_TOKEN=XXXXXXXXXXXXXXXXXX --from-literal=URL=http://service-config/?metadata=true --from-literal=ENV=stage
  • Create a Deployment spec to run the container in the existing Kubernetes cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  selector:
    matchLabels:
      app: webhook-test
  replicas: 1
  template:
    metadata:
      labels:
        app: webhook-test
    spec:
      containers:
        - name: webhook-test
          image: gcr.io/rep-ops/python-webhook:74
          command: ["python3", "/entrypoint/entrypoint.py"]
          env:
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: python-webhook-secret
                  key: SLACK_BOT_TOKEN
            - name: URL
              valueFrom:
                secretKeyRef:
                  name: python-webhook-secret
                  key: URL
            - name: ENV
              valueFrom:
                secretKeyRef:
                  name: python-webhook-secret
                  key: ENV
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: entrypoint-volume
              mountPath: /entrypoint/
      volumes:
        - name: entrypoint-volume
          configMap:
            name: python-webhook
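
The Alertmanager configuration in Step 4 reaches this pod at http://webhook-test:5000/webhook, which assumes a Service named webhook-test sits in front of the Deployment. Such a Service is not shown above; a minimal sketch (namespace assumed to match the Alertmanager stack) would be:

apiVersion: v1
kind: Service
metadata:
  name: webhook-test
  namespace: monitor        # assumed: same namespace as the Alertmanager ConfigMap
spec:
  selector:
    app: webhook-test       # matches the Deployment's pod labels
  ports:
    - port: 5000
      targetPort: 5000      # port exposed by the Flask/gevent server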

Step 4: Once deployed, the webhook is reachable at a URL that is used in the Alertmanager config.

apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-prometheus-alertmanager
  namespace: monitor
data:
  alertmanager.yml: |-
    global:
    ----
    receivers:
      - name: critical-receiver
        webhook_configs:
          - url: "http://webhook-test:5000/webhook"

Sample payload that Alertmanager sends to the Python webhook when the PodCrashingStatusCritical alert fires:

{
  "receiver": "critical-receiver",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "PodCrashingStatusCritical",
        "container": "r4e-example",
        "endpoint": "http",
        "environment": "e2e",
        "instance": "XX.XX.XX.XXX:8080",
        "job": "kube-state-metrics",
        "namespace": "e2e",
        "pod": "r4e-example-86886cf697-r9g4b",
        "prometheus": "monitor/prometheus-prometheus",
        "reason": "CrashLoopBackOff",
        "service": "prometheus-kube-state-metrics",
        "severity": "critical"
      },
      "annotations": {
        "message": "One or more pods in CrashLoopBackOff state for the last 15mins",
        "runbook_url": "http://docs.reputation.ec2/devops/monitoring/prometheus-alert-runbooks/#alert-name-podcrashingstatus",
        "summary": "Pod e2e/r4e-example-86886cf697-r9g4b (r4e-example) is restarting with reason CrashLoopBackOff"
      },
      "startsAt": "2021-05-14T10:33:49.738Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
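
To exercise the webhook end to end, the payload above can be saved to a file (assumed here to be sample-alert.json) and POSTed to the service, for example by port-forwarding to the pod; this also requires the service-config URL and the Slack token from the Secret to be valid:

# Forward the webhook port locally (assumes the Deployment runs in the monitor namespace)
kubectl port-forward -n monitor deploy/webhook-test 5000:5000 &
# Replay the sample Alertmanager payload against the webhook endpoint
curl -X POST -H "Content-Type: application/json" --data @sample-alert.json http://localhost:5000/webhook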

Sample Slack alert posted into our Slack channel by the Python webhook:
