
Monitoring ExaBGP

Production monitoring for ExaBGP deployments

📊 Monitor health, performance, and BGP state to ensure high availability


Table of Contents

  • Overview
  • What to Monitor
  • Monitoring Tools
  • Process Monitoring
  • BGP Session Monitoring
  • Route Monitoring
  • Performance Metrics
  • Alerting
  • Log Monitoring
  • Dashboard Examples
  • Best Practices
  • Next Steps

Overview

Production monitoring ensures:

  • The ExaBGP process stays healthy
  • BGP sessions stay established
  • Routes are announced correctly
  • Performance stays acceptable
  • Problems are detected early

What to Monitor

Critical Metrics

1. Process Health

  • Is ExaBGP running?
  • Process uptime
  • CPU usage
  • Memory usage
  • Process restarts

2. BGP Session State

  • Session established?
  • Session uptime
  • Session flaps (up/down events)
  • Keepalive/hold-time

3. Route Announcements

  • Number of routes announced
  • Number of routes withdrawn
  • Route changes per minute
  • Active routes count

4. API Process Health

  • API process running?
  • API process restarts
  • API command rate
  • API errors

5. System Health

  • Network connectivity
  • Disk space
  • System load

Monitoring Tools

Option 1: Prometheus + Grafana

Most popular stack for modern monitoring

Architecture

ExaBGP → node_exporter → Prometheus → Grafana
         (metrics)        (storage)    (visualization)

Setup

1. Install node_exporter:

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run
./node_exporter &

2. Create custom metrics exporter for ExaBGP:

#!/usr/bin/env python3
"""
ExaBGP Prometheus exporter
Exposes metrics on :9101/metrics
"""
from prometheus_client import start_http_server, Gauge, Counter, Info
import subprocess
import time

# Define metrics
exabgp_up = Gauge('exabgp_up', 'ExaBGP process status (1=up, 0=down)')
exabgp_bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session status', ['neighbor'])
exabgp_routes_announced = Gauge('exabgp_routes_announced', 'Number of routes announced')
exabgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn')
exabgp_process_restarts = Counter('exabgp_process_restarts_total', 'Total process restarts')
exabgp_info = Info('exabgp', 'ExaBGP version information')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    try:
        result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False

def get_bgp_session_state(neighbor_ip):
    """Check BGP session state (example - adapt to your setup)"""
    # This would query your router or parse ExaBGP logs
    # For demo, return True
    return True

_last_up = True  # remember the previous poll so restarts are counted once

def update_metrics():
    """Update all metrics"""
    global _last_up

    # Process status
    running = check_exabgp_running()
    exabgp_up.set(1 if running else 0)

    # Count a restart when the process reappears after having been down
    if running and not _last_up:
        exabgp_process_restarts.inc()
    _last_up = running

    # BGP session status (example)
    neighbors = ['192.168.1.1', '192.168.1.2']
    for neighbor in neighbors:
        state = get_bgp_session_state(neighbor)
        exabgp_bgp_session_up.labels(neighbor=neighbor).set(1 if state else 0)

def set_version_info():
    """Record the ExaBGP version once at startup"""
    try:
        result = subprocess.run(['exabgp', '--version'], capture_output=True, text=True)
        exabgp_info.info({'version': result.stdout.strip()})
    except OSError:
        pass

if __name__ == '__main__':
    # Start metrics server
    start_http_server(9101)  # Metrics on :9101/metrics
    set_version_info()

    print("ExaBGP Prometheus exporter started on :9101/metrics")

    while True:
        update_metrics()
        time.sleep(15)  # Update every 15 seconds

3. Configure Prometheus (/etc/prometheus/prometheus.yml):

scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'exabgp-server-1'

4. Create Grafana dashboard (see Dashboard Examples)


Option 2: Nagios / Icinga

Traditional monitoring tools

Check script:

#!/bin/bash
# Nagios check for ExaBGP

# Check if ExaBGP running
if ! pgrep -f exabgp > /dev/null; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check BGP session (example - adjust for your setup)
# Parse exabgp logs or query router

echo "OK: ExaBGP running"
exit 0

Nagios config:

define service {
    use                     generic-service
    host_name               exabgp-server
    service_description     ExaBGP Process
    check_command           check_exabgp
    check_interval          1
}

Option 3: Datadog / New Relic

SaaS monitoring platforms

Datadog custom check:

from datadog import initialize, api
import subprocess

# Initialize
initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    return result.returncode == 0

# Send metric
def send_metric(metric_name, value, tags=None):
    api.Metric.send(
        metric=metric_name,
        points=value,
        tags=tags or []
    )

# Check ExaBGP
if check_exabgp_running():
    send_metric('exabgp.up', 1, tags=['env:prod'])
else:
    send_metric('exabgp.up', 0, tags=['env:prod'])

Process Monitoring

Systemd Monitoring

Systemd service with automatic restart:

# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10
User=exabgp
Group=exabgp

# Monitoring
StandardOutput=append:/var/log/exabgp.log
StandardError=append:/var/log/exabgp.log

[Install]
WantedBy=multi-user.target

Monitor service state:

# Check status
systemctl status exabgp

# Monitor restarts
journalctl -u exabgp -f | grep -i restart
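
Where systemd manages the unit, restart counts can also be read directly from systemd. A minimal sketch using the NRestarts property (assumes a reasonably recent systemd that exposes it):

#!/usr/bin/env python3
# Read systemd's restart counter for the exabgp unit (sketch).
# Assumes a systemd version that exposes the NRestarts property.
import subprocess

def get_restart_count(unit='exabgp'):
    """Return how many times systemd has restarted the unit, or None."""
    result = subprocess.run(
        ['systemctl', 'show', unit, '--property=NRestarts'],
        capture_output=True, text=True)
    # Output looks like "NRestarts=3"
    if result.returncode == 0 and result.stdout.startswith('NRestarts='):
        value = result.stdout.strip().split('=', 1)[1]
        if value.isdigit():
            return int(value)
    return None

if __name__ == '__main__':
    print(f"exabgp restarts: {get_restart_count()}")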

Process Watchdog

Simple watchdog script:

#!/bin/bash
# ExaBGP watchdog - restarts ExaBGP if the process has died
# Intended to be run periodically from cron (the schedule sets the check interval)

if ! pgrep -f exabgp > /dev/null; then
    echo "$(date): ExaBGP not running, restarting..." | tee -a /var/log/exabgp-watchdog.log
    systemctl restart exabgp
fi

Run it from cron every five minutes:

*/5 * * * * /usr/local/bin/exabgp-watchdog.sh

BGP Session Monitoring

Router-Based Monitoring

Query router for BGP session state:

Cisco (SNMP):

#!/usr/bin/env python3
"""
Check BGP session state via SNMP
"""
from pysnmp.hlapi import *

def check_bgp_session(router_ip, neighbor_ip, community='public'):
    """
    Query BGP session state via SNMP
    Returns: True if established, False otherwise
    """
    # BGP peer state OID
    oid = ObjectIdentity('1.3.6.1.2.1.15.3.1.2.' + neighbor_ip)

    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((router_ip, 161)),
               ContextData(),
               ObjectType(oid))
    )

    if errorIndication or errorStatus:
        return False

    # BGP state: 6 = Established
    state = int(varBinds[0][1])
    return state == 6

# Check session
if check_bgp_session('192.168.1.1', '192.168.1.2'):
    print("BGP session established")
else:
    print("BGP session down!")

Log-Based Monitoring

Parse ExaBGP logs for session state:

#!/bin/bash
# Check BGP session from logs

LOG="/var/log/exabgp.log"
NEIGHBOR="192.168.1.1"

# Check for recent "neighbor up" message
if tail -100 "$LOG" | grep -q "neighbor $NEIGHBOR up"; then
    echo "OK: BGP session to $NEIGHBOR established"
    exit 0
else
    echo "CRITICAL: BGP session to $NEIGHBOR not established"
    exit 2
fi
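
The grep above only looks for a recent "up" line, so a later "down" can be missed. A slightly more careful sketch takes whichever state was logged last for the neighbor (the "neighbor <ip> up/down" wording is the same assumption as above - verify it against your actual log output):

#!/usr/bin/env python3
# Report BGP session state from the last up/down message in the log (sketch).
# The "neighbor <ip> up/down" wording is an assumption - adjust to your logs.
import sys

LOG = '/var/log/exabgp.log'
NEIGHBOR = '192.168.1.1'

def last_session_state(path, neighbor):
    """Return 'up', 'down', or None depending on the last message seen."""
    state = None
    with open(path) as f:
        for line in f:
            if f'neighbor {neighbor} up' in line:
                state = 'up'
            elif f'neighbor {neighbor} down' in line:
                state = 'down'
    return state

if __name__ == '__main__':
    state = last_session_state(LOG, NEIGHBOR)
    if state == 'up':
        print(f"OK: BGP session to {NEIGHBOR} established")
        sys.exit(0)
    print(f"CRITICAL: BGP session to {NEIGHBOR} not established (last state: {state})")
    sys.exit(2)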

Route Monitoring

Track Announced Routes

Monitor route announcements:

#!/usr/bin/env python3
"""
Monitor routes announced by ExaBGP
Parse logs and track counts
"""
import re
import time
from collections import defaultdict

routes_announced = defaultdict(int)
routes_withdrawn = defaultdict(int)

def parse_log_line(line):
    """Parse log line for route announcements/withdrawals"""
    # Match: announce route 100.10.0.100/32
    announce_match = re.search(r'announce route ([\d\.]+/\d+)', line)
    if announce_match:
        prefix = announce_match.group(1)
        routes_announced[prefix] += 1
        return ('announce', prefix)

    # Match: withdraw route 100.10.0.100/32
    withdraw_match = re.search(r'withdraw route ([\d\.]+/\d+)', line)
    if withdraw_match:
        prefix = withdraw_match.group(1)
        routes_withdrawn[prefix] += 1
        return ('withdraw', prefix)

    return None

def export_route_metrics():
    """Placeholder: push the counters to your metrics system (Prometheus, etc.)"""
    pass

def monitor_routes():
    """Follow the log and report route changes"""
    with open('/var/log/exabgp.log', 'r') as f:
        # Seek to end so only new lines are processed
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                result = parse_log_line(line)
                if result:
                    action, prefix = result
                    print(f"[{action.upper()}] {prefix}")

                    # Export metrics (Prometheus, etc.)
                    export_route_metrics()
            else:
                # No new data yet - wait before polling again
                time.sleep(1)

if __name__ == '__main__':
    monitor_routes()

Router-Side Verification

Verify routes on router:

#!/bin/bash
# Check if expected routes are on router

ROUTER="192.168.1.1"
EXPECTED_ROUTES=("100.10.0.100" "100.10.0.101" "100.10.0.102")

for route in "${EXPECTED_ROUTES[@]}"; do
    # SSH to router and check
    if ssh $ROUTER "show ip bgp $route" | grep -q "BGP routing table entry"; then
        echo "OK: Route $route present"
    else
        echo "CRITICAL: Route $route missing!"
    fi
done

Performance Metrics

CPU and Memory

Monitor resource usage:

#!/usr/bin/env python3
import psutil
import subprocess

def get_exabgp_pid():
    """Get ExaBGP process PID"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True, text=True)
    if result.returncode == 0:
        return int(result.stdout.strip().split('\n')[0])
    return None

def get_process_metrics(pid):
    """Get CPU and memory usage"""
    try:
        process = psutil.Process(pid)
        return {
            'cpu_percent': process.cpu_percent(interval=1),
            'memory_mb': process.memory_info().rss / 1024 / 1024,
            'num_threads': process.num_threads(),
        }
    except psutil.NoSuchProcess:
        return None

pid = get_exabgp_pid()
if pid:
    metrics = get_process_metrics(pid)
    if metrics:
        print(f"CPU: {metrics['cpu_percent']}%")
        print(f"Memory: {metrics['memory_mb']:.1f} MB")
        print(f"Threads: {metrics['num_threads']}")

API Process Metrics

Monitor API process health:

import subprocess

def check_api_process():
    """Check if API healthcheck process is running"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True)
    return result.returncode == 0

def count_api_processes():
    """Count number of API processes"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True, text=True)
    if result.stdout:
        return len(result.stdout.strip().split('\n'))
    return 0
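
These checks can be folded into the Prometheus exporter shown earlier. A small sketch (the exabgp_api_* metric names are illustrative, and the functions are the ones defined just above):

# Sketch: expose the API process checks as Prometheus gauges,
# reusing check_api_process() and count_api_processes() from above.
from prometheus_client import Gauge

exabgp_api_up = Gauge('exabgp_api_up', 'API process status (1=up, 0=down)')
exabgp_api_processes = Gauge('exabgp_api_processes', 'Number of API processes')

def update_api_metrics():
    exabgp_api_up.set(1 if check_api_process() else 0)
    exabgp_api_processes.set(count_api_processes())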

Alerting

Alert Conditions

When to alert:

Critical:

  • ExaBGP process down
  • BGP session down
  • No routes announced (expected routes missing)
  • API process crashed

Warning:

  • High CPU usage (> 80%)
  • High memory usage (> 90%)
  • Route flapping
  • BGP session flaps

Info:

  • Route changes
  • Process restarts
  • Configuration reloads

Alert Methods

1. Email Alerts:

import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    """Send email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'exabgp@example.com'
    msg['To'] = 'ops@example.com'

    smtp = smtplib.SMTP('localhost')
    smtp.send_message(msg)
    smtp.quit()

# Example usage (check_exabgp_running() is the helper defined in the exporter above)
if not check_exabgp_running():
    send_email_alert(
        "CRITICAL: ExaBGP Down",
        "ExaBGP process is not running on server-1"
    )

2. Slack Alerts:

import requests
import time

def send_slack_alert(message, severity='warning'):
    """Send Slack alert"""
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

    colors = {
        'critical': '#FF0000',
        'warning': '#FFA500',
        'info': '#0000FF'
    }

    payload = {
        "attachments": [{
            "color": colors.get(severity, '#808080'),
            "title": f"ExaBGP Alert - {severity.upper()}",
            "text": message,
            "footer": "ExaBGP Monitoring",
            "ts": int(time.time())
        }]
    }

    requests.post(webhook_url, json=payload)

# Example
send_slack_alert("BGP session to 192.168.1.1 down!", severity='critical')

3. PagerDuty:

import requests

def send_pagerduty_alert(description, severity='error'):
    """Trigger PagerDuty incident"""
    url = "https://events.pagerduty.com/v2/enqueue"

    payload = {
        "routing_key": "YOUR_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": description,
            "severity": severity,  # critical, error, warning, info
            "source": "exabgp-monitor",
        }
    }

    requests.post(url, json=payload)

# Example
send_pagerduty_alert("ExaBGP process down!", severity='critical')
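
The three channels above can be tied back to the alert conditions with a small dispatcher. A sketch that routes by severity, reusing the send_* helpers defined above (the routing policy itself is just an example):

# Sketch: route an alert to the appropriate channel(s) based on severity,
# using the send_email_alert / send_slack_alert / send_pagerduty_alert helpers above.
def dispatch_alert(message, severity='warning'):
    if severity == 'critical':
        send_pagerduty_alert(message, severity='critical')  # page on-call
        send_slack_alert(message, severity='critical')      # and notify the channel
        send_email_alert(f"CRITICAL: {message}", message)
    elif severity == 'warning':
        send_slack_alert(message, severity='warning')
    else:
        send_slack_alert(message, severity='info')

# Example
dispatch_alert("BGP session to 192.168.1.1 down!", severity='critical')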

Log Monitoring

Log Rotation

Configure logrotate (copytruncate is used because systemd keeps the log file open through the append: redirection shown above):

# /etc/logrotate.d/exabgp
/var/log/exabgp.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

Log Analysis

Search for errors:

# Critical errors
grep -i "error\|critical\|fatal" /var/log/exabgp.log

# BGP session changes
grep "neighbor.*up\|neighbor.*down" /var/log/exabgp.log

# Route changes
grep "announce\|withdraw" /var/log/exabgp.log | tail -20

Centralized Logging

Ship logs to ELK/Splunk:

Filebeat config (/etc/filebeat/filebeat.yml):

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/exabgp.log
  fields:
    service: exabgp
    environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "exabgp-%{+yyyy.MM.dd}"

Dashboard Examples

Grafana Dashboard

Example panels:

Panel 1: ExaBGP Status

  • Metric: exabgp_up
  • Visualization: Stat
  • Thresholds: 1 = green, 0 = red

Panel 2: BGP Sessions

  • Metric: exabgp_bgp_session_up (one series per neighbor label)
  • Visualization: Table
  • Show all neighbors with status

Panel 3: Routes Announced

  • Metric: exabgp_routes_announced
  • Visualization: Graph
  • Time series of routes

Panel 4: CPU Usage

  • Metric: rate(process_cpu_seconds_total{job="exabgp"}[5m])
  • Visualization: Graph

Panel 5: Memory Usage

  • Metric: process_resident_memory_bytes{job="exabgp"}
  • Visualization: Graph

Sample Grafana JSON

{
  "dashboard": {
    "title": "ExaBGP Monitoring",
    "panels": [
      {
        "title": "ExaBGP Status",
        "targets": [
          {
            "expr": "exabgp_up"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Routes Announced",
        "targets": [
          {
            "expr": "exabgp_routes_announced"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

Best Practices

1. Monitor from Multiple Perspectives

✅ ExaBGP process health (on ExaBGP server)
✅ BGP session state (on router via SNMP/SSH)
✅ Route presence (on router)
✅ End-to-end connectivity (client perspective)

2. Set Appropriate Thresholds

Avoid alert fatigue:

# Good thresholds
CPU_WARNING = 80%
CPU_CRITICAL = 95%

MEMORY_WARNING = 80%
MEMORY_CRITICAL = 90%

BGP_SESSION_DOWN_THRESHOLD = 2 checks (dampening)
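
A minimal sketch of the dampening idea: only raise an alert after the same check has failed a set number of consecutive times, so one missed poll does not page anyone (the threshold mirrors the value above):

# Sketch: simple dampening - alert only after N consecutive failures.
FAILURE_THRESHOLD = 2  # matches BGP_SESSION_DOWN_THRESHOLD above

class DampenedCheck:
    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Return True only when the failure threshold has just been crossed."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures == self.threshold

# Example: alert fires once, on the second consecutive failure
session_check = DampenedCheck()
for ok in [True, False, False, False]:
    if session_check.record(ok):
        print("ALERT: BGP session down for 2 consecutive checks")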

3. Monitor Trends

Track over time:

  • Route announcement rate
  • BGP session uptime
  • Resource usage trends
  • Error rates

4. Implement Health Checks

Synthetic monitoring:

#!/bin/bash
# End-to-end health check

# Check if service IP responds
if curl -sf http://100.10.0.100/health > /dev/null; then
    echo "OK: Service responding"
else
    echo "CRITICAL: Service not responding"
fi

5. Document Normal Baselines

Know what's normal:

  • Typical CPU usage: 5-10%
  • Typical memory: 50-100 MB
  • Expected routes: 10
  • BGP session uptime: > 30 days
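
One way to make these baselines actionable is to encode them and flag deviations. A small sketch (the numbers are the illustrative baselines above, not universal values):

# Sketch: compare current readings against documented baselines.
# The ranges below are illustrative - record your own normal values.
BASELINES = {
    'cpu_percent': (0, 10),   # typical 5-10%
    'memory_mb': (0, 100),    # typical 50-100 MB
    'routes': (10, 10),       # expected route count
}

def check_baselines(current):
    """Return human-readable descriptions of values outside their baseline range."""
    problems = []
    for name, (low, high) in BASELINES.items():
        value = current.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} outside baseline {low}-{high}")
    return problems

# Example
print(check_baselines({'cpu_percent': 35, 'memory_mb': 80, 'routes': 10}))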

Next Steps

Learn More

Tools


Ready to set up monitoring? See Quick Start →


👻 Ghost written by Claude (Anthropic AI)
