
Monitoring ExaBGP

Production monitoring for ExaBGP deployments

📊 Monitor health, performance, and BGP state to ensure high availability


Table of Contents

  • Overview
  • What to Monitor
  • Monitoring Tools
  • Process Monitoring
  • BGP Session Monitoring
  • Route Monitoring
  • Performance Metrics
  • Alerting
  • Log Monitoring
  • Dashboard Examples
  • Best Practices
  • Next Steps

Overview

Production monitoring ensures:

  • The ExaBGP process stays healthy
  • BGP sessions stay established
  • Routes are announced correctly
  • Performance stays acceptable
  • Problems are detected early

What to Monitor

Critical Metrics

1. Process Health

  • Is ExaBGP running?
  • Process uptime
  • CPU usage
  • Memory usage
  • Process restarts

2. BGP Session State

  • Session established?
  • Session uptime
  • Session flaps (up/down events)
  • Keepalive/hold-time

3. Route Announcements

  • Number of routes announced
  • Number of routes withdrawn
  • Route changes per minute
  • Active routes count

4. API Process Health

  • API process running?
  • API process restarts
  • API command rate
  • API errors

5. System Health

  • Network connectivity
  • Disk space
  • System load

Monitoring Tools

Option 1: Prometheus + Grafana

Most popular stack for modern monitoring

Architecture

ExaBGP → node_exporter → Prometheus → Grafana
         (metrics)        (storage)    (visualization)

Setup

1. Install node_exporter:

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run
./node_exporter &

2. Create custom metrics exporter for ExaBGP:

#!/usr/bin/env python3
"""
ExaBGP Prometheus exporter
Exposes metrics on :9101/metrics
"""
from prometheus_client import start_http_server, Gauge, Counter, Info
import subprocess
import time

# Define metrics
exabgp_up = Gauge('exabgp_up', 'ExaBGP process status (1=up, 0=down)')
exabgp_bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session status', ['neighbor'])
exabgp_routes_announced = Gauge('exabgp_routes_announced', 'Number of routes announced')
exabgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn')
exabgp_process_restarts = Counter('exabgp_process_restarts_total', 'Total process restarts')
exabgp_info = Info('exabgp', 'ExaBGP version information')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    try:
        result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False

def get_bgp_session_state(neighbor_ip):
    """Check BGP session state (example - adapt to your setup)"""
    # This would query your router or parse ExaBGP logs
    # For demo, return True
    return True

_last_up = True  # remember the previous poll so restarts are counted once

def update_metrics():
    """Update all metrics"""
    global _last_up

    # Process status
    running = check_exabgp_running()
    exabgp_up.set(1 if running else 0)

    # Count a restart when the process reappears after having been down
    if running and not _last_up:
        exabgp_process_restarts.inc()
    _last_up = running

    # BGP session status (example)
    neighbors = ['192.168.1.1', '192.168.1.2']
    for neighbor in neighbors:
        state = get_bgp_session_state(neighbor)
        exabgp_bgp_session_up.labels(neighbor=neighbor).set(1 if state else 0)

def set_version_info():
    """Record the ExaBGP version once at startup"""
    try:
        result = subprocess.run(['exabgp', '--version'], capture_output=True, text=True)
        exabgp_info.info({'version': result.stdout.strip()})
    except OSError:
        pass

if __name__ == '__main__':
    # Start metrics server
    start_http_server(9101)  # Metrics on :9101/metrics
    set_version_info()

    print("ExaBGP Prometheus exporter started on :9101/metrics")

    while True:
        update_metrics()
        time.sleep(15)  # Update every 15 seconds

3. Configure Prometheus (/etc/prometheus/prometheus.yml):

scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'exabgp-server-1'

4. Create Grafana dashboard (see Dashboard Examples)


Option 2: Nagios / Icinga

Traditional monitoring tools

Check script:

#!/bin/bash
# Nagios check for ExaBGP

# Check if ExaBGP running
if ! pgrep -f exabgp > /dev/null; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check BGP session (example - adjust for your setup)
# Parse exabgp logs or query router

echo "OK: ExaBGP running"
exit 0

Nagios config:

define service {
    use                     generic-service
    host_name               exabgp-server
    service_description     ExaBGP Process
    check_command           check_exabgp
    check_interval          1
}

Option 3: Datadog / New Relic

SaaS monitoring platforms

Datadog custom check:

from datadog import initialize, api
import subprocess

# Initialize
initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    return result.returncode == 0

# Send metric
def send_metric(metric_name, value, tags=None):
    api.Metric.send(
        metric=metric_name,
        points=value,
        tags=tags or []
    )

# Check ExaBGP
if check_exabgp_running():
    send_metric('exabgp.up', 1, tags=['env:prod'])
else:
    send_metric('exabgp.up', 0, tags=['env:prod'])

Process Monitoring

Systemd Monitoring

Systemd service with automatic restart:

# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10
User=exabgp
Group=exabgp

# Monitoring
StandardOutput=append:/var/log/exabgp.log
StandardError=append:/var/log/exabgp.log

[Install]
WantedBy=multi-user.target

Monitor service state:

# Check status
systemctl status exabgp

# Monitor restarts
journalctl -u exabgp -f | grep -i restart
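
Where systemd manages the unit, restart counts can also be read directly from systemd. A minimal sketch using the NRestarts property (assumes a reasonably recent systemd that exposes it):

#!/usr/bin/env python3
# Read systemd's restart counter for the exabgp unit (sketch).
# Assumes a systemd version that exposes the NRestarts property.
import subprocess

def get_restart_count(unit='exabgp'):
    """Return how many times systemd has restarted the unit, or None."""
    result = subprocess.run(
        ['systemctl', 'show', unit, '--property=NRestarts'],
        capture_output=True, text=True)
    # Output looks like "NRestarts=3"
    if result.returncode == 0 and result.stdout.startswith('NRestarts='):
        value = result.stdout.strip().split('=', 1)[1]
        if value.isdigit():
            return int(value)
    return None

if __name__ == '__main__':
    print(f"exabgp restarts: {get_restart_count()}")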

Process Watchdog

Simple watchdog script:

#!/bin/bash
# ExaBGP watchdog - restarts ExaBGP if the process has died
# Intended to be run periodically from cron (the schedule sets the check interval)

if ! pgrep -f exabgp > /dev/null; then
    echo "$(date): ExaBGP not running, restarting..." | tee -a /var/log/exabgp-watchdog.log
    systemctl restart exabgp
fi

Run it from cron every five minutes:

*/5 * * * * /usr/local/bin/exabgp-watchdog.sh

BGP Session Monitoring

Router-Based Monitoring

Query router for BGP session state:

Cisco (SNMP):

#!/usr/bin/env python3
"""
Check BGP session state via SNMP
"""
from pysnmp.hlapi import *

def check_bgp_session(router_ip, neighbor_ip, community='public'):
    """
    Query BGP session state via SNMP
    Returns: True if established, False otherwise
    """
    # BGP peer state OID
    oid = ObjectIdentity('1.3.6.1.2.1.15.3.1.2.' + neighbor_ip)

    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((router_ip, 161)),
               ContextData(),
               ObjectType(oid))
    )

    if errorIndication or errorStatus:
        return False

    # BGP state: 6 = Established
    state = int(varBinds[0][1])
    return state == 6

# Check session
if check_bgp_session('192.168.1.1', '192.168.1.2'):
    print("BGP session established")
else:
    print("BGP session down!")

Log-Based Monitoring

Parse ExaBGP logs for session state:

#!/bin/bash
# Check BGP session from logs

LOG="/var/log/exabgp.log"
NEIGHBOR="192.168.1.1"

# Check for recent "neighbor up" message
if tail -100 "$LOG" | grep -q "neighbor $NEIGHBOR up"; then
    echo "OK: BGP session to $NEIGHBOR established"
    exit 0
else
    echo "CRITICAL: BGP session to $NEIGHBOR not established"
    exit 2
fi
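
The grep above only looks for a recent "up" line, so a later "down" can be missed. A slightly more careful sketch takes whichever state was logged last for the neighbor (the "neighbor <ip> up/down" wording is the same assumption as above - verify it against your actual log output):

#!/usr/bin/env python3
# Report BGP session state from the last up/down message in the log (sketch).
# The "neighbor <ip> up/down" wording is an assumption - adjust to your logs.
import sys

LOG = '/var/log/exabgp.log'
NEIGHBOR = '192.168.1.1'

def last_session_state(path, neighbor):
    """Return 'up', 'down', or None depending on the last message seen."""
    state = None
    with open(path) as f:
        for line in f:
            if f'neighbor {neighbor} up' in line:
                state = 'up'
            elif f'neighbor {neighbor} down' in line:
                state = 'down'
    return state

if __name__ == '__main__':
    state = last_session_state(LOG, NEIGHBOR)
    if state == 'up':
        print(f"OK: BGP session to {NEIGHBOR} established")
        sys.exit(0)
    print(f"CRITICAL: BGP session to {NEIGHBOR} not established (last state: {state})")
    sys.exit(2)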

Route Monitoring

Track Announced Routes

Monitor route announcements:

#!/usr/bin/env python3
"""
Monitor routes announced by ExaBGP
Parse logs and track counts
"""
import re
import time
from collections import defaultdict

routes_announced = defaultdict(int)
routes_withdrawn = defaultdict(int)

def parse_log_line(line):
    """Parse log line for route announcements/withdrawals"""
    # Match: announce route 100.10.0.100/32
    announce_match = re.search(r'announce route ([\d\.]+/\d+)', line)
    if announce_match:
        prefix = announce_match.group(1)
        routes_announced[prefix] += 1
        return ('announce', prefix)

    # Match: withdraw route 100.10.0.100/32
    withdraw_match = re.search(r'withdraw route ([\d\.]+/\d+)', line)
    if withdraw_match:
        prefix = withdraw_match.group(1)
        routes_withdrawn[prefix] += 1
        return ('withdraw', prefix)

    return None

def export_route_metrics():
    """Placeholder: push the counters to your metrics system (Prometheus, etc.)"""
    pass

def monitor_routes():
    """Follow the log and report route changes"""
    with open('/var/log/exabgp.log', 'r') as f:
        # Seek to end so only new lines are processed
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                result = parse_log_line(line)
                if result:
                    action, prefix = result
                    print(f"[{action.upper()}] {prefix}")

                    # Export metrics (Prometheus, etc.)
                    export_route_metrics()
            else:
                # No new data yet - wait before polling again
                time.sleep(1)

if __name__ == '__main__':
    monitor_routes()

Router-Side Verification

Verify routes on router:

#!/bin/bash
# Check if expected routes are on router

ROUTER="192.168.1.1"
EXPECTED_ROUTES=("100.10.0.100" "100.10.0.101" "100.10.0.102")

for route in "${EXPECTED_ROUTES[@]}"; do
    # SSH to router and check
    if ssh $ROUTER "show ip bgp $route" | grep -q "BGP routing table entry"; then
        echo "OK: Route $route present"
    else
        echo "CRITICAL: Route $route missing!"
    fi
done

Performance Metrics

CPU and Memory

Monitor resource usage:

#!/usr/bin/env python3
import psutil
import subprocess

def get_exabgp_pid():
    """Get ExaBGP process PID"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True, text=True)
    if result.returncode == 0:
        return int(result.stdout.strip().split('\n')[0])
    return None

def get_process_metrics(pid):
    """Get CPU and memory usage"""
    try:
        process = psutil.Process(pid)
        return {
            'cpu_percent': process.cpu_percent(interval=1),
            'memory_mb': process.memory_info().rss / 1024 / 1024,
            'num_threads': process.num_threads(),
        }
    except psutil.NoSuchProcess:
        return None

pid = get_exabgp_pid()
if pid:
    metrics = get_process_metrics(pid)
    if metrics:
        print(f"CPU: {metrics['cpu_percent']}%")
        print(f"Memory: {metrics['memory_mb']:.1f} MB")
        print(f"Threads: {metrics['num_threads']}")

API Process Metrics

Monitor API process health:

import subprocess

def check_api_process():
    """Check if API healthcheck process is running"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True)
    return result.returncode == 0

def count_api_processes():
    """Count number of API processes"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True, text=True)
    if result.stdout:
        return len(result.stdout.strip().split('\n'))
    return 0
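
These checks can be folded into the Prometheus exporter shown earlier. A small sketch (the exabgp_api_* metric names are illustrative, and the functions are the ones defined just above):

# Sketch: expose the API process checks as Prometheus gauges,
# reusing check_api_process() and count_api_processes() from above.
from prometheus_client import Gauge

exabgp_api_up = Gauge('exabgp_api_up', 'API process status (1=up, 0=down)')
exabgp_api_processes = Gauge('exabgp_api_processes', 'Number of API processes')

def update_api_metrics():
    exabgp_api_up.set(1 if check_api_process() else 0)
    exabgp_api_processes.set(count_api_processes())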

Alerting

Alert Conditions

When to alert:

Critical:

  • ExaBGP process down
  • BGP session down
  • No routes announced (expected routes missing)
  • API process crashed

Warning:

  • High CPU usage (> 80%)
  • High memory usage (> 90%)
  • Route flapping
  • BGP session flaps

Info:

  • Route changes
  • Process restarts
  • Configuration reloads

Alert Methods

1. Email Alerts:

import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    """Send email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'exabgp@example.com'
    msg['To'] = 'ops@example.com'

    smtp = smtplib.SMTP('localhost')
    smtp.send_message(msg)
    smtp.quit()

# Example usage (check_exabgp_running() is the helper defined in the exporter above)
if not check_exabgp_running():
    send_email_alert(
        "CRITICAL: ExaBGP Down",
        "ExaBGP process is not running on server-1"
    )

2. Slack Alerts:

import requests
import time

def send_slack_alert(message, severity='warning'):
    """Send Slack alert"""
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

    colors = {
        'critical': '#FF0000',
        'warning': '#FFA500',
        'info': '#0000FF'
    }

    payload = {
        "attachments": [{
            "color": colors.get(severity, '#808080'),
            "title": f"ExaBGP Alert - {severity.upper()}",
            "text": message,
            "footer": "ExaBGP Monitoring",
            "ts": int(time.time())
        }]
    }

    requests.post(webhook_url, json=payload)

# Example
send_slack_alert("BGP session to 192.168.1.1 down!", severity='critical')

3. PagerDuty:

import requests

def send_pagerduty_alert(description, severity='error'):
    """Trigger PagerDuty incident"""
    url = "https://events.pagerduty.com/v2/enqueue"

    payload = {
        "routing_key": "YOUR_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": description,
            "severity": severity,  # critical, error, warning, info
            "source": "exabgp-monitor",
        }
    }

    requests.post(url, json=payload)

# Example
send_pagerduty_alert("ExaBGP process down!", severity='critical')
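
The three channels above can be tied back to the alert conditions with a small dispatcher. A sketch that routes by severity, reusing the send_* helpers defined above (the routing policy itself is just an example):

# Sketch: route an alert to the appropriate channel(s) based on severity,
# using the send_email_alert / send_slack_alert / send_pagerduty_alert helpers above.
def dispatch_alert(message, severity='warning'):
    if severity == 'critical':
        send_pagerduty_alert(message, severity='critical')  # page on-call
        send_slack_alert(message, severity='critical')      # and notify the channel
        send_email_alert(f"CRITICAL: {message}", message)
    elif severity == 'warning':
        send_slack_alert(message, severity='warning')
    else:
        send_slack_alert(message, severity='info')

# Example
dispatch_alert("BGP session to 192.168.1.1 down!", severity='critical')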

Log Monitoring

Log Rotation

Configure logrotate (copytruncate is used because systemd keeps the log file open through the append: redirection shown above):

# /etc/logrotate.d/exabgp
/var/log/exabgp.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}

Log Analysis

Search for errors:

# Critical errors
grep -i "error\|critical\|fatal" /var/log/exabgp.log

# BGP session changes
grep "neighbor.*up\|neighbor.*down" /var/log/exabgp.log

# Route changes
grep "announce\|withdraw" /var/log/exabgp.log | tail -20

Centralized Logging

Ship logs to ELK/Splunk:

Filebeat config (/etc/filebeat/filebeat.yml):

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/exabgp.log
  fields:
    service: exabgp
    environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "exabgp-%{+yyyy.MM.dd}"

Dashboard Examples

Grafana Dashboard

Example panels:

Panel 1: ExaBGP Status

  • Metric: exabgp_up
  • Visualization: Stat
  • Thresholds: 1 = green, 0 = red

Panel 2: BGP Sessions

  • Metric: exabgp_bgp_session_up (one series per neighbor label)
  • Visualization: Table
  • Show all neighbors with status

Panel 3: Routes Announced

  • Metric: exabgp_routes_announced
  • Visualization: Graph
  • Time series of routes

Panel 4: CPU Usage

  • Metric: rate(process_cpu_seconds_total{job="exabgp"}[5m])
  • Visualization: Graph

Panel 5: Memory Usage

  • Metric: process_resident_memory_bytes{job="exabgp"}
  • Visualization: Graph

Sample Grafana JSON

{
  "dashboard": {
    "title": "ExaBGP Monitoring",
    "panels": [
      {
        "title": "ExaBGP Status",
        "targets": [
          {
            "expr": "exabgp_up"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Routes Announced",
        "targets": [
          {
            "expr": "exabgp_routes_announced"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

Best Practices

1. Monitor from Multiple Perspectives

✅ ExaBGP process health (on ExaBGP server)
✅ BGP session state (on router via SNMP/SSH)
✅ Route presence (on router)
✅ End-to-end connectivity (client perspective)

2. Set Appropriate Thresholds

Avoid alert fatigue:

# Good thresholds
CPU_WARNING = 80%
CPU_CRITICAL = 95%

MEMORY_WARNING = 80%
MEMORY_CRITICAL = 90%

BGP_SESSION_DOWN_THRESHOLD = 2 checks (dampening)
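
A minimal sketch of the dampening idea: only raise an alert after the same check has failed a set number of consecutive times, so one missed poll does not page anyone (the threshold mirrors the value above):

# Sketch: simple dampening - alert only after N consecutive failures.
FAILURE_THRESHOLD = 2  # matches BGP_SESSION_DOWN_THRESHOLD above

class DampenedCheck:
    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Return True only when the failure threshold has just been crossed."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures == self.threshold

# Example: alert fires once, on the second consecutive failure
session_check = DampenedCheck()
for ok in [True, False, False, False]:
    if session_check.record(ok):
        print("ALERT: BGP session down for 2 consecutive checks")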

3. Monitor Trends

Track over time:

  • Route announcement rate
  • BGP session uptime
  • Resource usage trends
  • Error rates

4. Implement Health Checks

Synthetic monitoring:

#!/bin/bash
# End-to-end health check

# Check if service IP responds
if curl -sf http://100.10.0.100/health > /dev/null; then
    echo "OK: Service responding"
else
    echo "CRITICAL: Service not responding"
fi

5. Document Normal Baselines

Know what's normal:

  • Typical CPU usage: 5-10%
  • Typical memory: 50-100 MB
  • Expected routes: 10
  • BGP session uptime: > 30 days
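
One way to make these baselines actionable is to encode them and flag deviations. A small sketch (the numbers are the illustrative baselines above, not universal values):

# Sketch: compare current readings against documented baselines.
# The ranges below are illustrative - record your own normal values.
BASELINES = {
    'cpu_percent': (0, 10),   # typical 5-10%
    'memory_mb': (0, 100),    # typical 50-100 MB
    'routes': (10, 10),       # expected route count
}

def check_baselines(current):
    """Return human-readable descriptions of values outside their baseline range."""
    problems = []
    for name, (low, high) in BASELINES.items():
        value = current.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} outside baseline {low}-{high}")
    return problems

# Example
print(check_baselines({'cpu_percent': 35, 'memory_mb': 80, 'routes': 10}))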

Next Steps

Learn More

Tools


Ready to set up monitoring? See Quick Start →


👻 Ghost written by Claude (Anthropic AI)
