Monitoring
Production monitoring for ExaBGP deployments
Monitor health, performance, and BGP state - ensure high availability
- Overview
- What to Monitor
- Monitoring Tools
- Process Monitoring
- BGP Session Monitoring
- Route Monitoring
- Performance Metrics
- Alerting
- Log Monitoring
- Dashboard Examples
- Best Practices
Production monitoring verifies that:
- The ExaBGP process is healthy
- BGP sessions stay established
- Routes are being announced correctly
- Performance is acceptable
- Issues are detected early
1. Process Health
- Is ExaBGP running?
- Process uptime
- CPU usage
- Memory usage
- Process restarts
2. BGP Session State
- Session established?
- Session uptime
- Session flaps (up/down events)
- Keepalive/hold-time
3. Route Announcements
- Number of routes announced
- Number of routes withdrawn
- Route changes per minute
- Active routes count
4. API Process Health
- API process running?
- API process restarts
- API command rate
- API errors
5. System Health
- Network connectivity
- Disk space
- System load
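Several of the items above are rates (route changes per minute, API command rate). As a minimal sketch, a sliding-window counter can turn individual events into such a rate. The `RateCounter` class and its method names are hypothetical helpers for illustration, not part of ExaBGP.

```python
import time
from collections import deque


class RateCounter:
    """Count events that fall inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # event timestamps, oldest first

    def record(self, now=None):
        """Register one event, for example one announce or withdraw."""
        self._events.append(time.monotonic() if now is None else now)

    def count(self, now=None):
        """Return the number of events inside the window ending at `now`."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)
```

Calling `record()` for every route change and `count()` at each scrape gives a "changes in the last minute" figure that could be exported as a gauge.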
Most popular stack for modern monitoring
ExaBGP → node_exporter → Prometheus → Grafana
         (metrics)       (storage)    (visualization)
1. Install node_exporter:
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
# Run
./node_exporter &

2. Create custom metrics exporter for ExaBGP:
#!/usr/bin/env python3
"""
ExaBGP Prometheus exporter
Exposes metrics on :9101/metrics (node_exporter already uses :9100)
"""
from prometheus_client import start_http_server, Gauge, Counter, Info
import subprocess
import time

# Define metrics
exabgp_up = Gauge('exabgp_up', 'ExaBGP process status (1=up, 0=down)')
exabgp_bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session status', ['neighbor'])
exabgp_routes_announced = Gauge('exabgp_routes_announced', 'Number of routes announced')
exabgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn')
exabgp_process_restarts = Counter('exabgp_process_restarts_total', 'Total process restarts')
exabgp_info = Info('exabgp', 'ExaBGP version information')

_was_running = None  # previous process state, used to detect restarts


def check_exabgp_running():
    """Check if ExaBGP process is running"""
    # Note: `pgrep -f` matches full command lines, so this exporter's own
    # file name should not contain "exabgp" (or tighten the pattern).
    try:
        result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False


def get_bgp_session_state(neighbor_ip):
    """Check BGP session state (example - adapt to your setup)"""
    # This would query your router or parse ExaBGP logs
    # For demo, return True
    return True


def update_metrics():
    """Update all metrics"""
    global _was_running
    # Process status
    running = check_exabgp_running()
    exabgp_up.set(1 if running else 0)
    # Count a restart on a down -> up transition, not on every failed check
    if running and _was_running is False:
        exabgp_process_restarts.inc()
    _was_running = running

    # BGP session status (example)
    neighbors = ['192.168.1.1', '192.168.1.2']
    for neighbor in neighbors:
        state = get_bgp_session_state(neighbor)
        exabgp_bgp_session_up.labels(neighbor=neighbor).set(1 if state else 0)


def update_version_info():
    """Version info (one-time)"""
    try:
        result = subprocess.run(['exabgp', '--version'], capture_output=True, text=True)
        exabgp_info.info({'version': result.stdout.strip()})
    except OSError:
        pass


if __name__ == '__main__':
    # Start metrics server
    start_http_server(9101)  # Metrics on :9101/metrics
    print("ExaBGP Prometheus exporter started on :9101/metrics")
    update_version_info()
    while True:
        update_metrics()
        time.sleep(15)  # Update every 15 seconds

3. Configure Prometheus (/etc/prometheus/prometheus.yml):
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'exabgp-server-1'

4. Create Grafana dashboard (see Dashboard Examples)
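The exporter's `get_bgp_session_state` is left as a placeholder. One possible way to fill it, sketched here under a loud assumption, is to scan log lines for the most recent up or down event. The `neighbor <ip> up` / `neighbor <ip> down` wording is taken from the grep patterns used elsewhere on this page and may not match the log format of your ExaBGP version. The function name `last_session_state` is hypothetical.

```python
import re


def last_session_state(log_lines, neighbor_ip):
    """Return True (up), False (down), or None if no event was seen.

    Assumes lines shaped like "... neighbor 192.168.1.1 up"; adjust the
    pattern to whatever your ExaBGP version actually logs.
    """
    pattern = re.compile(
        r"neighbor\s+" + re.escape(neighbor_ip) + r"\s+(up|down)\b"
    )
    state = None
    for line in log_lines:
        match = pattern.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last event wins
            state = match.group(1) == "up"
    return state
```

Returning `None` when nothing matches lets the caller distinguish "unknown" from "down" instead of reporting a false outage.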
Traditional monitoring tools
Check script:
#!/bin/bash
# Nagios check for ExaBGP
# Note: `pgrep -f` matches full command lines, so a script named
# check_exabgp.sh would match itself; rename it or tighten the pattern.

# Check if ExaBGP running
if ! pgrep -f exabgp > /dev/null; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check BGP session (example - adjust for your setup)
# Parse exabgp logs or query router

echo "OK: ExaBGP running"
exit 0

Nagios config:
define service {
    use                     generic-service
    host_name               exabgp-server
    service_description     ExaBGP Process
    check_command           check_exabgp
    check_interval          1
}
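The check script relies on the Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a rough sketch, the same mapping can be written in Python so that an inconclusive check reports UNKNOWN rather than CRITICAL. The `evaluate` and `format_status` names, and the use of `None` for "could not check", are assumptions for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(process_running, session_up):
    """Map two check results to (exit_code, message).

    Each argument is True, False, or None when the check could not run.
    """
    if process_running is None:
        return UNKNOWN, "could not determine process state"
    if not process_running:
        return CRITICAL, "ExaBGP not running"
    if session_up is None:
        return UNKNOWN, "could not determine BGP session state"
    if not session_up:
        return CRITICAL, "BGP session not established"
    return OK, "ExaBGP running, BGP session established"


def format_status(code, message):
    """Build the single status line Nagios displays."""
    return f"{LABELS[code]}: {message}"
```

A plugin would print `format_status(code, message)` and then pass `code` to `sys.exit()`.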
SaaS monitoring platforms
Datadog custom check:
from datadog import initialize, api
import subprocess

# Initialize
initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

def check_exabgp_running():
    """Check if ExaBGP process is running"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    return result.returncode == 0

# Send metric
def send_metric(metric_name, value, tags=None):
    api.Metric.send(
        metric=metric_name,
        points=value,
        tags=tags or []
    )

# Check ExaBGP
if check_exabgp_running():
    send_metric('exabgp.up', 1, tags=['env:prod'])
else:
    send_metric('exabgp.up', 0, tags=['env:prod'])

Systemd service with automatic restart:
# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10
User=exabgp
Group=exabgp
# Monitoring
StandardOutput=append:/var/log/exabgp.log
StandardError=append:/var/log/exabgp.log
[Install]
WantedBy=multi-user.target

Monitor service state:
# Check status
systemctl status exabgp
# Monitor restarts
journalctl -u exabgp -f | grep -i restart

Simple watchdog script:
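#!/bin/bash
# ExaBGP watchdog - restarts the service if it is not active
# Single-shot: cron provides the schedule, so the script must not loop,
# otherwise every cron run would leave another copy running forever.
# Uses systemctl instead of `pgrep -f exabgp`, which would also match
# this script's own name (exabgp-watchdog.sh).
if ! systemctl is-active --quiet exabgp; then
    echo "$(date): ExaBGP not running, restarting..." | tee -a /var/log/exabgp-watchdog.log
    systemctl restart exabgp
fi

Run via cron:
*/5 * * * * /usr/local/bin/exabgp-watchdog.sh

Query router for BGP session state: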
Cisco (SNMP):
#!/usr/bin/env python3
"""
Check BGP session state via SNMP
(written against the synchronous pysnmp 4.x hlapi)
"""
from pysnmp.hlapi import *

def check_bgp_session(router_ip, neighbor_ip, community='public'):
    """
    Query BGP session state via SNMP
    Returns: True if established, False otherwise
    """
    # BGP peer state OID (BGP4-MIB bgpPeerState, indexed by peer address)
    oid = ObjectIdentity('1.3.6.1.2.1.15.3.1.2.' + neighbor_ip)
    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((router_ip, 161)),
               ContextData(),
               ObjectType(oid))
    )
    if errorIndication or errorStatus:
        return False
    # BGP state: 6 = Established
    state = int(varBinds[0][1])
    return state == 6

# Check session
if check_bgp_session('192.168.1.1', '192.168.1.2'):
    print("BGP session established")
else:
    print("BGP session down!")

Parse ExaBGP logs for session state:
#!/bin/bash
# Check BGP session from logs
# Heuristic only: a long-lived session may have no recent "up" line,
# so widen the window or combine this with a router-side check.
LOG="/var/log/exabgp.log"
NEIGHBOR="192.168.1.1"

# Check for recent "neighbor up" message
if tail -100 "$LOG" | grep -q "neighbor $NEIGHBOR up"; then
    echo "OK: BGP session to $NEIGHBOR established"
    exit 0
else
    echo "CRITICAL: BGP session to $NEIGHBOR not established"
    exit 2
fi

Monitor route announcements:
#!/usr/bin/env python3
"""
Monitor routes announced by ExaBGP
Parse logs and track counts
"""
import re
import time
from collections import defaultdict

routes_announced = defaultdict(int)
routes_withdrawn = defaultdict(int)

def parse_log_line(line):
    """Parse log line for route announcements/withdrawals"""
    # Match: announce route 100.10.0.100/32
    announce_match = re.search(r'announce route ([\d\.]+/\d+)', line)
    if announce_match:
        prefix = announce_match.group(1)
        routes_announced[prefix] += 1
        return ('announce', prefix)
    # Match: withdraw route 100.10.0.100/32
    withdraw_match = re.search(r'withdraw route ([\d\.]+/\d+)', line)
    if withdraw_match:
        prefix = withdraw_match.group(1)
        routes_withdrawn[prefix] += 1
        return ('withdraw', prefix)
    return None

def export_route_metrics():
    """Export metrics (Prometheus, etc.) - placeholder, adapt to your setup"""
    pass

def monitor_routes():
    """Monitor route changes"""
    with open('/var/log/exabgp.log', 'r') as f:
        # Seek to end
        f.seek(0, 2)
        while True:
            line = f.readline()
            if line:
                result = parse_log_line(line)
                if result:
                    action, prefix = result
                    print(f"[{action.upper()}] {prefix}")
                    export_route_metrics()
            else:
                # No new data yet; wait before polling again
                time.sleep(1)

if __name__ == '__main__':
    monitor_routes()

Verify routes on router:
#!/bin/bash
# Check if expected routes are on router
ROUTER="192.168.1.1"
EXPECTED_ROUTES=("100.10.0.100" "100.10.0.101" "100.10.0.102")

for route in "${EXPECTED_ROUTES[@]}"; do
    # SSH to router and check
    if ssh "$ROUTER" "show ip bgp $route" | grep -q "BGP routing table entry"; then
        echo "OK: Route $route present"
    else
        echo "CRITICAL: Route $route missing!"
    fi
done

Monitor resource usage:
#!/usr/bin/env python3
import psutil
import subprocess

def get_exabgp_pid():
    """Get ExaBGP process PID"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True, text=True)
    if result.returncode == 0:
        return int(result.stdout.strip().split('\n')[0])
    return None

def get_process_metrics(pid):
    """Get CPU and memory usage"""
    try:
        process = psutil.Process(pid)
        return {
            'cpu_percent': process.cpu_percent(interval=1),
            'memory_mb': process.memory_info().rss / 1024 / 1024,
            'num_threads': process.num_threads(),
        }
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return None

pid = get_exabgp_pid()
# The process may exit between the pgrep call and the psutil lookup
metrics = get_process_metrics(pid) if pid else None
if metrics:
    print(f"CPU: {metrics['cpu_percent']}%")
    print(f"Memory: {metrics['memory_mb']:.1f} MB")
    print(f"Threads: {metrics['num_threads']}")

Monitor API process health:
import subprocess

def check_api_process():
    """Check if API healthcheck process is running"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True)
    return result.returncode == 0

def count_api_processes():
    """Count number of API processes"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'], capture_output=True, text=True)
    if result.stdout:
        return len(result.stdout.strip().split('\n'))
    return 0

When to alert:
Critical:
- ExaBGP process down
- BGP session down
- No routes announced (expected routes missing)
- API process crashed
Warning:
- High CPU usage (> 80%)
- High memory usage (> 90%)
- Route flapping
- BGP session flaps
Info:
- Route changes
- Process restarts
- Configuration reloads
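The severity levels above can be turned into a small classification step. The following is only a sketch: the sample dictionary keys and the function names are hypothetical, and the two thresholds are the CPU and memory values from the Warning list.

```python
CPU_WARNING_PERCENT = 80     # from the Warning list above
MEMORY_WARNING_PERCENT = 90  # from the Warning list above


def evaluate_sample(sample):
    """Return a list of (severity, reason) tuples for one monitoring sample."""
    alerts = []
    if not sample["process_up"]:
        alerts.append(("critical", "ExaBGP process down"))
    if not sample["session_up"]:
        alerts.append(("critical", "BGP session down"))
    if sample["routes_announced"] < sample["routes_expected"]:
        alerts.append(("critical", "expected routes missing"))
    if sample["cpu_percent"] > CPU_WARNING_PERCENT:
        alerts.append(("warning", "high CPU usage"))
    if sample["memory_percent"] > MEMORY_WARNING_PERCENT:
        alerts.append(("warning", "high memory usage"))
    return alerts


def worst_severity(alerts):
    """Return the highest severity present, or None when there are no alerts."""
    for level in ("critical", "warning"):
        if any(severity == level for severity, _ in alerts):
            return level
    return None
```

The result of `worst_severity` can then decide which of the notification channels below to use, for example paging only on `critical`.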
1. Email Alerts:
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    """Send email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'exabgp@example.com'
    msg['To'] = 'ops@example.com'
    # The context manager closes the connection even if sending fails
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(msg)

# Example usage (check_exabgp_running() is the helper from the exporter above)
if not check_exabgp_running():
    send_email_alert(
        "CRITICAL: ExaBGP Down",
        "ExaBGP process is not running on server-1"
    )

2. Slack Alerts:
import time
import requests

def send_slack_alert(message, severity='warning'):
    """Send Slack alert"""
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    colors = {
        'critical': '#FF0000',
        'warning': '#FFA500',
        'info': '#0000FF'
    }
    payload = {
        "attachments": [{
            "color": colors.get(severity, '#808080'),
            "title": f"ExaBGP Alert - {severity.upper()}",
            "text": message,
            "footer": "ExaBGP Monitoring",
            "ts": int(time.time())
        }]
    }
    # A timeout keeps the monitor from hanging if Slack is unreachable
    requests.post(webhook_url, json=payload, timeout=10)

# Example
send_slack_alert("BGP session to 192.168.1.1 down!", severity='critical')

3. PagerDuty:
import requests

def send_pagerduty_alert(description, severity='error'):
    """Trigger PagerDuty incident"""
    url = "https://events.pagerduty.com/v2/enqueue"
    payload = {
        "routing_key": "YOUR_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": description,
            "severity": severity,  # critical, error, warning, info
            "source": "exabgp-monitor",
        }
    }
    # A timeout keeps the monitor from hanging if PagerDuty is unreachable
    requests.post(url, json=payload, timeout=10)

# Example
send_pagerduty_alert("ExaBGP process down!", severity='critical')

Configure logrotate:
# /etc/logrotate.d/exabgp
/var/log/exabgp.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    # systemd keeps the log file open (StandardOutput=append:) and the unit
    # above defines no ExecReload, so truncate in place instead of reloading
    copytruncate
}
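Log rotation also affects any long-running script that follows the log file. By default logrotate renames the file, and with `copytruncate` it empties the file in place. A minimal sketch of a follower-side check that handles both cases is shown here; the function name `reopen_if_rotated` is hypothetical and the handle is assumed to be opened in binary mode.

```python
import os


def reopen_if_rotated(path, handle):
    """Return a binary handle positioned on unread data of the live log file."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # Old file was renamed and the new one does not exist yet
        return handle
    opened = os.fstat(handle.fileno())
    if (on_disk.st_ino, on_disk.st_dev) != (opened.st_ino, opened.st_dev):
        # A different file now lives at this path: rotation by rename
        handle.close()
        return open(path, "rb")
    if on_disk.st_size < handle.tell():
        # Same file but shorter than our position: truncated in place
        handle.seek(0)
    return handle
```

A follower loop would call this whenever `readline()` returns no data, before sleeping and retrying.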
Search for errors:
# Critical errors
grep -i "error\|critical\|fatal" /var/log/exabgp.log
# BGP session changes
grep "neighbor.*up\|neighbor.*down" /var/log/exabgp.log
# Route changes
grep "announce\|withdraw" /var/log/exabgp.log | tail -20

Ship logs to ELK/Splunk:
Filebeat config (/etc/filebeat/filebeat.yml):
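filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/exabgp.log
    fields:
      service: exabgp
      environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "exabgp-%{+yyyy.MM.dd}"

# A custom index name also requires matching template settings
setup.template.name: "exabgp"
setup.template.pattern: "exabgp-*"

Example panels: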
Panel 1: ExaBGP Status
- Metric: exabgp_up
- Visualization: Stat
- Thresholds: 1 = green, 0 = red
Panel 2: BGP Sessions
- Metric: exabgp_bgp_session_up (one series per neighbor label)
- Visualization: Table
- Show all neighbors with status
Panel 3: Routes Announced
- Metric: exabgp_routes_announced
- Visualization: Graph
- Time series of routes
Panel 4: CPU Usage
- Metric: rate(process_cpu_seconds_total{job="exabgp"}[5m])
- Visualization: Graph
Panel 5: Memory Usage
- Metric: process_resident_memory_bytes{job="exabgp"}
- Visualization: Graph
Note: the process_* series that prometheus_client exposes by default describe the exporter process itself. To graph ExaBGP's own CPU and memory, export the psutil values from the resource usage script as gauges.
{
  "dashboard": {
    "title": "ExaBGP Monitoring",
    "panels": [
      {
        "title": "ExaBGP Status",
        "targets": [
          {
            "expr": "exabgp_up"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Routes Announced",
        "targets": [
          {
            "expr": "exabgp_routes_announced"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

- ExaBGP process health (on ExaBGP server)
- BGP session state (on router via SNMP/SSH)
- Route presence (on router)
- End-to-end connectivity (client perspective)
Avoid alert fatigue:
# Good thresholds
CPU_WARNING = 80%
CPU_CRITICAL = 95%
MEMORY_WARNING = 80%
MEMORY_CRITICAL = 90%
BGP_SESSION_DOWN_THRESHOLD = 2 checks (dampening)

Track over time:
- Route announcement rate
- BGP session uptime
- Resource usage trends
- Error rates
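Tracking check results over time is also what makes the `BGP_SESSION_DOWN_THRESHOLD = 2 checks` dampening rule possible: an alert fires only after that many consecutive failures. A minimal sketch follows, with the `Dampener` class as a hypothetical helper.

```python
class Dampener:
    """Fire an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0  # consecutive failures seen so far

    def update(self, check_ok):
        """Feed one check result; return True when an alert should fire."""
        if check_ok:
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, at the moment the threshold is reached
        return self.failures == self.threshold
```

Firing only at the moment the threshold is reached avoids repeating the same alert on every subsequent failed check; a successful check resets the counter.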
Synthetic monitoring:
#!/bin/bash
# End-to-end health check
# Check if service IP responds
if curl -sf http://100.10.0.100/health > /dev/null; then
    echo "OK: Service responding"
else
    echo "CRITICAL: Service not responding"
fi

Know what's normal:
- Typical CPU usage: 5-10%
- Typical memory: 50-100 MB
- Expected routes: 10
- BGP session uptime: > 30 days
- Debugging - Troubleshooting guide
- Service HA - HA patterns
- API Overview - API integration
- Prometheus - Metrics collection
- Grafana - Visualization
- Nagios - Traditional monitoring
Ready to set up monitoring? See Quick Start.
Ghost written by Claude (Anthropic AI)