SkyLink Security Monitoring Guide

Comprehensive monitoring with Prometheus alerting and Grafana dashboards

1. Overview
2. Quick Start
3. Architecture
4. Prometheus Configuration
5. Alert Rules
6. Grafana Dashboards
7. Metrics Reference
8. Operational Procedures
9. Troubleshooting

1. Overview

SkyLink includes a production-ready monitoring stack designed for security observability:

Component	Purpose	Port
Prometheus	Metrics collection, alerting rules	9090
Grafana	Visualization, dashboards	3000

Key Features

Pre-configured security alerts for authentication failures, rate limiting, and errors
Auto-provisioned Grafana dashboards (no manual setup required)
Declarative configuration (infrastructure as code)
Docker Compose profile for optional deployment

2. Quick Start

Start Monitoring Stack

# Start all services including monitoring
docker compose --profile monitoring up -d

# Or start monitoring separately
docker compose --profile monitoring up -d prometheus grafana

Access Dashboards

Dashboard	URL	Credentials
Grafana	http://localhost:3000	admin / admin (default; override with GRAFANA_PASSWORD)
Prometheus	http://localhost:9090	(no auth)

Verify Setup

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# Check Grafana health
curl -s http://localhost:3000/api/health

3. Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         MONITORING STACK                                 │
│                                                                          │
│  ┌─────────────────┐         ┌─────────────────┐                        │
│  │     GRAFANA     │◄────────│   PROMETHEUS    │                        │
│  │     :3000       │  Query  │     :9090       │                        │
│  │                 │         │                 │                        │
│  │  ┌───────────┐  │         │  ┌───────────┐  │                        │
│  │  │ Dashboard │  │         │  │  Alerts   │  │                        │
│  │  │ Security  │  │         │  │ security  │  │                        │
│  │  └───────────┘  │         │  │ .yml      │  │                        │
│  └─────────────────┘         └────────┬────────┘                        │
│                                       │                                  │
│                              Scrape /metrics                            │
│                                       │                                  │
│         ┌─────────────────────────────┼─────────────────────────────┐   │
│         │                             │                             │   │
│         ▼                             ▼                             ▼   │
│  ┌─────────────┐              ┌─────────────┐              ┌───────────┐│
│  │   Gateway   │              │  Telemetry  │              │  Weather  ││
│  │    :8000    │              │    :8001    │              │   :8002   ││
│  │  /metrics   │              │  /metrics   │              │ /metrics  ││
│  └─────────────┘              └─────────────┘              └───────────┘│
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow

Services expose metrics at the /metrics endpoint in the Prometheus text format
Prometheus scrapes the SkyLink services every 10 seconds and itself every 15 seconds
Alert rules are evaluated every 15 seconds and fire once their condition has held for the configured duration
Grafana queries Prometheus for visualization
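
To make the first step of the data flow concrete, the sketch below parses a small, hypothetical /metrics payload and sums http_requests_total per status label. The sample text and the simplified regular expressions are assumptions for illustration only; real tooling should rely on a proper exposition-format parser rather than this minimal one.

```python
import re
from collections import defaultdict

# Hypothetical sample of what a service's /metrics endpoint might return.
SAMPLE_METRICS = """\
# HELP http_requests_total Total number of requests by method, status and handler.
# TYPE http_requests_total counter
http_requests_total{handler="/weather/current",method="GET",status="200"} 42.0
http_requests_total{handler="/weather/current",method="GET",status="429"} 3.0
http_requests_total{handler="/telemetry/health",method="GET",status="401"} 5.0
"""

# Simplified patterns: they assume label values contain no '}' or escaped quotes.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def requests_by_status(text: str) -> dict[str, float]:
    """Sum http_requests_total samples per 'status' label value."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != "http_requests_total":
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("status", "unknown")] += float(match.group("value"))
    return dict(totals)


print(requests_by_status(SAMPLE_METRICS))
```

The counters are cumulative, which is why the alert rules and dashboard queries wrap them in rate() instead of reading the raw values.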

4. Prometheus Configuration

Configuration File

Location: monitoring/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  - job_name: 'skylink-gateway'
    static_configs:
      - targets: ['gateway:8000']
    metrics_path: /metrics
    scrape_interval: 10s

  # Excerpt: the other jobs are listed under Scrape Targets below

Scrape Targets

Job	Target	Interval
skylink-gateway	gateway:8000	10s
skylink-telemetry	telemetry:8001	10s
skylink-weather	weather:8002	10s
skylink-contacts	contacts:8003	10s
prometheus	localhost:9090	15s

Verify Targets

# List all targets and their health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
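
The same check can be scripted. The following is a minimal sketch, assuming the response shape of the Prometheus /api/v1/targets endpoint; the function names and the sample payload are hypothetical, and fetch_json is defined but not called so the block runs without a live stack.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode its JSON body (only needed against a running stack)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def unhealthy_targets(payload: dict) -> list[dict]:
    """Return job, health and lastError for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", "unknown"),
            "health": target.get("health", "unknown"),
            "lastError": target.get("lastError", ""),
        }
        for target in targets
        if target.get("health") != "up"
    ]


# Hypothetical response, trimmed to the fields used above.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "skylink-gateway"}, "health": "up", "lastError": ""},
            {
                "labels": {"job": "skylink-weather"},
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

print(unhealthy_targets(SAMPLE_RESPONSE))
```

Against a running stack, the call would look like unhealthy_targets(fetch_json("http://localhost:9090/api/v1/targets")); an empty list means every target is up.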

5. Alert Rules

Security Alerts

Location: monitoring/prometheus/alerts/security.yml

Alert	Severity	Condition	Description
HighAuthFailureRate	warning	>0.1 401/s for 2min	Potential brute force
SustainedAuthFailures	critical	>0.05 401/s for 10min	Sustained attack
mTLSValidationFailures	critical	Any 403 for 1min	mTLS bypass attempt
RateLimitAbuse	warning	>1 429/s for 5min	Rate limit abuse
RateLimitFlood	critical	>10 429/s for 2min	DDoS/flood attack
HighErrorRate	critical	>5% 5xx for 5min	Service degradation
ServiceDown	critical	up == 0 for 1min	Service unavailable

View Active Alerts

# Check firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Check all alert rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

Alert Rule Example

- alert: HighAuthFailureRate
  expr: sum(rate(http_requests_total{status="401"}[5m])) > 0.1
  for: 2m
  labels:
    severity: warning
    category: authentication
  annotations:
    summary: "High authentication failure rate detected"
    description: "Auth failures: {{ $value | printf \"%.2f\" }}/s"
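
The interaction between the rate() window and the for: clause can be illustrated with a simplified model. The sketch below is an assumption-laden approximation, not Prometheus's implementation: simple_rate ignores counter resets and the extrapolation that the real rate() performs, and alert_state only mimics the inactive, pending, and firing progression. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds
    value: float      # cumulative counter value


def simple_rate(samples: list[Sample]) -> float:
    """Per-second increase between the first and last sample in a window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1].timestamp - samples[0].timestamp
    if elapsed <= 0:
        return 0.0
    return (samples[-1].value - samples[0].value) / elapsed


def alert_state(
    evaluations: list[tuple[float, float]], threshold: float, for_seconds: float
) -> str:
    """Return the alert state after the last (time, rate) evaluation.

    The alert turns pending when the rate first exceeds the threshold and
    firing once it has stayed above it for at least for_seconds.
    """
    active_since = None
    for timestamp, rate in evaluations:
        if rate > threshold:
            if active_since is None:
                active_since = timestamp
        else:
            active_since = None  # condition cleared, the timer resets
    if active_since is None:
        return "inactive"
    held_for = evaluations[-1][0] - active_since
    return "firing" if held_for >= for_seconds else "pending"


# 36 additional 401 responses over a 5-minute window: 36 / 300 = 0.12 per second.
window = [Sample(0, 100), Sample(300, 136)]
rate = simple_rate(window)

# Evaluate every 15 seconds with the rate held constant, as in HighAuthFailureRate.
evaluations = [(t, rate) for t in range(0, 121, 15)]
print(alert_state(evaluations[:-1], threshold=0.1, for_seconds=120))
print(alert_state(evaluations, threshold=0.1, for_seconds=120))
```

The practical consequence is that a short burst of 401 responses leaves the alert pending at most; only a rate sustained for the full two minutes makes it fire.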

6. Grafana Dashboards

Pre-configured Dashboard

SkyLink Security Dashboard (UID: skylink-security)

Access: http://localhost:3000/d/skylink-security

Dashboard Panels

Row 1: Authentication & Authorization

Panel	Type	Metric
Auth Success Rate	Gauge	% of non-401 responses
Client Errors by Status	Pie	Distribution of 4xx
Authentication Failures	Time Series	401/403 rate
mTLS Failures (1h)	Stat	Total 403 count

Row 2: API Security

Panel	Type	Metric
Rate Limited Requests	Time Series	429 rate
Security Responses	Stacked Area	401/403/429

Row 3: Performance & Latency

Panel	Type	Metric
Request Latency	Time Series	p50/p95/p99
Error Rate (5xx)	Gauge	Server error %
Request Rate	Time Series	Total RPS

Row 4: Service Health

Panel	Type	Metric
Service Status	Stat	UP/DOWN per service
Request Rate by Service	Bar	RPS per job

Dashboard Provisioning

Dashboards are auto-provisioned from:

monitoring/grafana/dashboards/security.json

Configuration in:

monitoring/grafana/provisioning/dashboards/dashboards.yml

7. Metrics Reference

HTTP Metrics (from prometheus-fastapi-instrumentator)

Metric	Type	Labels	Description
`http_requests_total`	Counter	method, status, handler	Total requests
`http_request_duration_seconds`	Histogram	method, handler	Request latency
`http_requests_inprogress`	Gauge	method, handler	In-flight requests

Security-Relevant Queries

# Authentication failure rate
sum(rate(http_requests_total{status="401"}[5m]))

# mTLS validation failures
sum(rate(http_requests_total{status="403"}[5m]))

# Rate limit hits
sum(rate(http_requests_total{status="429"}[5m]))

# Error rate percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Request rate by service
sum by (job) (rate(http_requests_total[5m]))
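
These expressions can also be evaluated outside the Prometheus UI through the instant query endpoint. The sketch below assumes Prometheus is reachable at http://localhost:9090 as in Quick Start; the helper names are hypothetical, and run_query is defined but not called so the block runs offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default port from Quick Start


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def first_value(payload: dict) -> Optional[float]:
    """Return the value of the first series in an instant-vector result.

    Returns None for an empty result, which is what a panel shows as "No data".
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    _timestamp, value = result[0]["value"]  # the API returns the value as a string
    return float(value)


def run_query(expr: str) -> Optional[float]:
    """Execute an instant query against a live Prometheus (not called here)."""
    with urlopen(build_query_url(expr), timeout=5) as response:
        return first_value(json.load(response))


# Hypothetical responses, trimmed to the fields used above.
with_data = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1700000000.0, "0.12"]}]},
}
no_data = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(build_query_url('sum(rate(http_requests_total{status="401"}[5m]))'))
print(first_value(with_data), first_value(no_data))
```

An empty result is not an error: a sum() over a selector that matches no series returns nothing, so None here usually means no such responses were recorded.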

8. Operational Procedures

Starting Monitoring

# Start with the application
docker compose --profile monitoring up -d

# Start monitoring only (if app is already running)
docker compose --profile monitoring up -d prometheus grafana

Stopping Monitoring

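# Stop monitoring only (the application keeps running)
docker compose --profile monitoring stop prometheus grafana

# Stop everything (application and monitoring)
docker compose --profile monitoring down

# Stop everything and delete the data volumes (Prometheus and Grafana data is lost)
docker compose --profile monitoring down -v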

Updating Alert Rules

Edit monitoring/prometheus/alerts/security.yml

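Reload Prometheus configuration (the reload endpoint is only available when Prometheus runs with --web.enable-lifecycle; otherwise restart the container):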

curl -X POST http://localhost:9090/-/reload

Verify rules loaded:

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules | length'

Adding New Dashboards

Create the dashboard JSON in monitoring/grafana/dashboards/
The dashboard is auto-provisioned within 30 seconds
Alternatively, restart Grafana to load it immediately: docker compose restart grafana

Backup and Restore

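# Backup Prometheus data (volume names may carry a Compose project prefix; see docker volume ls)
docker run --rm -v prometheus_data:/data -v "$(pwd)":/backup alpine tar czf /backup/prometheus-backup.tar.gz /data

# Backup Grafana data
docker run --rm -v grafana_data:/data -v "$(pwd)":/backup alpine tar czf /backup/grafana-backup.tar.gz /data

# Restore: stop the service first, then unpack the archive back into /data
docker run --rm -v prometheus_data:/data -v "$(pwd)":/backup alpine tar xzf /backup/prometheus-backup.tar.gz -C /
docker run --rm -v grafana_data:/data -v "$(pwd)":/backup alpine tar xzf /backup/grafana-backup.tar.gz -C /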

9. Troubleshooting

Prometheus Not Scraping

# Check target health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'

# Common issues:
# - Service not running: check docker compose ps
# - Network issue: ensure services are on skylink-net
# - /metrics not exposed: check service configuration

Grafana Dashboard Not Loading

# Check Grafana logs
docker compose logs grafana | tail -50

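# Verify datasource (the Grafana API requires authentication)
curl -s -u "admin:${GRAFANA_PASSWORD:-admin}" http://localhost:3000/api/datasources | jq '.[].name'

# Common issues:
# - Prometheus not reachable: check prometheus container
# - Dashboard JSON invalid: validate with jq empty monitoring/grafana/dashboards/security.json
# - Datasource UID mismatch: see below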

Dashboard Shows "No data"

Expected behavior for security panels:

Some panels may show "No data" when there are no security events. This is normal:

Panel	Shows "No data" when...
Authentication Failures	No 401 errors (good!)
mTLS Failures	No 403 errors (good!)
Rate Limited Requests	No 429 errors (good!)
Client Errors by Status	No 4xx errors (good!)
Security Responses	No 401/403/429 errors

Panels that should always show data:

Auth Success Rate: Shows 100% when no 401 errors
Error Rate (5xx): Shows 0% when no 5xx errors
Request Latency: Shows latency if any traffic exists
Request Rate: Shows RPS if any traffic exists
Service Status: Shows UP/DOWN for all services

Datasource UID Mismatch:

If ALL panels show "No data", check the datasource UID:

# Check datasource UID in Grafana (the API requires authentication)
curl -s -u "admin:${GRAFANA_PASSWORD:-admin}" http://localhost:3000/api/datasources | jq '.[0].uid'

# Should return: "prometheus"
# If different, the dashboard won't find the datasource

The datasource provisioning file must include an explicit UID:

# monitoring/grafana/provisioning/datasources/prometheus.yml
datasources:
  - name: Prometheus
    uid: prometheus  # <-- Must match dashboard references
    type: prometheus
    ...

To generate test security events:

# Generate 401 errors (invalid token)
curl -s http://localhost:8000/telemetry/health -H "Authorization: Bearer invalid"

# Generate 429 errors (exceed rate limit)
for i in $(seq 1 70); do
  curl -s "http://localhost:8000/weather/current?lat=48&lon=2" \
    -H "Authorization: Bearer $TOKEN" > /dev/null
done

Alerts Not Firing

# Check alert rule status
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, health: .health, lastError: .lastError}'

# Check if metrics exist
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total' | jq '.data.result | length'

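# Common issues:
# - No traffic: alerts need traffic to evaluate
# - Wrong expression: test in Prometheus UI
# - Grouped status codes: prometheus-fastapi-instrumentator groups codes as "4xx"/"5xx" by default,
#   so status="401" only matches when should_group_status_codes is disabled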
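
The rule status check above can be turned into a small filter that prints only the rules needing attention. This is a sketch assuming the field names of the Prometheus /api/v1/rules response (type, name, health, lastError); the function name and sample payload are hypothetical.

```python
def rule_problems(payload: dict) -> list[str]:
    """List alerting rules whose last evaluation was not healthy."""
    problems = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # recording rules are out of scope here
            health = rule.get("health", "unknown")
            if health != "ok":
                problems.append(
                    f"{rule.get('name', '?')}: health={health} "
                    f"lastError={rule.get('lastError', '')}"
                )
    return problems


# Hypothetical response, trimmed to the fields used above.
SAMPLE_RULES = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "security",
                "rules": [
                    {"type": "alerting", "name": "HighAuthFailureRate", "health": "ok"},
                    {
                        "type": "alerting",
                        "name": "HighErrorRate",
                        "health": "err",
                        "lastError": "parse error",
                    },
                ],
            }
        ]
    },
}

for problem in rule_problems(SAMPLE_RULES):
    print(problem)
```

A rule that is healthy but never fires points to the expression or the traffic rather than to rule loading, which is where the metric existence check comes in.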

High Memory Usage

# Check Prometheus memory
docker stats prometheus

# Reduce retention (default 15d)
# Add to prometheus command: --storage.tsdb.retention.time=7d

Appendix A: Environment Variables

Variable	Default	Description
`GRAFANA_PASSWORD`	admin	Grafana admin password

Appendix B: Ports Reference

Service	Internal	External	Protocol
Gateway	8000	8000	HTTP
Telemetry	8001	-	HTTP
Weather	8002	-	HTTP
Contacts	8003	-	HTTP
Prometheus	9090	9090	HTTP
Grafana	3000	3000	HTTP

Appendix C: Kubernetes Monitoring

For Kubernetes deployments, the Helm chart includes a ServiceMonitor for Prometheus Operator integration:

# values.yaml
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    labels:
      release: prometheus  # Must match Prometheus Operator selector

Kubernetes Metrics Endpoints:

Service	Path	Port
Gateway	/metrics	8000
Telemetry	/metrics	8001
Weather	/metrics	8002
Contacts	/metrics	8003

See KUBERNETES.md for full deployment guide.

Document maintained as part of SkyLink Security by Design implementation.
