Monitoring and Observability Guide
This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
Overview
Sankofa Phoenix uses a comprehensive monitoring stack:
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Alertmanager: Alert routing and notification
Tenant-Aware Metrics
All metrics are tagged with tenant IDs for multi-tenant isolation.
Metric Naming Convention
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
Examples:
sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}
sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}
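To make the convention concrete, here is a minimal Python sketch that checks a metric name against the `sankofa_<component>_<metric>_<unit>` shape and renders a series selector with `tenant_id` first. The regular expression and the `format_series` helper are illustrative assumptions; the guide does not specify exact character rules, and label values are not escaped here.

```python
import re

# Assumed reading of sankofa_<component>_<metric>_<unit>: lowercase words
# joined by underscores, with at least two words after the prefix.
METRIC_NAME = re.compile(r"^sankofa_[a-z][a-z0-9]*(_[a-z][a-z0-9]*)+$")


def format_series(name: str, tenant_id: str, **labels: str) -> str:
    """Render a selector with tenant_id first, then other labels sorted by name.

    Hypothetical helper for illustration only.
    """
    if not METRIC_NAME.match(name):
        raise ValueError(f"metric name does not follow the convention: {name}")
    pairs = [("tenant_id", tenant_id)] + sorted(labels.items())
    body = ",".join(f'{key}="{value}"' for key, value in pairs)
    return f"{name}{{{body}}}"


print(format_series("sankofa_api_requests_total", "tenant-1", method="POST", status="200"))
```

Putting `tenant_id` first is only a readability choice; Prometheus does not care about label order.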
Grafana Dashboards
1. System Overview Dashboard
Location: grafana/dashboards/system-overview.json
Metrics:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)
Panels:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate
2. Tenant Dashboard
Location: grafana/dashboards/tenant-overview.json
Metrics:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity
Panels:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant
3. Billing Dashboard
Location: grafana/dashboards/billing.json
Metrics:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies
Panels:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts
4. Proxmox Infrastructure Dashboard
Location: grafana/dashboards/proxmox-infrastructure.json
Metrics:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate
Panels:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events
5. Security Dashboard
Location: grafana/dashboards/security.json
Metrics:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events
Panels:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline
Prometheus Configuration
Scrape Configs
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'
  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
Recording Rules
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])
      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])
      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
Alerting Rules
Critical Alerts
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
      - alert: BudgetExceeded
        # Cost carries service/resource_id labels that the budget metric lacks,
        # so sum cost per tenant and match on tenant_id only.
        expr: sum(sankofa_billing_cost_usd) by (tenant_id) / on (tenant_id) sankofa_billing_budget_usd > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
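Each rule above pairs a threshold with a `for:` duration, so the condition must hold continuously before the alert fires. The Python sketch below is a simplified model of that behavior, not Prometheus's actual evaluator; `first_firing_time` and the sample values are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the first timestamp at which the condition has held for hold_seconds.

    samples are (unix_seconds, value) pairs in time order, one per evaluation.
    A single evaluation at or below the threshold resets the pending state.
    """
    pending_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= hold_seconds:
                return timestamp
        else:
            pending_since = None
    return None


# Hypothetical pool utilization sampled every 30s, checked like
# DatabaseConnectionPoolExhausted (> 0.9 for 2m).
pool_ratio = [(0, 0.95), (30, 0.96), (60, 0.50), (90, 0.95),
              (120, 0.97), (150, 0.93), (180, 0.92), (210, 0.99)]
print(first_firing_time(pool_ratio, threshold=0.9, hold_seconds=120))  # 210
```

The dip at 60 seconds resets the pending window, which is why the alert fires at 210 rather than 120.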
Billing Anomaly Detection
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
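The anomaly expression compares the current cost with a linear extrapolation of the last seven days and flags a relative deviation above 50%. The following Python sketch reproduces that arithmetic with an ordinary least-squares fit. It approximates what `predict_linear` does rather than porting Prometheus's implementation, and the helper names and sample data are hypothetical.

```python
from typing import Sequence, Tuple


def predict_linear(
    samples: Sequence[Tuple[float, float]],
    eval_time: float,
    horizon_seconds: float,
) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated at
    eval_time + horizon_seconds."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


def is_cost_anomaly(current: float, predicted: float, threshold: float = 0.5) -> bool:
    """Mirror (current - predicted) / predicted > threshold.

    Assumption: a zero prediction is treated as "no anomaly" here, whereas
    PromQL would produce +Inf or NaN for the division.
    """
    if predicted == 0:
        return False
    return (current - predicted) / predicted > threshold


# Hypothetical hourly cost samples in USD: 100, 110, 120.
history = [(0, 100.0), (3600, 110.0), (7200, 120.0)]
predicted = predict_linear(history, eval_time=7200, horizon_seconds=3600)
print(round(predicted, 2), is_cost_anomaly(200.0, predicted))
```

With this data the fitted line predicts roughly 130 USD, so a current cost of 200 USD is about 54% over the prediction and crosses the 0.5 threshold.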
Real-Time Cost Tracking
Metrics Exposed
- sankofa_billing_cost_usd{tenant_id, service, resource_id}: Current cost
- sankofa_billing_cost_rate_usd_per_hour{tenant_id}: Cost rate
- sankofa_billing_budget_usd{tenant_id}: Budget limit
- sankofa_billing_budget_utilization_percent{tenant_id}: Budget usage %
Grafana Query Example
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)
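# Cost trend (7 days): projected weekly spend from the last hour's rate
# rate() returns a per-second value, so scale by the seconds in a week
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7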
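# Budget utilization (%) per tenant
# Sum cost per tenant first: the cost series carry service/resource_id labels the budget series lack
sum(sankofa_billing_cost_usd) by (tenant_id) / on (tenant_id) sankofa_billing_budget_usd * 100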
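Outside Grafana, the same queries can be sent to Prometheus's HTTP API (`/api/v1/query`). The Python sketch below only builds the request URL and parses an instant-vector response, so it runs without network access. The base URL comes from the Access section; the helper names and the sample response body are made up for illustration.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_tenant(response_body: str) -> Dict[str, float]:
    """Map tenant_id to sample value from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("tenant_id", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


url = build_query_url(
    "https://prometheus.sankofa.nexus",
    "sum(sankofa_billing_cost_usd) by (tenant_id)",
)

# Hypothetical response body in the shape Prometheus returns for a vector.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"tenant_id": "tenant-1"}, "value": [1700000000, "412.5"]},
            {"metric": {"tenant_id": "tenant-2"}, "value": [1700000000, "98.0"]},
        ],
    },
})
print(url)
print(values_by_tenant(sample_body))
```

Prometheus encodes sample values as strings, hence the `float()` conversion.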
Log Aggregation
Loki Configuration
Logs are collected with tenant context:
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
Log Labels
- tenant_id: Tenant identifier
- service: Service name (api, portal, etc.)
- level: Log level (info, warn, error)
- component: Component name
Log Queries
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}
# API errors in the last hour (set the query time range to "Last 1 hour"; LogQL has no now() function)
{service="api", level="error"} | json
# Authentication failures
{component="auth"} | json | status="failed"
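When log queries are generated programmatically, building the stream selector from the labels listed above keeps tenant filtering consistent. This is a minimal Python sketch under that assumption; `logql_selector` is a hypothetical helper, not part of any Loki client library.

```python
def logql_selector(**labels: str) -> str:
    """Build a LogQL stream selector with equality matchers, sorted by label name.

    Backslashes and double quotes in values are escaped so that the result
    stays a valid quoted string.
    """
    if not labels:
        raise ValueError("a LogQL selector needs at least one label matcher")
    matchers = []
    for name in sorted(labels):
        value = labels[name].replace("\\", "\\\\").replace('"', '\\"')
        matchers.append(f'{name}="{value}"')
    return "{" + ", ".join(matchers) + "}"


print(logql_selector(tenant_id="tenant-1", level="error"))
```

Sorting the labels is only for deterministic output; Loki treats the matchers as an unordered set.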
Deployment
Install Monitoring Stack
# Add Prometheus Operator Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml
# Apply custom dashboards
kubectl apply -f grafana/dashboards/
Import Dashboards
# Import all dashboards
for dashboard in grafana/dashboards/*.json; do
  # Quote the expansions so paths containing spaces do not split
  kubectl create configmap "$(basename "$dashboard" .json)" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
done
Access
- Grafana: https://grafana.sankofa.nexus
- Prometheus: https://prometheus.sankofa.nexus
- Alertmanager: https://alertmanager.sankofa.nexus
Default credentials (change immediately):
- Username: admin
- Password: (from secret monitoring-grafana)
Best Practices
- Tenant Isolation: Always filter metrics by tenant_id
- Retention: Configure appropriate retention periods
- Cardinality: Avoid high-cardinality labels
- Alerts: Set up alerting for critical metrics
- Dashboards: Create tenant-specific dashboards
- Cost Tracking: Monitor billing metrics closely
- Anomaly Detection: Enable anomaly detection for billing
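The cardinality advice is easier to reason about with numbers: the worst-case series count for one metric is the product of the distinct values of each label. The Python sketch below illustrates this with made-up label counts; none of the figures come from a real Sankofa Phoenix deployment.

```python
from math import prod
from typing import Mapping


def worst_case_series(distinct_values: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(distinct_values.values())


# Hypothetical label counts for sankofa_api_requests_total.
bounded = {"tenant_id": 50, "method": 5, "status": 8}
print(worst_case_series(bounded))  # 2000

# Adding an unbounded label such as a per-request ID multiplies the total.
unbounded = {**bounded, "request_id": 10_000}
print(worst_case_series(unbounded))  # 20000000
```

Real series counts are usually lower because not every combination occurs, but the product shows how quickly a single unbounded label dominates.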
References
- Dashboard definitions: grafana/dashboards/
- Prometheus config: monitoring/prometheus/
- Alert rules: monitoring/alerts/