# Monitoring and Observability Guide

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

## Overview

Sankofa Phoenix uses the following monitoring stack:

- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing and notification

## Tenant-Aware Metrics

Every metric carries a `tenant_id` label so that data can be filtered and isolated per tenant.

### Metric Naming Convention

```
sankofa_<service>_<metric>_<unit>{tenant_id="<tenant-id>",...}
```

Examples:

- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`

## Grafana Dashboards

### 1. System Overview Dashboard

**Location**: `grafana/dashboards/system-overview.json`

**Metrics**:

- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)

**Panels**:

- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate

### 2. Tenant Dashboard

**Location**: `grafana/dashboards/tenant-overview.json`

**Metrics**:

- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity

**Panels**:

- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant

### 3. Billing Dashboard

**Location**: `grafana/dashboards/billing.json`

**Metrics**:

- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies

**Panels**:

- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts

### 4. Proxmox Infrastructure Dashboard

**Location**: `grafana/dashboards/proxmox-infrastructure.json`

**Metrics**:

- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate

**Panels**:

- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events

### 5. Security Dashboard

**Location**: `grafana/dashboards/security.json`

**Metrics**:

- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events

**Panels**:

- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline

## Prometheus Configuration

### Scrape Configs

```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      # Keep only pods labeled app=api
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```

### Recording Rules

```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])

      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])

      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```

## Alerting Rules

### Critical Alerts

```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

      - alert: BudgetExceeded
        # Cost is exposed per service/resource and budget per tenant, so
        # aggregate cost to tenant level before dividing.
        expr: |
          sum by (tenant_id) (sankofa_billing_cost_usd)
            / on (tenant_id) sankofa_billing_budget_usd > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"

      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```

### Billing Anomaly Detection

```yaml
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
              - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```

## Real-Time Cost Tracking

### Metrics Exposed

- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %

### Grafana Query Example

```promql
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

# Cost trend: rate() is per second, so scale the last hour's rate to 7 days
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7

# Budget utilization (%), with cost aggregated to tenant level to match the budget labels
sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd * 100
```

## Log Aggregation

### Loki Configuration

Logs are collected with tenant context:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
```

### Log Labels

- `tenant_id`: Tenant identifier
- `service`: Service name (api, portal, etc.)
- `level`: Log level (info, warn, error)
- `component`: Component name

### Log Queries

```logql
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors in the last hour
# (LogQL has no now(); set the query time range to "Last 1 hour" instead)
{service="api", level="error"} | json

# Authentication failures
{component="auth"} | json | status="failed"
```

## Deployment

### Install Monitoring Stack

```bash
# Add the prometheus-community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# The files in grafana/dashboards/ are Grafana JSON, not Kubernetes manifests,
# so load them as ConfigMaps (see "Import Dashboards") rather than with kubectl apply.
```

### Import Dashboards

```bash
# Import all dashboards as ConfigMaps
for dashboard in grafana/dashboards/*.json; do
  name="$(basename "$dashboard" .json)"
  kubectl create configmap "$name" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Assumes the chart's default dashboard sidecar label; adjust if values.yaml overrides it
  kubectl label configmap "$name" grafana_dashboard=1 \
    --namespace=monitoring --overwrite
done
```

## Access

- **Grafana**: https://grafana.sankofa.nexus
- **Prometheus**: https://prometheus.sankofa.nexus
- **Alertmanager**: https://alertmanager.sankofa.nexus

Default credentials (change immediately):

- Username: `admin`
- Password: (from secret `monitoring-grafana`)

## Best Practices

1. **Tenant Isolation**: Always filter metrics by `tenant_id`
2. **Retention**: Configure appropriate retention periods for metrics and logs
3. **Cardinality**: Avoid high-cardinality labels (for example, per-request or per-user IDs)
4. **Alerts**: Set up alerting for critical metrics
5. **Dashboards**: Create tenant-specific dashboards
6. **Cost Tracking**: Monitor billing metrics closely
7. **Anomaly Detection**: Enable anomaly detection for billing

## References

- Dashboard definitions: `grafana/dashboards/`
- Prometheus config: `monitoring/prometheus/`
- Alert rules: `monitoring/alerts/`
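
The alerting rules in this guide set a `severity` label of `critical` or `warning`, but the guide does not show how Alertmanager routes them. The fragment below is a minimal sketch of one possible routing tree, not the project's actual configuration. The receiver names (`default`, `oncall-pager`, `team-chat`), the webhook URLs, and all timing values are hypothetical placeholders.

```yaml
# Sketch only: receiver names, URLs, and intervals are assumptions.
route:
  receiver: default                      # catch-all for unmatched alerts
  group_by: ['alertname', 'tenant_id']   # one notification per alert and tenant
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # severity="critical" (e.g. HighErrorRate, ProxmoxNodeDown)
    - matchers:
        - severity="critical"
      receiver: oncall-pager
    # severity="warning" (e.g. BudgetExceeded, CostAnomalyDetected)
    - matchers:
        - severity="warning"
      receiver: team-chat
      repeat_interval: 24h

receivers:
  - name: default
  - name: oncall-pager
    webhook_configs:
      - url: http://alert-gateway.example.internal/critical   # placeholder URL
  - name: team-chat
    webhook_configs:
      - url: http://alert-gateway.example.internal/warning    # placeholder URL
```

Grouping by `tenant_id` keeps notifications for different tenants separate, which matches the tenant isolation practice listed above.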