Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
2025-12-12 18:01:35 -08:00
parent e01131efaf
commit 9daf1fd378
968 changed files with 160890 additions and 1092 deletions
--- a/docs/runbooks/INCIDENT_RESPONSE.md
+++ b/docs/runbooks/INCIDENT_RESPONSE.md
@@ -0,0 +1,319 @@
+# Incident Response Runbook
+
+## Overview
+
+This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.
+
+## Incident Severity Levels
+
+### P0 - Critical (Immediate Response)
+- Complete service outage
+- Data loss or corruption
+- Security breach
+- **Response Time**: Immediate (< 5 minutes)
+- **Resolution Target**: < 1 hour
+
+### P1 - High (Urgent Response)
+- Partial service outage affecting multiple users
+- Performance degradation > 50%
+- Authentication failures
+- **Response Time**: < 15 minutes
+- **Resolution Target**: < 4 hours
+
+### P2 - Medium (Standard Response)
+- Single feature/service degraded
+- Performance degradation 20-50%
+- Non-critical errors
+- **Response Time**: < 1 hour
+- **Resolution Target**: < 24 hours
+
+### P3 - Low (Normal Response)
+- Minor issues
+- Cosmetic problems
+- Non-blocking errors
+- **Response Time**: < 4 hours
+- **Resolution Target**: < 1 week
+
+## Incident Response Process
+
+### 1. Detection and Triage
+
+#### Detection Sources
+- **Monitoring Alerts**: Prometheus/Alertmanager
+- **Error Logs**: Loki, application logs
+- **User Reports**: Support tickets, status page
+- **Health Checks**: Automated health check failures
+
+#### Initial Triage Steps
+```bash
+# 1. Check service health (list pods that are neither Running nor Completed)
+kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'
+
+# 2. Check API health
+curl -f https://api.sankofa.nexus/health || echo "API DOWN"
+
+# 3. Check portal health
+curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"
+
+# 4. Check database connectivity
+# (sh -c makes $DATABASE_URL expand inside the pod, not in your local shell)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"' || echo "DB CONNECTION FAILED"
+
+# 5. Check Keycloak
+curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
+```
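+
+The HTTP checks above can be wrapped so that a single command reports every failing endpoint. This is a minimal sketch, not an existing platform script: `check_endpoint` and `run_http_triage` are hypothetical helper names, and the 10-second timeout is an assumption.
+
+```bash
+# check_endpoint is a hypothetical wrapper around the curl checks above.
+# Prints "OK <name>" or "DOWN <name>" and returns non-zero when the check fails.
+check_endpoint() {
+  local name="$1" url="$2"
+  if curl -fsS --max-time 10 -o /dev/null "$url"; then
+    echo "OK $name"
+  else
+    echo "DOWN $name"
+    return 1
+  fi
+}
+
+# Run every HTTP check and report how many failed.
+run_http_triage() {
+  local failures=0
+  check_endpoint api      https://api.sankofa.nexus/health        || failures=$((failures + 1))
+  check_endpoint portal   https://portal.sankofa.nexus/api/health || failures=$((failures + 1))
+  check_endpoint keycloak https://keycloak.sankofa.nexus/health   || failures=$((failures + 1))
+  echo "$failures endpoint(s) failing"
+  [ "$failures" -eq 0 ]
+}
+```
+
+Running `run_http_triage` exits non-zero when any endpoint is down, so it can also gate a script or a CI step.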
+
+### 2. Incident Declaration
+
+#### Create Incident Channel
+- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
+- Invite: On-call engineer, Team lead, Product owner
+- Post initial status
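+
+To keep channel names consistent under pressure, the naming pattern can be generated rather than typed. A minimal sketch, assuming a hypothetical helper named `incident_channel_name`; creating the channel itself is still done in Slack/Teams.
+
+```bash
+# Build a channel name of the form #incident-YYYY-MM-DD-<name>.
+# incident_channel_name is a hypothetical helper, not part of the platform.
+incident_channel_name() {
+  local name="$1"
+  local day="${2:-$(date -u +%Y-%m-%d)}"   # optional date override, defaults to today (UTC)
+  local slug
+  # Lowercase, collapse runs of non-alphanumerics into one hyphen, trim edge hyphens
+  slug=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' \
+    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
+  printf '#incident-%s-%s\n' "$day" "$slug"
+}
+```
+
+For example, `incident_channel_name "API Outage" 2025-12-12` prints `#incident-2025-12-12-api-outage`.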
+
+#### Incident Template
+```
+INCIDENT: [Brief Description]
+SEVERITY: P0/P1/P2/P3
+STATUS: Investigating/Identified/Monitoring/Resolved
+START TIME: [Timestamp]
+AFFECTED SERVICES: [List]
+IMPACT: [User impact description]
+```
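+
+The template can also be filled from the command line and pasted into the incident channel. This is a sketch under assumptions: `render_incident` is a hypothetical helper, and the start time is taken as the current UTC time.
+
+```bash
+# Print a filled-in incident announcement.
+# render_incident is a hypothetical helper; its arguments are:
+#   description, severity, status, affected services, impact
+render_incident() {
+  printf 'INCIDENT: %s\n' "$1"
+  printf 'SEVERITY: %s\n' "$2"
+  printf 'STATUS: %s\n' "$3"
+  printf 'START TIME: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+  printf 'AFFECTED SERVICES: %s\n' "$4"
+  printf 'IMPACT: %s\n' "$5"
+}
+```
+
+Usage: `render_incident "API returning 502" P1 Investigating "api, portal" "Logins failing for all tenants"`.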
+
+### 3. Investigation
+
+#### Common Investigation Commands
+
+**Check Pod Status**
+```bash
+kubectl get pods --all-namespaces -o wide
+kubectl describe pod <pod-name> -n <namespace>
+kubectl logs <pod-name> -n <namespace> --tail=100
+```
+
+**Check Resource Usage**
+```bash
+kubectl top nodes
+kubectl top pods --all-namespaces
+```
+
+**Check Database**
+```bash
+# Connection count ($DATABASE_URL is expanded inside the pod by sh -c)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"'
+
+# Long-running queries (the SQL is passed to the pod's shell as $1)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
+  "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '5 minutes';"
+```
+
+**Check Logs**
+```bash
+# Recent errors
+kubectl logs -n api deployment/api --tail=500 | grep -i error
+
+# Authentication failures (last hour)
+kubectl logs -n api deployment/api --since=1h | grep -i "auth.*fail"
+
+# Rate limiting (last hour)
+kubectl logs -n api deployment/api --since=1h | grep -i "rate limit"
+```
+
+**Check Monitoring**
+```bash
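+# Access Grafana (open is macOS-specific; use xdg-open on Linux)
+open https://grafana.sankofa.nexus
+
+# List configured Prometheus alert rules (currently firing alerts are shown in Alertmanager)
+kubectl get prometheusrules -n monitoring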
+```
+
+### 4. Resolution
+
+#### Common Resolution Actions
+
+**Restart Service**
+```bash
+kubectl rollout restart deployment/api -n api
+kubectl rollout restart deployment/portal -n portal
+```
+
+**Scale Up**
+```bash
+kubectl scale deployment/api --replicas=5 -n api
+```
+
+**Rollback Deployment**
+```bash
+# See ROLLBACK_PLAN.md for detailed procedures
+kubectl rollout undo deployment/api -n api
+```
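+
+A rollback is only finished once the previous revision is serving again. The following is a minimal sketch that records the revision history, undoes the rollout, and waits for it to settle. `rollback_and_verify` is a hypothetical helper and the 120-second timeout is an assumption; ROLLBACK_PLAN.md remains the authoritative procedure.
+
+```bash
+# rollback_and_verify is a hypothetical helper: undo the latest rollout,
+# then block until the previous revision is fully rolled out (or time out).
+rollback_and_verify() {
+  local deployment="$1" namespace="$2"
+  # Show revision history first so the rollback target is on record
+  kubectl rollout history "deployment/$deployment" -n "$namespace"
+  kubectl rollout undo "deployment/$deployment" -n "$namespace" || return 1
+  # Exits non-zero if the rollout has not completed within the timeout
+  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
+}
+```
+
+Usage: `rollback_and_verify api api`.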
+
+**Clear Rate Limits** (if needed)
+```bash
+# Access Redis/rate limit store and clear keys
+# Or restart rate limit service
+kubectl rollout restart deployment/rate-limit -n api
+```
+
+**Database Maintenance**
+```bash
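+# Vacuum database ($DATABASE_URL is expanded inside the pod by sh -c)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "VACUUM ANALYZE;"'
+
+# Kill queries active for more than 10 minutes (excludes this session).
+# Review the long-running query list first; terminated transactions are rolled back.
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
+  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND pid <> pg_backend_pid() AND now() - query_start > interval '10 minutes';"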
+```
+
+### 5. Post-Incident
+
+#### Incident Report Template
+```markdown
+# Incident Report: [Date] - [Title]
+
+## Summary
+[Brief description of incident]
+
+## Timeline
+- [Time] - Incident detected
+- [Time] - Investigation started
+- [Time] - Root cause identified
+- [Time] - Resolution implemented
+- [Time] - Service restored
+
+## Root Cause
+[Detailed root cause analysis]
+
+## Impact
+- **Users Affected**: [Number]
+- **Duration**: [Time]
+- **Services Affected**: [List]
+
+## Resolution
+[Steps taken to resolve]
+
+## Prevention
+- [ ] Action item 1
+- [ ] Action item 2
+- [ ] Action item 3
+
+## Follow-up
+- [ ] Update monitoring/alerts
+- [ ] Update runbooks
+- [ ] Code changes needed
+- [ ] Documentation updates
+```
+
+## Common Incidents
+
+### API High Latency
+
+**Symptoms**: API response times > 500ms
+
+**Investigation**:
+```bash
+# Check database query performance (longest-running active queries first)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
+  "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
+
+# Check API metrics
+curl https://api.sankofa.nexus/metrics | grep http_request_duration
+```
+
+**Resolution**:
+- Scale API replicas
+- Optimize slow queries
+- Add database indexes
+- Check for N+1 query problems
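+
+To compare the metrics endpoint against the 500ms symptom threshold, the histogram totals can be reduced to a mean. This is a rough sketch: `avg_latency_ms` is a hypothetical helper, and the metric name `http_request_duration_seconds` (with `_sum` and `_count` series) is an assumption based on common Prometheus naming. The result is the mean since the process started, so it lags behind a recent regression; Grafana percentiles are the better signal when available.
+
+```bash
+# avg_latency_ms is a hypothetical helper. It reads Prometheus text-format
+# metrics on stdin and prints the mean request duration in milliseconds.
+# Assumption: the histogram is named http_request_duration_seconds.
+avg_latency_ms() {
+  awk '
+    /^http_request_duration_seconds_sum/   { sum   += $NF }
+    /^http_request_duration_seconds_count/ { count += $NF }
+    END {
+      if (count > 0) printf "%.1f\n", (sum / count) * 1000
+      else           print "no samples"
+    }
+  '
+}
+```
+
+Usage: `curl -s https://api.sankofa.nexus/metrics | avg_latency_ms`.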
+
+### Database Connection Pool Exhausted
+
+**Symptoms**: "too many connections" errors
+
+**Investigation**:
+```bash
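+# Connections grouped by state ($DATABASE_URL is expanded inside the pod by sh -c)
+kubectl exec -it -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"'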
+```
+
+**Resolution**:
+- Increase connection pool size
+- Kill idle connections
+- Scale database
+- Check for connection leaks
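+
+Killing idle connections can be sketched as follows. `terminate_idle_sql` is a hypothetical helper that only prints the SQL, so it can be reviewed before it is run; the 10-minute default is an assumption.
+
+```bash
+# terminate_idle_sql is a hypothetical helper that prints SQL to terminate
+# sessions idle for longer than the given number of minutes (default 10).
+terminate_idle_sql() {
+  local minutes="${1:-10}"
+  # Accept whole numbers only, since the value is interpolated into SQL
+  case "$minutes" in
+    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
+  esac
+  cat <<SQL
+SELECT pg_terminate_backend(pid)
+FROM pg_stat_activity
+WHERE state = 'idle'
+  AND pid <> pg_backend_pid()
+  AND now() - state_change > interval '${minutes} minutes';
+SQL
+}
+```
+
+Usage: `terminate_idle_sql 15 | kubectl exec -i -n api deployment/api -- sh -c 'psql "$DATABASE_URL"'`. Here psql reads the statement from stdin, and `$DATABASE_URL` is expanded inside the pod.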
+
+### Authentication Failures
+
+**Symptoms**: Users cannot log in
+
+**Investigation**:
+```bash
+# Check Keycloak
+curl https://keycloak.sankofa.nexus/health
+kubectl logs -n keycloak deployment/keycloak --tail=100
+
+# Check API auth logs
+kubectl logs -n api deployment/api | grep -i "auth.*fail"
+```
+
+**Resolution**:
+- Restart Keycloak if needed
+- Check OIDC configuration
+- Verify JWT secret
+- Check network connectivity
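+
+One way to check the OIDC configuration is to fetch the realm's discovery document from Keycloak. A minimal sketch: `oidc_discovery_url` is a hypothetical helper and the realm name is not given in this runbook, so it must be supplied. Older Keycloak releases serve this path under an extra `/auth` prefix.
+
+```bash
+# oidc_discovery_url is a hypothetical helper. It builds the OIDC discovery
+# URL for a Keycloak realm; the realm name is an assumption you must supply.
+oidc_discovery_url() {
+  local base="${1%/}" realm="$2"   # strip one trailing slash from the base URL
+  printf '%s/realms/%s/.well-known/openid-configuration\n' "$base" "$realm"
+}
+```
+
+Usage: `curl -fsS "$(oidc_discovery_url https://keycloak.sankofa.nexus <realm>)"`, then compare the `issuer` and `jwks_uri` fields in the response with the values the API is configured to trust.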
+
+### Portal Not Loading
+
+**Symptoms**: Portal returns 500 or blank page
+
+**Investigation**:
+```bash
+# Check portal pods
+kubectl get pods -n portal
+kubectl logs -n portal deployment/portal --tail=100
+
+# Check portal health
+curl https://portal.sankofa.nexus/api/health
+```
+
+**Resolution**:
+- Restart portal deployment
+- Check environment variables
+- Verify Keycloak connectivity
+- Check build errors
+
+## Escalation
+
+### When to Escalate
+- P0 incident not resolved in 30 minutes
+- P1 incident not resolved in 2 hours
+- Need additional expertise
+- Customer impact is severe
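+
+The two time-based triggers above can be encoded so that an incident bot or timer script applies them consistently. This is a sketch only: `should_escalate` is a hypothetical helper, and the judgment-based triggers (missing expertise, severe customer impact) still need a human decision.
+
+```bash
+# should_escalate is a hypothetical helper encoding the time limits above.
+# Usage: should_escalate <severity> <minutes since incident start>
+# Returns 0 (true) when the time-based threshold has been reached.
+should_escalate() {
+  local severity="$1" elapsed="$2" limit
+  case "$severity" in
+    P0) limit=30 ;;    # 30 minutes
+    P1) limit=120 ;;   # 2 hours
+    *)  return 1 ;;    # P2/P3 have no time-based trigger in this runbook
+  esac
+  [ "$elapsed" -ge "$limit" ]
+}
+```
+
+Usage: `should_escalate P0 45 && echo "Escalate to Team Lead"`.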
+
+### Escalation Path
+1. **On-call Engineer** → Team Lead
+2. **Team Lead** → Engineering Manager
+3. **Engineering Manager** → CTO/VP Engineering
+4. **CTO** → Executive Team
+
+### Emergency Contacts
+- **On-call**: [Phone/Slack]
+- **Team Lead**: [Phone/Slack]
+- **Engineering Manager**: [Phone/Slack]
+- **CTO**: [Phone/Slack]
+
+## Communication
+
+### Status Page Updates
+- Update status page during incident
+- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
+- Include: Status, affected services, estimated resolution time
+
+### Customer Communication
+- For P0/P1: Notify affected customers immediately
+- For P2/P3: Include in next status update
+- Be transparent about impact and resolution timeline
+