Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
319
docs/runbooks/INCIDENT_RESPONSE.md
Normal file
@@ -0,0 +1,319 @@
# Incident Response Runbook

## Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

## Incident Severity Levels

### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour

### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours

### P2 - Medium (Standard Response)
- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours

### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week
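
For tooling that needs these targets in machine-readable form, the table above can be sketched as a small lookup. The function name `severity_targets` and the choice of minutes as the unit are assumptions made for illustration; the numbers are the ones listed above.

```bash
#!/usr/bin/env bash
# Print "<response minutes> <resolution minutes>" for a severity level.
# severity_targets is a hypothetical helper, not part of the platform.
severity_targets() {
  case "$1" in
    P0) echo "5 60" ;;        # < 5 minutes, < 1 hour
    P1) echo "15 240" ;;      # < 15 minutes, < 4 hours
    P2) echo "60 1440" ;;     # < 1 hour, < 24 hours
    P3) echo "240 10080" ;;   # < 4 hours, < 1 week
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Calling `severity_targets P1` prints the P1 response and resolution targets in minutes; an unrecognized level returns a non-zero status.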

## Incident Response Process

### 1. Detection and Triage

#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures

#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
# Run psql through `sh -c` so $DATABASE_URL is expanded inside the container,
# not by the local shell
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"' || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
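
When the triage checks are run repeatedly, it can help to wrap each one so that every result is reported in the same shape. The following is a minimal sketch; `run_check` is a hypothetical helper name and the PASS/FAIL wording is an assumption, not an existing platform convention.

```bash
#!/usr/bin/env bash
# Run one health check command and print a uniform PASS/FAIL line.
# Usage: run_check <label> <command> [args...]
run_check() {
  local label="$1"
  shift
  # The command's own output is discarded; only its exit status is used.
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}
```

For example, `run_check "API" curl -fsS --max-time 10 https://api.sankofa.nexus/health` would report on the API health endpoint used in step 2. The 10-second timeout is an illustrative value.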

### 2. Incident Declaration

#### Create Incident Channel
- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status
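
A channel name in the `#incident-YYYY-MM-DD-<name>` format can be generated rather than typed by hand. This sketch assumes a slug rule (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) that the runbook does not specify; `incident_channel_name` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Build "#incident-YYYY-MM-DD-<name>" from a free-text incident name.
# Usage: incident_channel_name <name> [YYYY-MM-DD]   (date defaults to today, UTC)
incident_channel_name() {
  local name="$1"
  local day="${2:-$(date -u +%F)}"
  local slug
  # Assumed slug rule: lowercase, non-alphanumerics become single hyphens,
  # leading and trailing hyphens are trimmed.
  slug=$(printf '%s' "$name" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
  printf '#incident-%s-%s\n' "$day" "$slug"
}
```

Passing the date explicitly keeps the name stable when the channel is created some time after the incident started.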

#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```
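
The template can also be filled in from the command line so that the initial status post always has the same fields. This is a sketch only: `incident_status` is a hypothetical helper, and the UTC ISO 8601 default for the start time is an assumption.

```bash
#!/usr/bin/env bash
# Render the incident template with the given field values.
# Usage: incident_status <description> <severity> <status> <services> <impact> [start-time]
incident_status() {
  local description="$1" severity="$2" status="$3" services="$4" impact="$5"
  # Default the start time to "now" in UTC when none is supplied (assumed format).
  local start="${6:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  cat <<EOF
INCIDENT: $description
SEVERITY: $severity
STATUS: $status
START TIME: $start
AFFECTED SERVICES: $services
IMPACT: $impact
EOF
}
```

The output can be pasted into the incident channel as the initial status.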

### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```

**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```

**Check Database**
```bash
# Connection count ($DATABASE_URL is expanded inside the container via `sh -c`)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"'

# Long-running queries (SQL is passed on stdin to avoid nested quoting)
kubectl exec -i -n api deployment/api -- sh -c 'psql "$DATABASE_URL"' <<'SQL'
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes';
SQL
```

**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```

**Check Monitoring**
```bash
# Access Grafana (`open` is macOS; use `xdg-open` on Linux)
open https://grafana.sankofa.nexus

# List Prometheus alerting rules (currently firing alerts are shown in Alertmanager)
kubectl get prometheusrules -n monitoring
```

### 4. Resolution

#### Common Resolution Actions

**Restart Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```

**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```

**Rollback Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```

**Clear Rate Limits** (if needed)
```bash
# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api
```

**Database Maintenance**
```bash
# Vacuum database ($DATABASE_URL is expanded inside the container via `sh -c`)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "VACUUM ANALYZE;"'

# Kill long-running queries (excludes the session running this statement)
kubectl exec -i -n api deployment/api -- sh -c 'psql "$DATABASE_URL"' <<'SQL'
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
  AND now() - query_start > interval '10 minutes';
SQL
```

### 5. Post-Incident

#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```

## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:
```bash
# Check database query performance (SQL is passed on stdin;
# $DATABASE_URL is expanded inside the container via `sh -c`)
kubectl exec -i -n api deployment/api -- sh -c 'psql "$DATABASE_URL"' <<'SQL'
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' ORDER BY duration DESC LIMIT 10;
SQL

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```
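
The raw metrics output is hard to compare against the 500ms symptom threshold by eye. As a rough sketch, the mean request duration can be derived from a Prometheus histogram's `_sum` and `_count` series. The metric name `http_request_duration_seconds` is an assumption (the runbook only greps for `http_request_duration`), as is the absence of trailing timestamps on sample lines; `avg_request_seconds` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and print the mean request
# duration in seconds, summed across all label sets.
# Assumes the metric is named http_request_duration_seconds and that each
# sample line ends with its value (no trailing timestamp).
avg_request_seconds() {
  awk '
    /^http_request_duration_seconds_sum[{ ]/   { sum += $NF }
    /^http_request_duration_seconds_count[{ ]/ { count += $NF }
    END {
      if (count == 0) { print "no samples"; exit 1 }
      printf "%.3f\n", sum / count
    }
  '
}
```

Usage would look like `curl -fsS https://api.sankofa.nexus/metrics | avg_request_seconds`. Note that this is the average since the counters were last reset, so a recent spike can be hidden by a long healthy history; the Grafana dashboards remain the better source for recent latency.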

**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems

### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:
```bash
# $DATABASE_URL is expanded inside the container via `sh -c`
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"'
```

**Resolution**:
- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
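
For the "kill idle connections" step, one possible approach is to terminate sessions that have been idle longer than a chosen cutoff. The sketch below only builds the SQL; `idle_terminate_sql` is a hypothetical helper and the cutoff is left to the operator. It relies on the `state` and `state_change` columns of `pg_stat_activity` and excludes the session that runs the statement.

```bash
#!/usr/bin/env bash
# Print SQL that terminates connections idle for longer than N minutes.
# Usage: idle_terminate_sql <minutes>
idle_terminate_sql() {
  local minutes="$1"
  # Accept digits only, so the value is safe to embed in the SQL text.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND pid <> pg_backend_pid()
  AND now() - state_change > interval '$minutes minutes';
EOF
}
```

The generated statement could then be piped to the database, for example `idle_terminate_sql 15 | kubectl exec -i -n api deployment/api -- sh -c 'psql "$DATABASE_URL"'`, where 15 minutes is an illustrative cutoff. Reviewing the printed SQL before running it is advisable, since terminated sessions may belong to a healthy connection pool.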

### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:
- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity

### Portal Not Loading

**Symptoms**: Portal returns 500 or blank page

**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:
- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors

## Escalation

### When to Escalate
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe
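
The two time-based criteria above are simple enough to check mechanically, for example from a reminder bot. A minimal sketch, assuming elapsed time is tracked in whole minutes; `should_escalate` is a hypothetical helper, and the judgment-based criteria (expertise, customer impact) are deliberately left to the responder.

```bash
#!/usr/bin/env bash
# Exit 0 when an unresolved incident has reached its escalation threshold:
# 30 minutes for P0, 120 minutes (2 hours) for P1.
# Usage: should_escalate <severity> <elapsed-minutes>
should_escalate() {
  local severity="$1" elapsed_minutes="$2"
  case "$severity" in
    P0) [ "$elapsed_minutes" -ge 30 ] ;;
    P1) [ "$elapsed_minutes" -ge 120 ] ;;
    *)  return 1 ;;  # P2/P3 have no time-based threshold in this runbook
  esac
}
```

For instance, `should_escalate P0 45 && echo "Escalate to Team Lead"` prints the reminder once a P0 has been open for 45 minutes.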

### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team

### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]

## Communication

### Status Page Updates
- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time

### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline