Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
521
docs/TROUBLESHOOTING_GUIDE.md
Normal file
@@ -0,0 +1,521 @@
# Troubleshooting Guide

Common issues and solutions for Sankofa Phoenix.

## Table of Contents

1. [API Issues](#api-issues)
2. [Database Issues](#database-issues)
3. [Authentication Issues](#authentication-issues)
4. [Resource Provisioning](#resource-provisioning)
5. [Billing Issues](#billing-issues)
6. [Performance Issues](#performance-issues)
7. [Deployment Issues](#deployment-issues)

## API Issues

### API Not Responding

**Symptoms:**
- 503 Service Unavailable
- Connection timeout
- Health check fails

**Diagnosis:**
```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Check service
kubectl get svc -n api api
```

**Solutions:**
1. Restart API deployment:
   ```bash
   kubectl rollout restart deployment/api -n api
   ```

2. Check resource limits:
   ```bash
   kubectl describe pod -n api -l app=api
   ```

3. Verify database connection (the inner shell expands `DATABASE_URL` inside the pod, not on your workstation):
   ```bash
   kubectl exec -it -n api deployment/api -- \
     sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
   ```

### GraphQL Query Errors

**Symptoms:**
- GraphQL errors in response
- "Internal server error"
- Query timeouts

**Diagnosis:**
```bash
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error

# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ health { status } }"}'
```

**Solutions:**
1. Check query syntax
2. Verify authentication token
3. Check database query performance
4. Review resolver logs

### Rate Limiting

**Symptoms:**
- 429 Too Many Requests
- Rate limit headers present

**Solutions:**
1. Implement request batching
2. Use subscriptions for real-time updates
3. Request rate limit increase (admin)
4. Implement client-side caching
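
On the client side, a common first step is to back off and retry when a 429 arrives instead of resending immediately. The following is a minimal sketch, not the project's own client: `backoff_delay`, `graphql_with_retry`, and the `TOKEN` variable are hypothetical names, and it assumes the API accepts a bearer token in an `Authorization` header.

```bash
# Hypothetical helper: seconds to wait before retry number $1 (1-based).
# The delay doubles on each attempt and is capped at $2 (default 60).
backoff_delay() {
  local attempt="$1" cap="${2:-60}" delay=1 i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$cap" ]; then
      delay=$cap
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Hypothetical wrapper: POST a GraphQL body ($1) and retry while the API
# answers 429, up to $2 attempts (default 5). Assumes $TOKEN holds a token.
graphql_with_retry() {
  local body="$1" max_attempts="${2:-5}" attempt=1 response status
  while [ "$attempt" -le "$max_attempts" ]; do
    # Append the HTTP status on its own line after the response body.
    response=$(curl -s -w '\n%{http_code}' \
      -X POST https://api.sankofa.nexus/graphql \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${TOKEN:-}" \
      -d "$body")
    status=$(printf '%s\n' "$response" | tail -n 1)
    if [ "$status" != "429" ]; then
      # Any non-429 answer is handed back to the caller without the status line.
      printf '%s\n' "$response" | sed '$d'
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  echo "still rate limited after $max_attempts attempts" >&2
  return 1
}
```

For example, `graphql_with_retry '{"query": "{ health { status } }"}'` waits 1, 2, 4, 8, and 16 seconds between attempts before giving up. If the API sends a `Retry-After` header, honoring it would be a better choice than a fixed schedule.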

## Database Issues

### Connection Pool Exhausted

**Symptoms:**
- "Too many connections" errors
- Slow query responses
- Database connection timeouts

**Diagnosis:**
```bash
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"

# Check connection pool metrics
curl https://api.sankofa.nexus/metrics | grep db_connections
```

**Solutions:**
1. Increase connection pool size:
   ```yaml
   env:
     - name: DB_POOL_SIZE
       value: "30"
   ```

2. Close idle connections:
   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
   ```

3. Restart API to reset connections
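
Raising `DB_POOL_SIZE` only helps while the pools of all API replicas together stay below PostgreSQL's `max_connections`. The sketch below does that arithmetic; `max_safe_pool_size` is a hypothetical helper, and the reserve of 10 connections for admin sessions and other clients is an assumption to adjust.

```bash
# Hypothetical helper: largest per-replica pool size that keeps the total
# under the server limit.
# Usage: max_safe_pool_size MAX_CONNECTIONS REPLICAS [RESERVED]
max_safe_pool_size() {
  local max_connections="$1" replicas="$2" reserved="${3:-10}"
  # Integer division rounds down, which errs on the safe side.
  echo $(( (max_connections - reserved) / replicas ))
}

# Example: PostgreSQL's default max_connections of 100 shared by 3 replicas.
max_safe_pool_size 100 3
```

The server limit can be read with `SHOW max_connections;` and the replica count with `kubectl get deployment -n api api`.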

### Slow Queries

**Symptoms:**
- High query latency
- Timeout errors
- Database CPU high

**Diagnosis:**
```sql
-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Solutions:**
1. Add database indexes:
   ```sql
   CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
   CREATE INDEX idx_resources_status ON resources(status);
   ```

2. Analyze tables:
   ```sql
   ANALYZE resources;
   ```

3. Optimize queries
4. Consider read replicas for heavy read workloads

### Database Lock Issues

**Symptoms:**
- Queries hanging
- "Lock timeout" errors
- Deadlock errors

**Solutions:**
1. Check for long-running transactions, including sessions that are idle inside an open transaction and may still hold locks:
   ```sql
   SELECT pid, state, query, now() - xact_start AS duration
   FROM pg_stat_activity
   WHERE state IN ('active', 'idle in transaction') AND xact_start IS NOT NULL
   ORDER BY duration DESC;
   ```

2. Terminate blocking queries (if safe)
3. Review transaction isolation levels
4. Break up large transactions

## Authentication Issues

### Token Expired

**Symptoms:**
- 401 Unauthorized
- "Token expired" error
- Keycloak errors

**Solutions:**
1. Refresh token via Keycloak
2. Re-authenticate
3. Check token expiration settings in Keycloak
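
Before re-authenticating, it can help to confirm that the token really has expired by reading its `exp` claim. This is a minimal sketch, assuming the access token is a standard three-segment JWT (the format Keycloak issues) and that `base64 -d` is available (GNU coreutils; older macOS versions use `-D`). The function names are hypothetical, and the signature is not verified.

```bash
# Decode the payload (second segment) of a JWT. No signature check is done.
jwt_payload() {
  local payload
  # Convert base64url to standard base64: '_' -> '/', '-' -> '+'.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it to a multiple of 4 characters.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Print the numeric "exp" claim (seconds since the Unix epoch).
# Uses a simple pattern match rather than a JSON parser.
jwt_exp() {
  jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Seconds until expiry; negative means the token has already expired.
# An optional second argument overrides "now" (epoch seconds).
jwt_seconds_left() {
  local exp now
  exp=$(jwt_exp "$1")
  now="${2:-$(date +%s)}"
  echo $(( exp - now ))
}
```

Running `jwt_seconds_left "$TOKEN"` (with `TOKEN` holding the access token, a hypothetical variable) prints a negative number for an expired token. A positive number alongside a 401 points toward the Invalid Token checks instead.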

### Invalid Token

**Symptoms:**
- 401 Unauthorized
- "Invalid token" error

**Diagnosis:**
```bash
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
```

**Solutions:**
1. Verify token format
2. Check Keycloak client configuration
3. Verify token signature
4. Check clock synchronization

### Permission Denied

**Symptoms:**
- 403 Forbidden
- "Access denied" error

**Solutions:**
1. Verify user role in Keycloak
2. Check tenant context
3. Review RBAC policies
4. Verify resource ownership

## Resource Provisioning

### VM Creation Fails

**Symptoms:**
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors

**Diagnosis:**
```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm

# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

**Solutions:**
1. Verify Proxmox credentials
2. Check Proxmox node availability
3. Verify resource quotas
4. Check Crossplane provider logs

### Resource Update Fails

**Symptoms:**
- Update mutation fails
- Resource not updating
- Status mismatch

**Solutions:**
1. Check resource state
2. Verify update permissions
3. Review resource constraints
4. Check for conflicting updates

## Billing Issues

### Incorrect Costs

**Symptoms:**
- Unexpected charges
- Missing usage records
- Cost discrepancies

**Diagnosis:**
```sql
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;

-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
```

**Solutions:**
1. Review usage records
2. Verify pricing configuration
3. Check for duplicate records
4. Recalculate costs if needed

### Budget Alerts Not Triggering

**Symptoms:**
- Budget exceeded but no alert
- Alerts not sent

**Diagnosis:**
```sql
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';

-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;
```

**Solutions:**
1. Verify alert configuration
2. Check alert evaluation schedule
3. Review notification channels
4. Test alert manually

### Invoice Generation Fails

**Symptoms:**
- Invoice creation error
- Missing line items
- PDF generation fails

**Solutions:**
1. Check usage records exist
2. Verify billing period
3. Check PDF service
4. Review invoice template

## Performance Issues

### High Latency

**Symptoms:**
- Slow API responses
- Timeout errors
- High P95 latency

**Diagnosis:**
```bash
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep request_duration

# Check database performance
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
```

**Solutions:**
1. Add caching layer
2. Optimize database queries
3. Scale API horizontally
4. Review N+1 query problems
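
To check whether a change actually moved P95 latency, a rough client-side measurement can complement the server metrics. This is a sketch only: `p95` is a hypothetical helper that applies the nearest-rank method to numbers read from standard input, one per line.

```bash
# Hypothetical helper: print the 95th percentile (nearest-rank method)
# of the numbers on stdin, one value per line.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank: ceil(0.95 * N), computed with integer arithmetic.
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }'
}
```

One way to feed it, assuming the API serves `/health` at its public hostname, is to collect `curl` timings in a loop:

```bash
for i in $(seq 1 50); do
  curl -s -o /dev/null -w '%{time_total}\n' https://api.sankofa.nexus/health
done | p95
```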

### High Memory Usage

**Symptoms:**
- OOM kills
- Pod restarts
- Memory warnings

**Solutions:**
1. Increase memory limits
2. Review memory leaks
3. Optimize data structures
4. Implement pagination

### High CPU Usage

**Symptoms:**
- Slow responses
- CPU throttling
- Pod evictions

**Solutions:**
1. Scale horizontally
2. Optimize algorithms
3. Add caching
4. Review expensive operations

## Deployment Issues

### Pods Not Starting

**Symptoms:**
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures

**Diagnosis:**
```bash
# Check pod status
kubectl describe pod -n api <pod-name>

# Check events
kubectl get events -n api --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n api <pod-name>
```

**Solutions:**
1. Check image availability
2. Verify resource requests/limits
3. Check node resources
4. Review init container logs
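
When a namespace has many pods, it helps to list only the ones that need attention before running `describe` on each. This is a small sketch: `unhealthy_pods` is a hypothetical helper that reads the default table output of `kubectl get pods` (NAME, READY, STATUS, ...) from standard input.

```bash
# Hypothetical helper: print "name status" for every pod whose STATUS column
# is neither Running nor Completed. Expects `kubectl get pods` table output
# (with its header line) on stdin.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Typical use is `kubectl get pods -n api | unhealthy_pods`. Pods that report `Running` but are not ready (for example `0/1` in the READY column) are not caught by this filter and still need a look.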

### Service Not Accessible

**Symptoms:**
- Service unreachable
- DNS resolution fails
- Ingress errors

**Diagnosis:**
```bash
# Check service
kubectl get svc -n api

# Check ingress
kubectl describe ingress -n api api

# Test service directly (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health
```

**Solutions:**
1. Verify service selector matches pods
2. Check ingress configuration
3. Verify DNS records
4. Check network policies

### Configuration Issues

**Symptoms:**
- Wrong environment variables
- Missing secrets
- ConfigMap errors

**Solutions:**
1. Verify environment variables:
   ```bash
   kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
   ```

2. Check secrets:
   ```bash
   kubectl get secrets -n api
   ```

3. Review ConfigMaps:
   ```bash
   kubectl get configmaps -n api
   ```

## Getting Help

### Logs

```bash
# API logs
kubectl logs -n api deployment/api --tail=100 -f

# Database logs
kubectl logs -n api deployment/postgres --tail=100

# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
```

### Metrics

```bash
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'

# Grafana dashboards
# Access: https://grafana.sankofa.nexus
```

### Support

- **Documentation**: See `docs/` directory
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
- **API Documentation**: `docs/API_DOCUMENTATION.md`

## Common Error Messages

### "Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies

### "Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible

### "Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase

### "Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions

### "Internal server error"
- Check application logs
- Review error details
- Check system resources