Infrastructure Consolidation Plan
Date: 2025-01-27
Purpose: Plan for consolidating infrastructure across all projects
Status: Implementation Plan
Executive Summary
This plan outlines the strategy for consolidating infrastructure services across 40+ projects, with the goals of reducing infrastructure costs by an estimated 30-40% and improving operational efficiency.
Key Goals:
- Shared Kubernetes clusters (dev/staging/prod)
- Unified monitoring stack
- Shared database services
- Consolidated CI/CD infrastructure
- Unified ingress and networking
Current State Analysis
Infrastructure Distribution
Kubernetes Clusters:
- Multiple project-specific clusters
- Inconsistent configurations
- Duplicate infrastructure components
Databases:
- Separate PostgreSQL instances per project
- Separate Redis instances per project
- No shared database services
Monitoring:
- Project-specific Prometheus/Grafana instances
- Inconsistent logging solutions
- No centralized alerting
CI/CD:
- Project-specific pipelines
- Duplicate build infrastructure
- Inconsistent deployment patterns
Phase 1: Shared Kubernetes Infrastructure (Weeks 5-8)
1.1 Dev/Staging Cluster
Configuration:
- Cluster: K3s or RKE2 (lightweight, production-ready)
- Location: loc_az_hci Proxmox infrastructure
- Namespaces: One per project
- Resource Quotas: Per namespace
- Networking: Unified ingress (Traefik or NGINX)
Projects to Migrate:
- dbis_core (dev/staging)
- the_order (dev/staging)
- Sankofa (dev/staging)
- Web applications (dev/staging)
Benefits:
- Reduced infrastructure overhead
- Consistent deployment patterns
- Shared resources (CPU, memory)
- Unified networking
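The namespace-per-project layout with per-namespace quotas can be sketched as a pair of manifests. A minimal sketch; the project name (`dbis-core-dev`) and the quota values are placeholders to be sized per project:

```yaml
# Illustrative namespace + quota for one project on the shared
# dev/staging cluster (name and limits are placeholders)
apiVersion: v1
kind: Namespace
metadata:
  name: dbis-core-dev
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dbis-core-dev-quota
  namespace: dbis-core-dev
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
```

Applying one such pair per migrated project keeps resource accounting visible per team while the underlying nodes are shared.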
1.2 Production Cluster
Configuration:
- Cluster: K3s or RKE2 (high availability)
- Location: Multi-region (loc_az_hci + cloud)
- Namespaces: One per project with isolation
- Resource Limits: Strict quotas
- Networking: Unified ingress with SSL/TLS
Projects to Migrate:
- dbis_core (production)
- the_order (production)
- Sankofa (production)
- Critical web applications
Security:
- Network policies per namespace
- RBAC per namespace
- Secrets management (Vault)
- Pod Security Standards (PodSecurityPolicy was removed in Kubernetes 1.25)
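The per-namespace network isolation above can be sketched as a default-deny NetworkPolicy that admits only same-namespace traffic and the shared ingress controller. The namespace name and the ingress controller's namespace label are placeholders:

```yaml
# Illustrative default-deny ingress policy for one production namespace;
# only pods in the same namespace and the shared ingress controller
# namespace may connect (names/labels are placeholders)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: dbis-core-prod
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}          # same-namespace traffic
        - namespaceSelector:       # shared ingress controller
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```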
Phase 2: Shared Database Services (Weeks 6-9)
2.1 PostgreSQL Clusters
Dev/Staging Cluster:
- Instances: 1 primary + 1 replica
- Multi-tenancy: Database per project
- Backup: Daily automated backups
- Monitoring: Shared Prometheus
Production Cluster:
- Instances: 1 primary + 2 replicas
- Multi-tenancy: Database per project with isolation
- Backup: Continuous backups + point-in-time recovery
- High Availability: Automatic failover
Projects to Migrate:
- dbis_core
- the_order
- Sankofa
- Other projects with PostgreSQL
Benefits:
- Reduced database overhead
- Centralized backup management
- Unified monitoring
- Easier maintenance
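Database-per-project provisioning on the shared PostgreSQL cluster can be sketched with a few DDL statements. Role and database names are illustrative; the `REVOKE ... FROM PUBLIC` step is what keeps tenants from connecting to each other's databases:

```sql
-- Illustrative per-project provisioning on the shared cluster
-- (role/database names and password are placeholders)
CREATE ROLE dbis_core_app LOGIN PASSWORD 'change-me';
CREATE DATABASE dbis_core OWNER dbis_core_app;
REVOKE CONNECT ON DATABASE dbis_core FROM PUBLIC;  -- block other tenants
GRANT CONNECT ON DATABASE dbis_core TO dbis_core_app;
```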
2.2 Redis Clusters
Dev/Staging Cluster:
- Instances: 1 Redis instance (multi-database)
- Usage: Caching, sessions, queues
- Monitoring: Shared Prometheus
Production Cluster:
- Instances: Redis Cluster (3+ nodes)
- High Availability: Automatic failover
- Persistence: AOF + RDB snapshots
- Monitoring: Shared Prometheus
Projects to Migrate:
- dbis_core
- the_order
- Other projects with Redis
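The AOF + RDB persistence policy for the production Redis nodes can be sketched as a `redis.conf` fragment; the snapshot thresholds below are placeholder values, not tuned recommendations:

```conf
# Illustrative redis.conf fragment for production nodes:
# AOF with per-second fsync, plus periodic RDB snapshots
appendonly yes
appendfsync everysec
save 900 1       # snapshot if >=1 change in 15 min
save 300 10      # snapshot if >=10 changes in 5 min
save 60 10000    # snapshot if >=10000 changes in 1 min
```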
Phase 3: Unified Monitoring Stack (Weeks 7-10)
3.1 Prometheus/Grafana
Deployment:
- Location: Shared Kubernetes cluster
- Storage: Persistent volumes (50-100 GB)
- Retention: 30 days (metrics)
- Scraping: All projects via service discovery
Configuration:
- Unified dashboards
- Project-specific dashboards
- Alert rules per project
- Centralized alerting via Alertmanager
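Scraping all projects via service discovery can be sketched as one Prometheus job that keeps only pods opting in through the conventional `prometheus.io/scrape` annotation; annotation names follow common convention, not a fixed standard:

```yaml
# Illustrative Prometheus scrape job: discover pods cluster-wide,
# keep only those annotated prometheus.io/scrape: "true"
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace   # enables per-project dashboards/alerts
```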
3.2 Logging (Loki/ELK)
Option 1: Loki (Recommended)
- Deployment: Shared Kubernetes cluster
- Storage: Object storage (MinIO, S3)
- Retention: 90 days
- Query: Grafana (LogQL)
Option 2: ELK Stack
- Deployment: Separate cluster or VMs
- Storage: Elasticsearch cluster
- Retention: 90 days
- Query: Kibana
Configuration:
- Centralized log aggregation
- Project-specific log streams
- Log parsing and indexing
- Search and analysis
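For the recommended Loki option, the object-storage backend and 90-day retention can be sketched as a config fragment; the MinIO endpoint and bucket name are placeholders:

```yaml
# Illustrative Loki fragment: MinIO/S3-backed storage, 90-day retention
# (endpoint and bucket are placeholders)
storage_config:
  aws:
    s3: s3://minio.internal:9000/loki
    s3forcepathstyle: true
limits_config:
  retention_period: 2160h   # 90 days
compactor:
  retention_enabled: true   # compactor enforces the retention period
```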
3.3 Alerting
System: Alertmanager (Prometheus)
- Channels: Email, Slack, PagerDuty
- Routing: Per project, per severity
- Grouping: Smart alert grouping
- Silencing: silences managed via the Alertmanager UI
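Per-project, per-severity routing can be sketched as an Alertmanager route tree. Receiver names and the namespace matcher are placeholders; real receivers need channel-specific configuration (Slack webhook, PagerDuty key, etc.):

```yaml
# Illustrative Alertmanager routing: critical alerts page on-call,
# project-scoped alerts go to the project's channel, rest to default
route:
  receiver: default-slack
  group_by: [alertname, namespace]   # smart grouping per project
  routes:
    - matchers: ['severity="critical"']
      receiver: oncall-pagerduty
    - matchers: ['namespace=~"dbis-core-.*"']
      receiver: dbis-core-slack
receivers:
  - name: default-slack
  - name: oncall-pagerduty
  - name: dbis-core-slack
```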
Phase 4: Shared CI/CD Infrastructure (Weeks 8-11)
4.1 Container Registry
Option 1: Harbor (Recommended)
- Deployment: Shared Kubernetes cluster
- Features: Vulnerability scanning, replication
- Storage: Object storage backend
- Access: Project-based access control
Option 2: GitLab Container Registry
- Deployment: GitLab instance
- Features: Integrated with GitLab CI/CD
- Storage: Object storage backend
Configuration:
- Project-specific repositories
- Automated vulnerability scanning
- Image signing
- Retention policies
4.2 Build Infrastructure
Shared Build Runners:
- Type: Kubernetes runners (GitLab Runner, GitHub Actions Runner)
- Resources: Auto-scaling based on queue
- Caching: Shared build cache
- Isolation: Per-project isolation
Benefits:
- Reduced build infrastructure
- Faster builds (shared cache)
- Consistent build environment
- Centralized management
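For the GitLab Runner variant, the Kubernetes executor with a shared S3-backed cache can be sketched as a `config.toml` fragment; the namespace, cache host, and bucket are placeholders:

```toml
# Illustrative GitLab Runner fragment: Kubernetes executor with a
# shared MinIO-backed build cache (host/bucket are placeholders)
concurrent = 20
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "ci"
    cpu_request = "500m"
    memory_request = "1Gi"
  [runners.cache]
    Type = "s3"
    Shared = true            # cache shared across projects
    [runners.cache.s3]
      ServerAddress = "minio.internal:9000"
      BucketName = "ci-cache"
```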
Phase 5: Unified Networking (Weeks 9-12)
5.1 Ingress Controller
Deployment: Traefik or NGINX Ingress Controller
- SSL/TLS: cert-manager with Let's Encrypt
- Routing: Per-project routing rules
- Load Balancing: Unified load balancing
- Rate Limiting: Per-project rate limits
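Per-project routing with automatic TLS can be sketched as an Ingress that references a cert-manager ClusterIssuer. The hostname, issuer name, and service are placeholders:

```yaml
# Illustrative Ingress for one project: nginx class, TLS issued
# automatically by cert-manager (hostname/issuer are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dbis-core
  namespace: dbis-core-prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [dbis-core.example.com]
      secretName: dbis-core-tls   # cert-manager populates this Secret
  rules:
    - host: dbis-core.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dbis-core
                port:
                  number: 80
```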
5.2 Service Mesh (Optional)
Option: Istio or Linkerd
- Features: mTLS, traffic management, observability
- Benefits: Enhanced security, traffic control
- Complexity: Higher setup and maintenance
Resource Requirements
Shared Infrastructure Totals
Kubernetes Clusters:
- Dev/Staging: 50-100 CPU cores, 200-400 GB RAM
- Production: 100-200 CPU cores, 400-800 GB RAM
Database Services:
- PostgreSQL: 20-40 CPU cores, 100-200 GB RAM, 500 GB - 2 TB storage
- Redis: 8-16 CPU cores, 32-64 GB RAM, 100-200 GB storage
Monitoring Stack:
- Prometheus/Grafana: 8-16 CPU cores, 32-64 GB RAM, 500 GB - 1 TB storage
- Logging: 16-32 CPU cores, 64-128 GB RAM, 1-2 TB storage
CI/CD Infrastructure:
- Container Registry: 4-8 CPU cores, 16-32 GB RAM, 500 GB - 1 TB storage
- Build Runners: Auto-scaling (10-50 CPU cores peak)
Total Estimated Resources:
- CPU: 200-400 cores (shared)
- RAM: 800-1600 GB (shared)
- Storage: 3-6 TB (shared)
Cost Reduction: 30-40% compared to separate infrastructure
Migration Strategy
Phase 1: Preparation (Weeks 1-2)
- Design shared infrastructure architecture
- Plan resource allocation
- Create migration scripts
- Set up monitoring baseline
Phase 2: Dev/Staging (Weeks 3-6)
- Deploy shared dev/staging cluster
- Migrate 3-5 projects as pilot
- Set up shared databases (dev/staging)
- Deploy unified monitoring (dev/staging)
- Test and validate
Phase 3: Production (Weeks 7-12)
- Deploy shared production cluster
- Migrate projects to production cluster
- Set up shared databases (production)
- Deploy unified monitoring (production)
- Complete migration
Phase 4: Optimization (Weeks 13+)
- Optimize resource allocation
- Fine-tune monitoring and alerting
- Performance optimization
- Cost optimization
Security Considerations
Namespace Isolation
- Network policies per namespace
- RBAC per namespace
- Resource quotas per namespace
- Pod Security Standards enforced via namespace labels (PodSecurityPolicy was removed in Kubernetes 1.25)
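Since PodSecurityPolicy no longer exists in current Kubernetes, pod-level hardening is applied per namespace via Pod Security Standards labels; a minimal sketch with a placeholder namespace name:

```yaml
# Illustrative Pod Security Standards enforcement for one namespace
# (namespace name is a placeholder; "restricted" is the strictest level)
apiVersion: v1
kind: Namespace
metadata:
  name: dbis-core-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```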
Secrets Management
- HashiCorp Vault or Kubernetes Secrets
- Encrypted at rest
- Encrypted in transit
- Rotation policies
Network Security
- mTLS between services (optional service mesh)
- Network policies
- Ingress with WAF
- DDoS protection
Monitoring and Alerting
Key Metrics
- Resource utilization (CPU, RAM, storage)
- Application performance (latency, throughput)
- Error rates
- Infrastructure health
Alerting Rules
- High resource utilization
- Service failures
- Security incidents
- Performance degradation
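The high-resource-utilization rule can be sketched as a Prometheus alert comparing namespace CPU usage to its quota (requires kube-state-metrics for the quota metric; the 90% threshold and group name are placeholders):

```yaml
# Illustrative alert: sustained CPU usage above 90% of a namespace's
# hard quota for 15 minutes (threshold is a placeholder)
groups:
  - name: shared-infra
    rules:
      - alert: NamespaceHighCPU
        expr: |
          sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
            > 0.9 * sum by (namespace) (kube_resourcequota{resource="limits.cpu", type="hard"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% of quota in {{ $labels.namespace }}"
```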
Success Metrics
- 30-40% reduction in infrastructure costs
- 80% of projects on shared infrastructure
- 50% reduction in duplicate services
- 99.9% uptime for shared services
- 50% faster deployment times
- Unified monitoring and alerting operational
Last Updated: 2025-01-27
Next Review: After Phase 1 completion