# Infrastructure Consolidation Plan

**Date**: 2025-01-27
**Purpose**: Plan for consolidating infrastructure across all projects
**Status**: Implementation Plan

---

## Executive Summary

This plan outlines the strategy for consolidating infrastructure services across 40+ projects, reducing costs by 30-40%, and improving operational efficiency.

**Key Goals**:
- Shared Kubernetes clusters (dev/staging/prod)
- Unified monitoring stack
- Shared database services
- Consolidated CI/CD infrastructure
- Unified ingress and networking

---

## Current State Analysis

### Infrastructure Distribution

**Kubernetes Clusters**:
- Multiple project-specific clusters
- Inconsistent configurations
- Duplicate infrastructure components

**Databases**:
- Separate PostgreSQL instances per project
- Separate Redis instances per project
- No shared database services

**Monitoring**:
- Project-specific Prometheus/Grafana instances
- Inconsistent logging solutions
- No centralized alerting

**CI/CD**:
- Project-specific pipelines
- Duplicate build infrastructure
- Inconsistent deployment patterns

---

## Phase 1: Shared Kubernetes Infrastructure (Weeks 5-8)

### 1.1 Dev/Staging Cluster

**Configuration**:
- **Cluster**: K3s or RKE2 (lightweight, production-ready)
- **Location**: loc_az_hci Proxmox infrastructure
- **Namespaces**: One per project
- **Resource Quotas**: Per namespace
- **Networking**: Unified ingress (Traefik or NGINX)

**Projects to Migrate**:
- dbis_core (dev/staging)
- the_order (dev/staging)
- Sankofa (dev/staging)
- Web applications (dev/staging)

**Benefits**:
- Reduced infrastructure overhead
- Consistent deployment patterns
- Shared resources (CPU, memory)
- Unified networking

### 1.2 Production Cluster

**Configuration**:
- **Cluster**: K3s or RKE2 (high availability)
- **Location**: Multi-region (loc_az_hci + cloud)
- **Namespaces**: One per project with isolation
- **Resource Limits**: Strict quotas
- **Networking**: Unified ingress with SSL/TLS

**Projects to Migrate**:
- dbis_core (production)
- the_order (production)
- Sankofa (production)
- Critical web applications

**Security**:
- Network policies per namespace
- RBAC per namespace
- Secrets management (Vault)
- Pod security policies

---

## Phase 2: Shared Database Services (Weeks 6-9)

### 2.1 PostgreSQL Clusters

**Dev/Staging Cluster**:
- **Instances**: 1 primary + 1 replica
- **Multi-tenancy**: Database per project
- **Backup**: Daily automated backups
- **Monitoring**: Shared Prometheus

**Production Cluster**:
- **Instances**: 1 primary + 2 replicas
- **Multi-tenancy**: Database per project with isolation
- **Backup**: Continuous backups + point-in-time recovery
- **High Availability**: Automatic failover

**Projects to Migrate**:
- dbis_core
- the_order
- Sankofa
- Other projects with PostgreSQL

**Benefits**:
- Reduced database overhead
- Centralized backup management
- Unified monitoring
- Easier maintenance

### 2.2 Redis Clusters

**Dev/Staging Cluster**:
- **Instances**: 1 Redis instance (multi-database)
- **Usage**: Caching, sessions, queues
- **Monitoring**: Shared Prometheus

**Production Cluster**:
- **Instances**: Redis Cluster (3+ nodes)
- **High Availability**: Automatic failover
- **Persistence**: AOF + RDB snapshots
- **Monitoring**: Shared Prometheus

**Projects to Migrate**:
- dbis_core
- the_order
- Other projects with Redis

---

## Phase 3: Unified Monitoring Stack (Weeks 7-10)

### 3.1 Prometheus/Grafana

**Deployment**:
- **Location**: Shared Kubernetes cluster
- **Storage**: Persistent volumes (50-100 GB)
- **Retention**: 30 days (metrics)
- **Scraping**: All projects via service discovery

**Configuration**:
- Unified dashboards
- Project-specific dashboards
- Alert rules per project
- Centralized alerting

### 3.2 Logging (Loki/ELK)

**Option 1: Loki (Recommended)**
- **Deployment**: Shared Kubernetes cluster
- **Storage**: Object storage (MinIO, S3)
- **Retention**: 90 days
- **Query**: Grafana Loki

**Option 2: ELK Stack**
- **Deployment**: Separate cluster or VMs
- **Storage**: Elasticsearch cluster
- **Retention**: 90 days
- **Query**: Kibana

**Configuration**:
- Centralized log aggregation
- Project-specific log streams
- Log parsing and indexing
- Search and analysis

### 3.3 Alerting

**System**: Alertmanager (Prometheus)
- **Channels**: Email, Slack, PagerDuty
- **Routing**: Per project, per severity
- **Grouping**: Smart alert grouping
- **Silencing**: Alert silencing interface

---

## Phase 4: Shared CI/CD Infrastructure (Weeks 8-11)

### 4.1 Container Registry

**Option 1: Harbor (Recommended)**
- **Deployment**: Shared Kubernetes cluster
- **Features**: Vulnerability scanning, replication
- **Storage**: Object storage backend
- **Access**: Project-based access control

**Option 2: GitLab Container Registry**
- **Deployment**: GitLab instance
- **Features**: Integrated with GitLab CI/CD
- **Storage**: Object storage backend

**Configuration**:
- Project-specific repositories
- Automated vulnerability scanning
- Image signing
- Retention policies

### 4.2 Build Infrastructure

**Shared Build Runners**:
- **Type**: Kubernetes runners (GitLab Runner, GitHub Actions Runner)
- **Resources**: Auto-scaling based on queue
- **Caching**: Shared build cache
- **Isolation**: Per-project isolation

**Benefits**:
- Reduced build infrastructure
- Faster builds (shared cache)
- Consistent build environment
- Centralized management

---

## Phase 5: Unified Networking (Weeks 9-12)

### 5.1 Ingress Controller

**Deployment**: Traefik or NGINX Ingress Controller
- **SSL/TLS**: Cert-Manager with Let's Encrypt
- **Routing**: Per-project routing rules
- **Load Balancing**: Unified load balancing
- **Rate Limiting**: Per-project rate limits

### 5.2 Service Mesh (Optional)

**Option**: Istio or Linkerd
- **Features**: mTLS, traffic management, observability
- **Benefits**: Enhanced security, traffic control
- **Complexity**: Higher setup and maintenance

---

## Resource Requirements

### Shared Infrastructure Totals

**Kubernetes Clusters**:
- **Dev/Staging**: 50-100 CPU cores, 200-400 GB RAM
- **Production**: 100-200 CPU cores, 400-800 GB RAM

**Database Services**:
- **PostgreSQL**: 20-40 CPU cores, 100-200 GB RAM, 500 GB - 2 TB storage
- **Redis**: 8-16 CPU cores, 32-64 GB RAM, 100-200 GB storage

**Monitoring Stack**:
- **Prometheus/Grafana**: 8-16 CPU cores, 32-64 GB RAM, 500 GB - 1 TB storage
- **Logging**: 16-32 CPU cores, 64-128 GB RAM, 1-2 TB storage

**CI/CD Infrastructure**:
- **Container Registry**: 4-8 CPU cores, 16-32 GB RAM, 500 GB - 1 TB storage
- **Build Runners**: Auto-scaling (10-50 CPU cores peak)

**Total Estimated Resources**:
- **CPU**: 200-400 cores (shared)
- **RAM**: 800-1600 GB (shared)
- **Storage**: 3-6 TB (shared)

**Cost Reduction**: 30-40% compared to separate infrastructure

---

## Migration Strategy

### Phase 1: Preparation (Weeks 1-2)
- [ ] Design shared infrastructure architecture
- [ ] Plan resource allocation
- [ ] Create migration scripts
- [ ] Set up monitoring baseline

### Phase 2: Dev/Staging (Weeks 3-6)
- [ ] Deploy shared dev/staging cluster
- [ ] Migrate 3-5 projects as pilot
- [ ] Set up shared databases (dev/staging)
- [ ] Deploy unified monitoring (dev/staging)
- [ ] Test and validate

### Phase 3: Production (Weeks 7-12)
- [ ] Deploy shared production cluster
- [ ] Migrate projects to production cluster
- [ ] Set up shared databases (production)
- [ ] Deploy unified monitoring (production)
- [ ] Complete migration

### Phase 4: Optimization (Weeks 13+)
- [ ] Optimize resource allocation
- [ ] Fine-tune monitoring and alerting
- [ ] Performance optimization
- [ ] Cost optimization

---

## Security Considerations

### Namespace Isolation
- Network policies per namespace
- RBAC per namespace
- Resource quotas per namespace
- Pod security policies

### Secrets Management
- HashiCorp Vault or Kubernetes Secrets
- Encrypted at rest
- Encrypted in transit
- Rotation policies

### Network Security
- mTLS between services (optional service mesh)
- Network policies
- Ingress with WAF
- DDoS protection

---

## Monitoring and Alerting

### Key Metrics
- Resource utilization (CPU, RAM, storage)
- Application performance (latency, throughput)
- Error rates
- Infrastructure health

### Alerting Rules
- High resource utilization
- Service failures
- Security incidents
- Performance degradation

---

## Success Metrics

- [ ] 30-40% reduction in infrastructure costs
- [ ] 80% of projects on shared infrastructure
- [ ] 50% reduction in duplicate services
- [ ] 99.9% uptime for shared services
- [ ] 50% faster deployment times
- [ ] Unified monitoring and alerting operational

---

**Last Updated**: 2025-01-27
**Next Review**: After Phase 1 completion
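---

## Appendix: Configuration Sketches

The per-namespace resource quotas in Phase 1 map directly onto standard Kubernetes `ResourceQuota` objects. A minimal sketch that emits one quota manifest per project namespace; the project list and the CPU/memory figures here are illustrative placeholders, not allocations from this plan:

```python
import json

def quota_manifest(project: str, cpu: str, memory: str) -> dict:
    """Build a Kubernetes ResourceQuota manifest for one project namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{project}-quota", "namespace": project},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                # Cap limits as well so one project cannot starve the cluster.
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }

# Illustrative dev/staging allocations (placeholder values).
projects = {"dbis-core": ("8", "32Gi"), "the-order": ("8", "32Gi"), "sankofa": ("4", "16Gi")}
manifests = [quota_manifest(name, cpu, mem) for name, (cpu, mem) in projects.items()]
print(json.dumps(manifests[0], indent=2))
```

`kubectl apply -f` accepts JSON as well as YAML, so these dicts can be applied as-is; in practice the same structure would more likely live in a Helm chart or Kustomize overlay per namespace.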
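Phase 2.1's database-per-project multi-tenancy can be provisioned with a handful of SQL statements per project: a dedicated role, a database owned by that role, and a revoked `PUBLIC` connect privilege so projects cannot reach each other's databases. A sketch that generates those statements (role/database naming is a convention assumed here, and the role's password is assumed to be set separately, e.g. from Vault):

```python
def provision_sql(project: str) -> list[str]:
    """Generate SQL to carve out one project's database on the shared cluster."""
    role = f"{project}_app"
    db = f"{project}_db"
    return [
        f'CREATE ROLE "{role}" LOGIN;',
        f'CREATE DATABASE "{db}" OWNER "{role}";',
        # Block cross-project access: only the owning role may connect.
        f'REVOKE CONNECT ON DATABASE "{db}" FROM PUBLIC;',
        f'GRANT CONNECT ON DATABASE "{db}" TO "{role}";',
    ]

for stmt in provision_sql("dbis_core"):
    print(stmt)
```

In production this would typically run through a migration tool or a PostgreSQL operator rather than hand-applied SQL, but the privilege model is the same.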
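The 50-100 GB persistent volume and 30-day retention planned for Prometheus in Phase 3.1 can be sanity-checked with the usual back-of-envelope TSDB sizing formula. The ingest rate (~20k active samples/sec across all projects) and the ~1.5 bytes-per-sample compression figure below are assumptions for illustration, not measurements:

```python
def tsdb_gb(samples_per_sec: float, bytes_per_sample: float, retention_days: int) -> float:
    """Rough Prometheus TSDB size: ingest rate x sample size x retention window."""
    return samples_per_sec * bytes_per_sample * retention_days * 86_400 / 1e9

# Assumed aggregate scrape load for ~40 projects (rough estimate).
size = tsdb_gb(samples_per_sec=20_000, bytes_per_sample=1.5, retention_days=30)
print(f"{size:.0f} GB")  # ~78 GB, inside the planned 50-100 GB volume
```

If the measured ingest rate after the pilot migration is much higher, the volume size or retention window should be revisited before production rollout.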
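The per-project, per-severity routing in Phase 3.3 corresponds to Alertmanager's first-match routing tree. A toy Python model of that matching logic, to make the intended behaviour concrete; the receiver names and label values are hypothetical, and the real configuration would be Alertmanager YAML rather than this sketch:

```python
# First-match routing over alert labels, modelled on Alertmanager's route tree.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),        # page on-call for any critical alert
    ({"project": "dbis_core"}, "slack-dbis-core"),  # project channel for everything else
    ({"project": "the_order"}, "slack-the-order"),
]
DEFAULT_RECEIVER = "email-ops"

def route(labels: dict) -> str:
    """Return the receiver for the first route whose matchers all apply."""
    for matchers, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER

print(route({"project": "dbis_core", "severity": "critical"}))  # pagerduty
print(route({"project": "dbis_core", "severity": "warning"}))   # slack-dbis-core
```

Ordering matters: the severity route sits above the project routes so that critical alerts always page, regardless of project.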
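Phase 5.1's per-project routing rules with Cert-Manager TLS can be expressed as one `networking.k8s.io/v1` Ingress per project. A sketch assuming the NGINX ingress class; the `example.com` domain and the `letsencrypt-prod` issuer name are placeholders, while the `cert-manager.io/cluster-issuer` annotation is the standard hook Cert-Manager watches:

```python
def ingress_manifest(project: str, service: str, port: int) -> dict:
    """Build a networking.k8s.io/v1 Ingress routing one project's hostname."""
    host = f"{project}.example.com"  # placeholder domain
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": f"{project}-ingress",
            "namespace": project,
            # Cert-Manager sees this annotation and provisions the TLS certificate.
            "annotations": {"cert-manager.io/cluster-issuer": "letsencrypt-prod"},
        },
        "spec": {
            "ingressClassName": "nginx",
            "tls": [{"hosts": [host], "secretName": f"{project}-tls"}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }

print(ingress_manifest("sankofa", "sankofa-web", 8080)["spec"]["rules"][0]["host"])
```

Per-project rate limits would be layered on top via controller-specific annotations, which differ between Traefik and NGINX and so are left out of this sketch.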