Infrastructure Consolidation Plan

Date: 2025-01-27
Purpose: Plan for consolidating infrastructure across all projects
Status: Implementation Plan


Executive Summary

This plan outlines the strategy for consolidating infrastructure services across 40+ projects, reducing infrastructure costs by an estimated 30-40% and improving operational efficiency.

Key Goals:

  • Shared Kubernetes clusters (dev/staging/prod)
  • Unified monitoring stack
  • Shared database services
  • Consolidated CI/CD infrastructure
  • Unified ingress and networking

Current State Analysis

Infrastructure Distribution

Kubernetes Clusters:

  • Multiple project-specific clusters
  • Inconsistent configurations
  • Duplicate infrastructure components

Databases:

  • Separate PostgreSQL instances per project
  • Separate Redis instances per project
  • No shared database services

Monitoring:

  • Project-specific Prometheus/Grafana instances
  • Inconsistent logging solutions
  • No centralized alerting

CI/CD:

  • Project-specific pipelines
  • Duplicate build infrastructure
  • Inconsistent deployment patterns

Phase 1: Shared Kubernetes Infrastructure (Weeks 5-8)

1.1 Dev/Staging Cluster

Configuration:

  • Cluster: K3s or RKE2 (lightweight, production-ready)
  • Location: loc_az_hci Proxmox infrastructure
  • Namespaces: One per project
  • Resource Quotas: Per namespace
  • Networking: Unified ingress (Traefik or NGINX)
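
The per-project namespace and quota model above could be expressed with manifests along these lines; the project name, environment label, and limit values are illustrative placeholders, not final sizing:

```yaml
# Illustrative sketch: one namespace per project plus a resource quota.
# Name and limit values are placeholders to be sized per project.
apiVersion: v1
kind: Namespace
metadata:
  name: dbis-core
  labels:
    environment: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dbis-core-quota
  namespace: dbis-core
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
```

Quotas like this are what lets many projects share one cluster safely: a runaway workload in one namespace cannot starve the others.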

Projects to Migrate:

  • dbis_core (dev/staging)
  • the_order (dev/staging)
  • Sankofa (dev/staging)
  • Web applications (dev/staging)

Benefits:

  • Reduced infrastructure overhead
  • Consistent deployment patterns
  • Shared resources (CPU, memory)
  • Unified networking

1.2 Production Cluster

Configuration:

  • Cluster: K3s or RKE2 (high availability)
  • Location: Multi-region (loc_az_hci + cloud)
  • Namespaces: One per project with isolation
  • Resource Limits: Strict quotas
  • Networking: Unified ingress with SSL/TLS

Projects to Migrate:

  • dbis_core (production)
  • the_order (production)
  • Sankofa (production)
  • Critical web applications

Security:

  • Network policies per namespace
  • RBAC per namespace
  • Secrets management (Vault)
  • Pod Security Standards (enforced via Pod Security Admission; PodSecurityPolicy was removed in Kubernetes 1.25)
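
The per-namespace network policies above could start from a default-deny baseline like the following sketch (namespace name is a placeholder); traffic from other namespaces would then be allowed explicitly, policy by policy:

```yaml
# Illustrative default-deny policy: blocks all ingress into the project
# namespace except traffic from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: dbis-core
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # same-namespace pods only
```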

Phase 2: Shared Database Services (Weeks 6-9)

2.1 PostgreSQL Clusters

Dev/Staging Cluster:

  • Instances: 1 primary + 1 replica
  • Multi-tenancy: Database per project
  • Backup: Daily automated backups
  • Monitoring: Shared Prometheus
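
The daily automated backups could be driven by a Kubernetes CronJob per project database, roughly as sketched below; the image tag, namespace, secret name, and backup destination are all placeholders:

```yaml
# Illustrative daily pg_dump job for one project database.
# Connection details and storage target are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup-dbis-core
  namespace: shared-databases
spec:
  schedule: "0 3 * * *"   # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backups/dbis_core-$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: pg-backup-credentials   # provides DATABASE_URL
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: pg-backups
```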

Production Cluster:

  • Instances: 1 primary + 2 replicas
  • Multi-tenancy: Database per project with isolation
  • Backup: Continuous backups + point-in-time recovery
  • High Availability: Automatic failover

Projects to Migrate:

  • dbis_core
  • the_order
  • Sankofa
  • Other projects with PostgreSQL

Benefits:

  • Reduced database overhead
  • Centralized backup management
  • Unified monitoring
  • Easier maintenance

2.2 Redis Clusters

Dev/Staging Cluster:

  • Instances: 1 Redis instance (multi-database)
  • Usage: Caching, sessions, queues
  • Monitoring: Shared Prometheus

Production Cluster:

  • Instances: Redis Cluster (3+ nodes)
  • High Availability: Automatic failover
  • Persistence: AOF + RDB snapshots
  • Monitoring: Shared Prometheus

Projects to Migrate:

  • dbis_core
  • the_order
  • Other projects with Redis

Phase 3: Unified Monitoring Stack (Weeks 7-10)

3.1 Prometheus/Grafana

Deployment:

  • Location: Shared Kubernetes cluster
  • Storage: Persistent volumes (50-100 GB)
  • Retention: 30 days (metrics)
  • Scraping: All projects via service discovery
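
Scraping all projects via service discovery could use Prometheus's Kubernetes pod discovery, along these lines (the annotation convention and the `project` label are assumptions of this sketch):

```yaml
# Illustrative scrape job: discovers pods in all namespaces and keeps
# only those annotated prometheus.io/scrape: "true".
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_namespace]
        target_label: project   # namespace == project in this plan
```

Mapping the namespace onto a `project` label is what makes per-project dashboards and alert routing possible downstream.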

Configuration:

  • Unified dashboards
  • Project-specific dashboards
  • Alert rules per project
  • Centralized alerting

3.2 Logging (Loki/ELK)

Option 1: Loki (Recommended)

  • Deployment: Shared Kubernetes cluster
  • Storage: Object storage (MinIO, S3)
  • Retention: 90 days
  • Query: Grafana Loki
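
The Loki option with 90-day retention and an S3-compatible backend could be configured roughly as follows; the endpoint, bucket name, and exact key layout are placeholders for this sketch and should be checked against the Loki version deployed:

```yaml
# Illustrative Loki settings: 90-day retention on MinIO/S3-compatible
# object storage. Endpoint and bucket names are placeholders.
common:
  storage:
    s3:
      endpoint: minio.storage.svc:9000
      bucketnames: loki-chunks
      s3forcepathstyle: true
compactor:
  retention_enabled: true
limits_config:
  retention_period: 2160h   # 90 days
```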

Option 2: ELK Stack

  • Deployment: Separate cluster or VMs
  • Storage: Elasticsearch cluster
  • Retention: 90 days
  • Query: Kibana

Configuration:

  • Centralized log aggregation
  • Project-specific log streams
  • Log parsing and indexing
  • Search and analysis

3.3 Alerting

System: Alertmanager (Prometheus)

  • Channels: Email, Slack, PagerDuty
  • Routing: Per project, per severity
  • Grouping: Smart alert grouping
  • Silencing: Alert silencing interface
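
The per-project, per-severity routing could look like the following Alertmanager sketch; receiver names are placeholders and their notification configs (Slack webhooks, PagerDuty keys) are omitted:

```yaml
# Illustrative routing tree: critical alerts page on-call, project-tagged
# alerts go to a per-project channel, everything else to a default channel.
route:
  receiver: default-slack
  group_by: [alertname, project]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall
    - matchers:
        - project = "dbis_core"
      receiver: dbis-core-slack
receivers:
  - name: default-slack      # notification configs omitted in this sketch
  - name: pagerduty-oncall
  - name: dbis-core-slack
```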

Phase 4: Shared CI/CD Infrastructure (Weeks 8-11)

4.1 Container Registry

Option 1: Harbor (Recommended)

  • Deployment: Shared Kubernetes cluster
  • Features: Vulnerability scanning, replication
  • Storage: Object storage backend
  • Access: Project-based access control

Option 2: GitLab Container Registry

  • Deployment: GitLab instance
  • Features: Integrated with GitLab CI/CD
  • Storage: Object storage backend

Configuration:

  • Project-specific repositories
  • Automated vulnerability scanning
  • Image signing
  • Retention policies

4.2 Build Infrastructure

Shared Build Runners:

  • Type: Kubernetes runners (GitLab Runner, GitHub Actions Runner)
  • Resources: Auto-scaling based on queue
  • Caching: Shared build cache
  • Isolation: Per-project isolation
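
On the GitLab option, a project pipeline using the shared runners and shared cache could look like this fragment; the runner tag, cache key file, and commands are illustrative:

```yaml
# Illustrative .gitlab-ci.yml fragment: route to shared Kubernetes
# runners and reuse a dependency cache keyed by the lockfile.
build:
  stage: build
  tags: [shared-k8s]        # placeholder tag for the shared runner pool
  cache:
    key:
      files: [package-lock.json]
    paths: [node_modules/]
  script:
    - npm ci
    - npm run build
```

Keying the cache on the lockfile means any job in any project with the same dependencies hits a warm cache, which is where the "faster builds" benefit comes from.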

Benefits:

  • Reduced build infrastructure
  • Faster builds (shared cache)
  • Consistent build environment
  • Centralized management

Phase 5: Unified Networking (Weeks 9-12)

5.1 Ingress Controller

Deployment: Traefik or NGINX Ingress Controller

  • SSL/TLS: Cert-Manager with Let's Encrypt
  • Routing: Per-project routing rules
  • Load Balancing: Unified load balancing
  • Rate Limiting: Per-project rate limits
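
The cert-manager plus Let's Encrypt setup could be sketched as a ClusterIssuer and a per-project ingress route like the following; hostnames, email, and the ingress class are placeholders:

```yaml
# Illustrative ACME issuer and a per-project TLS ingress route.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com          # placeholder contact
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dbis-core
  namespace: dbis-core
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: [dbis-core.example.com]
      secretName: dbis-core-tls     # cert-manager populates this
  rules:
    - host: dbis-core.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dbis-core
                port:
                  number: 80
```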

5.2 Service Mesh (Optional)

Option: Istio or Linkerd

  • Features: mTLS, traffic management, observability
  • Benefits: Enhanced security, traffic control
  • Complexity: Higher setup and maintenance

Resource Requirements

Shared Infrastructure Totals

Kubernetes Clusters:

  • Dev/Staging: 50-100 CPU cores, 200-400 GB RAM
  • Production: 100-200 CPU cores, 400-800 GB RAM

Database Services:

  • PostgreSQL: 20-40 CPU cores, 100-200 GB RAM, 500 GB - 2 TB storage
  • Redis: 8-16 CPU cores, 32-64 GB RAM, 100-200 GB storage

Monitoring Stack:

  • Prometheus/Grafana: 8-16 CPU cores, 32-64 GB RAM, 500 GB - 1 TB storage
  • Logging: 16-32 CPU cores, 64-128 GB RAM, 1-2 TB storage

CI/CD Infrastructure:

  • Container Registry: 4-8 CPU cores, 16-32 GB RAM, 500 GB - 1 TB storage
  • Build Runners: Auto-scaling (10-50 CPU cores peak)

Total Estimated Resources:

  • CPU: 200-400 cores (shared)
  • RAM: 800-1600 GB (shared)
  • Storage: 3-6 TB (shared)

Estimated Cost Reduction: 30-40% compared to maintaining separate per-project infrastructure


Migration Strategy

Phase 1: Preparation (Weeks 1-2)

  • Design shared infrastructure architecture
  • Plan resource allocation
  • Create migration scripts
  • Set up monitoring baseline

Phase 2: Dev/Staging (Weeks 3-6)

  • Deploy shared dev/staging cluster
  • Migrate 3-5 projects as pilot
  • Set up shared databases (dev/staging)
  • Deploy unified monitoring (dev/staging)
  • Test and validate

Phase 3: Production (Weeks 7-12)

  • Deploy shared production cluster
  • Migrate projects to production cluster
  • Set up shared databases (production)
  • Deploy unified monitoring (production)
  • Complete migration

Phase 4: Optimization (Weeks 13+)

  • Optimize resource allocation
  • Fine-tune monitoring and alerting
  • Performance optimization
  • Cost optimization

Security Considerations

Namespace Isolation

  • Network policies per namespace
  • RBAC per namespace
  • Resource quotas per namespace
  • Pod Security Standards (enforced via Pod Security Admission; PodSecurityPolicy was removed in Kubernetes 1.25)

Secrets Management

  • HashiCorp Vault or Kubernetes Secrets
  • Encrypted at rest
  • Encrypted in transit
  • Rotation policies

Network Security

  • mTLS between services (optional service mesh)
  • Network policies
  • Ingress with WAF
  • DDoS protection

Monitoring and Alerting

Key Metrics

  • Resource utilization (CPU, RAM, storage)
  • Application performance (latency, throughput)
  • Error rates
  • Infrastructure health

Alerting Rules

  • High resource utilization
  • Service failures
  • Security incidents
  • Performance degradation
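
Two of the alert conditions above could be expressed as Prometheus rules along these lines; the thresholds, durations, and the metric names used for the CPU ratio are illustrative and should be tuned per project:

```yaml
# Illustrative alert rules for high resource utilization and service
# failures; thresholds and label names are placeholders.
groups:
  - name: shared-infra
    rules:
      - alert: HighCPUUtilization
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
              / sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Namespace {{ $labels.namespace }} is above 90% of its CPU limit"
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```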

Success Metrics

  • 30-40% reduction in infrastructure costs
  • 80% of projects on shared infrastructure
  • 50% reduction in duplicate services
  • 99.9% uptime for shared services
  • 50% faster deployment times
  • Unified monitoring and alerting operational

Last Updated: 2025-01-27
Next Review: After Phase 1 completion