# Infrastructure Consolidation Plan

**Date**: 2025-01-27
**Purpose**: Plan for consolidating infrastructure across all projects
**Status**: Implementation Plan

---

## Executive Summary

This plan outlines the strategy for consolidating infrastructure services across 40+ projects onto shared platforms, with the goal of cutting infrastructure costs by 30-40% and improving operational efficiency.

**Key Goals**:
- Shared Kubernetes clusters (dev/staging/prod)
- Unified monitoring stack
- Shared database services
- Consolidated CI/CD infrastructure
- Unified ingress and networking

---

## Current State Analysis

### Infrastructure Distribution

**Kubernetes Clusters**:
- Multiple project-specific clusters
- Inconsistent configurations
- Duplicate infrastructure components

**Databases**:
- Separate PostgreSQL instances per project
- Separate Redis instances per project
- No shared database services

**Monitoring**:
- Project-specific Prometheus/Grafana instances
- Inconsistent logging solutions
- No centralized alerting

**CI/CD**:
- Project-specific pipelines
- Duplicate build infrastructure
- Inconsistent deployment patterns

---

## Phase 1: Shared Kubernetes Infrastructure (Weeks 5-8)

### 1.1 Dev/Staging Cluster

**Configuration**:
- **Cluster**: K3s or RKE2 (lightweight, production-ready)
- **Location**: loc_az_hci Proxmox infrastructure
- **Namespaces**: One per project
- **Resource Quotas**: Per namespace
- **Networking**: Unified ingress (Traefik or NGINX)
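The namespace-per-project layout with per-namespace quotas could be generated along the following lines. This is a minimal sketch, not the plan's actual tooling: the quota values and the `project` label are placeholders, and the function names are hypothetical. One detail worth noting is that Kubernetes namespace names must be RFC 1123 DNS labels, so project names such as `dbis_core` or `Sankofa` need normalizing first.

```python
import json
import re


def to_namespace_name(project: str) -> str:
    """Convert a project name into a valid Kubernetes namespace name.

    Namespace names must be RFC 1123 DNS labels: lowercase letters,
    digits and '-', at most 63 characters.
    """
    name = re.sub(r"[^a-z0-9]+", "-", project.lower()).strip("-")
    if not name or len(name) > 63:
        raise ValueError(f"cannot derive a namespace name from {project!r}")
    return name


def namespace_with_quota(project: str, cpu: str, memory: str) -> dict:
    """Build a Namespace plus a ResourceQuota for one project as a v1 List."""
    ns = to_namespace_name(project)
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"project": ns}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{ns}-quota", "namespace": ns},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [namespace, quota]}


# Quota values below are placeholders, not sizing recommendations.
print(json.dumps(namespace_with_quota("dbis_core", cpu="8", memory="32Gi"), indent=2))
```

The JSON output can be piped to `kubectl apply -f -`, since Kubernetes accepts manifests in JSON as well as YAML.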

**Projects to Migrate**:
- dbis_core (dev/staging)
- the_order (dev/staging)
- Sankofa (dev/staging)
- Web applications (dev/staging)

**Benefits**:
- Reduced infrastructure overhead
- Consistent deployment patterns
- Shared resources (CPU, memory)
- Unified networking

### 1.2 Production Cluster

**Configuration**:
- **Cluster**: K3s or RKE2 (high availability)
- **Location**: Multi-region (loc_az_hci + cloud)
- **Namespaces**: One per project with isolation
- **Resource Limits**: Strict quotas
- **Networking**: Unified ingress with SSL/TLS

**Projects to Migrate**:
- dbis_core (production)
- the_order (production)
- Sankofa (production)
- Critical web applications

**Security**:
- Network policies per namespace
- RBAC per namespace
- Secrets management (Vault)
- Pod Security Standards enforced via Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)
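As an illustration of per-namespace network isolation, the sketch below builds a NetworkPolicy that only admits ingress traffic from pods in the same namespace. The policy name and the namespace `dbis-core` are assumptions for the example, and enforcement depends on the cluster's CNI plugin supporting NetworkPolicy.

```python
import json


def same_namespace_only_policy(namespace: str) -> dict:
    """NetworkPolicy: pods accept ingress only from pods in their own namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace-only", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector under "from" matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


print(json.dumps(same_namespace_only_policy("dbis-core"), indent=2))
```

A policy this strict also blocks traffic from an ingress controller running in another namespace, so a real deployment would need an additional rule that admits the ingress controller's namespace.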

---

## Phase 2: Shared Database Services (Weeks 6-9)

### 2.1 PostgreSQL Clusters

**Dev/Staging Cluster**:
- **Instances**: 1 primary + 1 replica
- **Multi-tenancy**: Database per project
- **Backup**: Daily automated backups
- **Monitoring**: Shared Prometheus

**Production Cluster**:
- **Instances**: 1 primary + 2 replicas
- **Multi-tenancy**: Database per project with isolation
- **Backup**: Continuous backups + point-in-time recovery
- **High Availability**: Automatic failover (PostgreSQL has no built-in failover, so this requires a cluster manager such as Patroni)
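The database-per-project model can be sketched as a small generator for the provisioning SQL. The `_app` role suffix and the function name are hypothetical conventions, and credentials are deliberately left out on the assumption that they are set separately (for example from Vault).

```python
import re

# Lowercase identifiers only; 59 chars max leaves room for the "_app" suffix
# within PostgreSQL's 63-byte identifier limit.
_PROJECT_RE = re.compile(r"^[a-z_][a-z0-9_]{0,58}$")


def provisioning_sql(project: str) -> list[str]:
    """Return SQL statements that create an isolated database for one project."""
    if not _PROJECT_RE.match(project):
        raise ValueError(f"unsafe or invalid project name: {project!r}")
    role = f"{project}_app"
    return [
        f'CREATE ROLE "{role}" LOGIN;',
        f'CREATE DATABASE "{project}" OWNER "{role}";',
        # By default every role may connect to every database; revoke that
        # so only the owning role (and superusers) can connect.
        f'REVOKE CONNECT ON DATABASE "{project}" FROM PUBLIC;',
    ]


for statement in provisioning_sql("dbis_core"):
    print(statement)
```

Validating the project name before interpolating it keeps the generated statements safe to run, since identifiers cannot be passed as bound parameters.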

**Projects to Migrate**:
- dbis_core
- the_order
- Sankofa
- Other projects with PostgreSQL

**Benefits**:
- Reduced database overhead
- Centralized backup management
- Unified monitoring
- Easier maintenance

### 2.2 Redis Clusters

**Dev/Staging Cluster**:
- **Instances**: 1 Redis instance (multi-database)
- **Usage**: Caching, sessions, queues
- **Monitoring**: Shared Prometheus

**Production Cluster**:
- **Instances**: Redis Cluster (3+ primary nodes, each with at least one replica)
- **High Availability**: Automatic failover (a replica is promoted when its primary fails)
- **Persistence**: AOF + RDB snapshots
- **Monitoring**: Shared Prometheus
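One difference between the two environments affects application code: a single Redis instance offers numbered logical databases (16 by default), while Redis Cluster supports only database 0. A minimal sketch of how tenants might be separated in each case follows; the helper names and the `project:key` prefix convention are assumptions, not something this plan prescribes.

```python
def dev_db_index(projects: list[str], databases: int = 16) -> dict[str, int]:
    """Assign each project a logical database number on a single Redis instance.

    `databases` mirrors the Redis `databases` setting, which defaults to 16.
    """
    if len(projects) > databases:
        raise ValueError("more projects than logical databases available")
    return {project: index for index, project in enumerate(sorted(projects))}


def cluster_key(project: str, key: str) -> str:
    """Namespace a key by project for Redis Cluster, which only has database 0."""
    return f"{project}:{key}"


print(dev_db_index(["the_order", "dbis_core"]))
print(cluster_key("dbis_core", "session:42"))
```

Using the key-prefix scheme in dev/staging as well would keep application behaviour identical across environments.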

**Projects to Migrate**:
- dbis_core
- the_order
- Other projects with Redis

---

## Phase 3: Unified Monitoring Stack (Weeks 7-10)

### 3.1 Prometheus/Grafana

**Deployment**:
- **Location**: Shared Kubernetes cluster
- **Storage**: Persistent volumes (500 GB - 1 TB, matching the Resource Requirements section)
- **Retention**: 30 days (metrics)
- **Scraping**: All projects via service discovery
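Storage for a given retention period can be estimated with the rule of thumb from the Prometheus documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (typically 1-2 bytes). The ingestion rate used below is an assumed figure for illustration, not a measurement from these projects.

```python
def prometheus_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Estimate Prometheus disk usage: retention * ingestion rate * sample size."""
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed ingestion rate: 100,000 samples/s across all scraped projects.
estimate = prometheus_disk_bytes(retention_days=30, samples_per_second=100_000)
print(f"{estimate / 1e9:.1f} GB")  # prints "518.4 GB"
```

The real ingestion rate can be read from a running server with the query `rate(prometheus_tsdb_head_samples_appended_total[5m])`.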

**Configuration**:
- Unified dashboards
- Project-specific dashboards
- Alert rules per project
- Centralized alerting

### 3.2 Logging (Loki/ELK)

**Option 1: Loki (Recommended)**
- **Deployment**: Shared Kubernetes cluster
- **Storage**: Object storage (MinIO, S3)
- **Retention**: 90 days
- **Query**: Grafana (LogQL)

**Option 2: ELK Stack**
- **Deployment**: Separate cluster or VMs
- **Storage**: Elasticsearch cluster
- **Retention**: 90 days
- **Query**: Kibana

**Configuration**:
- Centralized log aggregation
- Project-specific log streams
- Log parsing and indexing
- Search and analysis

### 3.3 Alerting

**System**: Alertmanager (Prometheus)
- **Channels**: Email, Slack, PagerDuty
- **Routing**: Per project, per severity
- **Grouping**: Group related alerts (for example by project and alert name) into a single notification
- **Silencing**: Silences managed through the Alertmanager UI or API
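Per-project, per-severity routing might look like the routing tree built below. This is a sketch under two assumptions: every alert carries `project` and `severity` labels, and the receiver names (all hypothetical) are defined in the `receivers` section of the Alertmanager configuration. The timing values are illustrative.

```python
import json


def project_route(project: str) -> dict:
    """Route one project's alerts to Slack, escalating critical ones to PagerDuty."""
    return {
        "matchers": [f'project="{project}"'],
        "receiver": f"{project}-slack",
        "routes": [
            {"matchers": ['severity="critical"'], "receiver": f"{project}-pagerduty"},
        ],
    }


def alertmanager_route(projects: list[str]) -> dict:
    """Build the top-level `route` block with one child route per project."""
    return {
        "receiver": "default-email",  # fallback for alerts without a project match
        "group_by": ["project", "alertname"],
        "group_wait": "30s",
        "group_interval": "5m",
        "repeat_interval": "4h",
        "routes": [project_route(p) for p in projects],
    }


print(json.dumps({"route": alertmanager_route(["dbis_core", "the_order"])}, indent=2))
```

Because JSON is a subset of YAML, the printed structure can be pasted into an Alertmanager configuration file once the receivers are defined.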

---

## Phase 4: Shared CI/CD Infrastructure (Weeks 8-11)

### 4.1 Container Registry

**Option 1: Harbor (Recommended)**
- **Deployment**: Shared Kubernetes cluster
- **Features**: Vulnerability scanning, replication
- **Storage**: Object storage backend
- **Access**: Project-based access control

**Option 2: GitLab Container Registry**
- **Deployment**: GitLab instance
- **Features**: Integrated with GitLab CI/CD
- **Storage**: Object storage backend

**Configuration**:
- Project-specific repositories
- Automated vulnerability scanning
- Image signing
- Retention policies
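To make "retention policies" concrete, the following sketch shows one possible rule: always keep the newest N tags of a repository, and delete older tags once they pass a maximum age. The rule and the function are illustrative only. They do not call the Harbor or GitLab API, both of which provide their own retention settings.

```python
from datetime import datetime, timedelta, timezone


def tags_to_delete(
    tags: dict[str, datetime],
    keep_last: int,
    max_age_days: int,
    now: datetime,
) -> list[str]:
    """Return tags that are neither among the newest `keep_last` nor recent enough."""
    newest_first = sorted(tags, key=lambda tag: tags[tag], reverse=True)
    protected = set(newest_first[:keep_last])
    cutoff = now - timedelta(days=max_age_days)
    return [t for t in newest_first if t not in protected and tags[t] < cutoff]


now = datetime(2025, 1, 27, tzinfo=timezone.utc)
example_tags = {
    "v1": now - timedelta(days=100),
    "v2": now - timedelta(days=40),
    "v3": now - timedelta(days=5),
    "v4": now - timedelta(days=1),
}
print(tags_to_delete(example_tags, keep_last=2, max_age_days=30, now=now))
# prints ['v2', 'v1']
```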

### 4.2 Build Infrastructure

**Shared Build Runners**:
- **Type**: Kubernetes runners (GitLab Runner, GitHub Actions Runner)
- **Resources**: Auto-scaling based on queue
- **Caching**: Shared build cache
- **Isolation**: Per-project isolation
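Queue-based auto-scaling reduces to a small calculation: size the runner pool to the number of pending and running jobs, clamped between a floor and a ceiling. The sketch below assumes hypothetical defaults (one job per runner, 1 to 20 runners); real runner autoscalers expose similar knobs under their own names.

```python
import math


def desired_runners(
    queued_jobs: int,
    running_jobs: int,
    jobs_per_runner: int = 1,
    min_runners: int = 1,
    max_runners: int = 20,
) -> int:
    """Number of runners needed for the current load, within fixed bounds."""
    needed = math.ceil((queued_jobs + running_jobs) / jobs_per_runner)
    return max(min_runners, min(max_runners, needed))


print(desired_runners(queued_jobs=3, running_jobs=2))    # prints 5
print(desired_runners(queued_jobs=0, running_jobs=0))    # prints 1 (floor)
print(desired_runners(queued_jobs=50, running_jobs=10))  # prints 20 (ceiling)
```

The ceiling is what keeps peak build load inside the CPU budget reserved for runners.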

**Benefits**:
- Reduced build infrastructure
- Faster builds (shared cache)
- Consistent build environment
- Centralized management

---

## Phase 5: Unified Networking (Weeks 9-12)

### 5.1 Ingress Controller

**Deployment**: Traefik or NGINX Ingress Controller
- **SSL/TLS**: cert-manager with Let's Encrypt
- **Routing**: Per-project routing rules
- **Load Balancing**: Unified load balancing
- **Rate Limiting**: Per-project rate limits
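A per-project routing rule with automated TLS could be expressed as an Ingress like the one built below. The host name, the service name and port, and the ClusterIssuer name `letsencrypt-prod` are all assumptions for the example. The rate-limit annotation is specific to the NGINX Ingress Controller; with Traefik, rate limiting is configured through a Middleware resource instead.

```python
import json


def project_ingress(namespace: str, host: str, service: str, port: int) -> dict:
    """Ingress routing one host to one service, with a cert-manager issued certificate."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": f"{namespace}-ingress",
            "namespace": namespace,
            "annotations": {
                # cert-manager watches this annotation and requests the certificate.
                "cert-manager.io/cluster-issuer": "letsencrypt-prod",
                # ingress-nginx only: requests per second allowed per client IP.
                "nginx.ingress.kubernetes.io/limit-rps": "20",
            },
        },
        "spec": {
            "ingressClassName": "nginx",
            "tls": [{"hosts": [host], "secretName": f"{namespace}-tls"}],
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {"name": service, "port": {"number": port}}
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


print(json.dumps(project_ingress("dbis-core", "dbis-core.example.com", "api", 8080), indent=2))
```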

### 5.2 Service Mesh (Optional)

**Option**: Istio or Linkerd
- **Features**: mTLS, traffic management, observability
- **Benefits**: Enhanced security, traffic control
- **Complexity**: Higher setup and maintenance

---

## Resource Requirements

### Shared Infrastructure Totals

**Kubernetes Clusters**:
- **Dev/Staging**: 50-100 CPU cores, 200-400 GB RAM
- **Production**: 100-200 CPU cores, 400-800 GB RAM

**Database Services**:
- **PostgreSQL**: 20-40 CPU cores, 100-200 GB RAM, 500 GB - 2 TB storage
- **Redis**: 8-16 CPU cores, 32-64 GB RAM, 100-200 GB storage

**Monitoring Stack**:
- **Prometheus/Grafana**: 8-16 CPU cores, 32-64 GB RAM, 500 GB - 1 TB storage
- **Logging**: 16-32 CPU cores, 64-128 GB RAM, 1-2 TB storage

**CI/CD Infrastructure**:
- **Container Registry**: 4-8 CPU cores, 16-32 GB RAM, 500 GB - 1 TB storage
- **Build Runners**: Auto-scaling (10-50 CPU cores peak)

**Total Estimated Resources**:
- **CPU**: 200-400 cores (shared)
- **RAM**: 800-1600 GB (shared)
- **Storage**: 3-6 TB (shared)

**Cost Reduction**: 30-40% compared to separate infrastructure
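The totals can be cross-checked by summing the component ranges listed above. The sketch assumes the components are additive (that is, databases, monitoring, and the registry are budgeted on top of the cluster figures rather than inside them) and leaves out the build runners, whose figure is a peak rather than a steady allocation.

```python
# (low, high) ranges copied from the component lists in this section.
CPU_CORES = {
    "k8s_dev_staging": (50, 100),
    "k8s_production": (100, 200),
    "postgresql": (20, 40),
    "redis": (8, 16),
    "prometheus_grafana": (8, 16),
    "logging": (16, 32),
    "container_registry": (4, 8),
}
RAM_GB = {
    "k8s_dev_staging": (200, 400),
    "k8s_production": (400, 800),
    "postgresql": (100, 200),
    "redis": (32, 64),
    "prometheus_grafana": (32, 64),
    "logging": (64, 128),
    "container_registry": (16, 32),
}
STORAGE_GB = {
    "postgresql": (500, 2000),
    "redis": (100, 200),
    "prometheus_grafana": (500, 1000),
    "logging": (1000, 2000),
    "container_registry": (500, 1000),
}


def total(ranges: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of all component ranges."""
    return (
        sum(low for low, _ in ranges.values()),
        sum(high for _, high in ranges.values()),
    )


print("CPU cores:", total(CPU_CORES))    # (206, 412)
print("RAM GB:", total(RAM_GB))          # (844, 1688)
print("Storage GB:", total(STORAGE_GB))  # (2600, 6200)
```

The sums land within rounding distance of the stated totals of 200-400 cores, 800-1600 GB RAM, and 3-6 TB storage.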

---

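## Migration Strategy

The rollout phases below are numbered separately from the workstream Phases 1-5 above; they describe the order in which environments move onto the shared platform.
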
### Phase 1: Preparation (Weeks 1-2)
- [ ] Design shared infrastructure architecture
- [ ] Plan resource allocation
- [ ] Create migration scripts
- [ ] Set up monitoring baseline

### Phase 2: Dev/Staging (Weeks 3-6)
- [ ] Deploy shared dev/staging cluster
- [ ] Migrate 3-5 projects as pilot
- [ ] Set up shared databases (dev/staging)
- [ ] Deploy unified monitoring (dev/staging)
- [ ] Test and validate

### Phase 3: Production (Weeks 7-12)
- [ ] Deploy shared production cluster
- [ ] Migrate projects to production cluster
- [ ] Set up shared databases (production)
- [ ] Deploy unified monitoring (production)
- [ ] Complete migration

### Phase 4: Optimization (Weeks 13+)
- [ ] Optimize resource allocation
- [ ] Fine-tune monitoring and alerting
- [ ] Performance optimization
- [ ] Cost optimization

---

## Security Considerations

### Namespace Isolation
- Network policies per namespace
- RBAC per namespace
- Resource quotas per namespace
- Pod Security Standards enforced via Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)

### Secrets Management
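- HashiCorp Vault (as planned for production) or Kubernetes Secrets
- Encrypted at rest (Kubernetes Secrets are only base64-encoded by default, so etcd encryption at rest must be enabled explicitly)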
- Encrypted in transit
- Rotation policies

### Network Security
- mTLS between services (optional service mesh)
- Network policies
- Ingress with WAF
- DDoS protection

---

## Monitoring and Alerting

### Key Metrics
- Resource utilization (CPU, RAM, storage)
- Application performance (latency, throughput)
- Error rates
- Infrastructure health

### Alerting Rules
- High resource utilization
- Service failures
- Security incidents
- Performance degradation
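As one example of a "high resource utilization" rule, the sketch below builds a Prometheus alerting rule for sustained CPU usage. It assumes node_exporter is deployed (the `node_cpu_seconds_total` metric comes from it); the alert name, threshold, and duration are illustrative choices.

```python
import json


def high_cpu_rule(threshold_percent: int = 85) -> dict:
    """Alerting rule: per-instance CPU utilization above a threshold for 10 minutes."""
    expr = (
        "100 - (avg by (instance) "
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) '
        f"> {threshold_percent}"
    )
    return {
        "alert": "HighCpuUtilization",
        "expr": expr,
        "for": "10m",
        "labels": {"severity": "warning"},
        "annotations": {
            "summary": "CPU utilization on {{ $labels.instance }} is {{ $value }}%",
        },
    }


rule_file = {"groups": [{"name": "shared-infrastructure", "rules": [high_cpu_rule()]}]}
print(json.dumps(rule_file, indent=2))
```

The `severity` label is what a per-severity routing tree in Alertmanager would match on.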

---

## Success Metrics

- [ ] 30-40% reduction in infrastructure costs
- [ ] 80% of projects on shared infrastructure
- [ ] 50% reduction in duplicate services
- [ ] 99.9% uptime for shared services
- [ ] 50% faster deployment times
- [ ] Unified monitoring and alerting operational
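The 99.9% uptime target translates into a concrete downtime budget, which is easier to track than a percentage. A short calculation, with the measurement periods chosen for illustration:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Downtime budget in minutes for a given availability target and period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


print(f"{allowed_downtime_minutes(99.9, 30):.1f} min per 30 days")    # 43.2
print(f"{allowed_downtime_minutes(99.9, 365):.1f} min per 365 days")  # 525.6
```

At 99.9%, a single outage of about 45 minutes uses up the whole budget for a 30-day period.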

---

**Last Updated**: 2025-01-27
**Next Review**: After Phase 1 completion