# Bring-Up Checklist
## Day-One Installation Guide
This checklist provides a step-by-step guide for bringing up the complete Azure Stack HCI environment on installation day.
## Pre-Installation Preparation
### Hardware Verification
- [ ] Router server chassis received and inspected
- [ ] All PCIe cards received (NICs, HBAs, QAT)
- [ ] Memory modules received (8× 4GB DDR4 ECC RDIMM)
- [ ] Storage SSD received (256GB)
- [ ] All cables received (Ethernet, Mini-SAS HD)
- [ ] Storage shelves received and inspected
- [ ] Proxmox hosts (ML110, R630) verified operational
### Documentation Review
- [ ] Complete architecture reviewed
- [ ] PCIe slot allocation map reviewed
- [ ] Network topology and VLAN schema reviewed
- [ ] Driver matrix reviewed
- [ ] All configuration files prepared
### Environment Configuration
- [ ] Copy `.env.example` to `.env`
- [ ] Configure Azure credentials in `.env`:
  - [ ] `AZURE_SUBSCRIPTION_ID`
  - [ ] `AZURE_TENANT_ID`
  - [ ] `AZURE_RESOURCE_GROUP`
  - [ ] `AZURE_LOCATION`
- [ ] Configure Cloudflare credentials in `.env`:
  - [ ] `CLOUDFLARE_API_TOKEN`
  - [ ] `CLOUDFLARE_ACCOUNT_EMAIL`
- [ ] Configure Proxmox credentials in `.env`:
  - [ ] `PVE_ROOT_PASS` (shared root password for all instances)
  - [ ] `PROXMOX_ML110_URL`
  - [ ] `PROXMOX_R630_URL`
  - [ ] Note: Username `root@pam` is implied and should not be stored
  - [ ] For production: Create RBAC accounts and use API tokens instead of root
- [ ] Verify `.env` file is in `.gitignore` (should not be committed)
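
The environment items above can be checked mechanically before installation day. The following is a minimal sketch, assuming `.env` holds plain `KEY=value` lines; the helper name `missing_env_keys` is hypothetical, and the key list is copied from the checklist above.

```shell
#!/bin/sh
# Hypothetical helper: print every required key that is absent or empty in a .env file.
# Assumes plain KEY=value lines (no "export" prefix, no multi-line values).

REQUIRED_KEYS="AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_RESOURCE_GROUP AZURE_LOCATION
CLOUDFLARE_API_TOKEN CLOUDFLARE_ACCOUNT_EMAIL
PVE_ROOT_PASS PROXMOX_ML110_URL PROXMOX_R630_URL"

missing_env_keys() (
    env_file="$1"
    for key in $REQUIRED_KEYS; do
        # A key counts as present only if at least one character follows the "=".
        # Note: a quoted empty string such as KEY="" still passes this check.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            printf '%s\n' "$key"
        fi
    done
)
```

Running `missing_env_keys .env` prints nothing when every key is set. For the last item, `git check-ignore -q .env` exits with status 0 when an ignore rule matches the file.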
## Phase 1: Hardware Installation
### Router Server Assembly
- [ ] Install CPU and memory (8× 4GB DDR4 ECC RDIMM)
- [ ] Install boot SSD (256GB)
- [ ] Install Intel QAT 8970 in x16_1 slot
- [ ] Install Intel X550-T2 in x8_1 slot
- [ ] Install LSI 9207-8e #1 in x8_2 slot
- [ ] Install LSI 9207-8e #2 in x8_3 slot
- [ ] Install Intel i350-T4 in x4_1 slot
- [ ] Install Intel i350-T8 in x4_2 slot
- [ ] Install Intel i225 Quad-Port in x4_3 slot
- [ ] Verify all cards seated properly
- [ ] Connect power and verify POST
### BIOS/UEFI Configuration
- [ ] Enter BIOS/UEFI setup
- [ ] Verify all PCIe cards detected
- [ ] Configure boot order (SSD first)
- [ ] Enable virtualization (Intel VT-x, VT-d)
- [ ] Configure memory settings (ECC enabled)
- [ ] Set date/time
- [ ] Save and exit BIOS
### Storage Shelf Cabling
- [ ] Connect SFF-8644 cables from LSI HBA #1 to shelves 1-2
- [ ] Connect SFF-8644 cables from LSI HBA #2 to shelves 3-4
- [ ] Power on storage shelves
- [ ] Verify shelf power and status LEDs
- [ ] Label all cables
### Network Cabling
- [ ] Connect 4× Cat6 cables from i350-T4 to Spectrum modems/ONTs (WAN1-4)
- [ ] Connect 2× Cat6a cables to X550-T2 (reserved for future use)
- [ ] Connect 4× Cat6 cables from i225 Quad to ML110, R630, and key services
- [ ] Connect 8× Cat6 cables from i350-T8 to remaining servers/appliances
- [ ] Label all cables at both ends
- [ ] Document cable mapping
## Phase 2: Operating System Installation
### Router Server OS
**Option A: Windows Server Core**
- [ ] Boot from Windows Server installation media
- [ ] Install Windows Server Core
- [ ] Configure initial administrator password
- [ ] Install Windows Updates
- [ ] Configure static IP on management interface
- [ ] Enable Remote Desktop (if needed)
- [ ] Install Windows Admin Center

**Option B: Proxmox VE**
- [ ] Boot from Proxmox VE installation media
- [ ] Install Proxmox VE
- [ ] Configure initial root password
- [ ] Configure network (management interface)
- [ ] Update Proxmox packages
- [ ] Verify Proxmox web interface accessible
### Proxmox Hosts (ML110, R630)
- [ ] Verify Proxmox VE installed and updated
- [ ] Configure network interfaces
- [ ] Verify cluster status (if clustered)
- [ ] Test VM creation
## Phase 3: Driver Installation
### Router Server Drivers
- [ ] Install Intel PROSet drivers for all NICs
  - [ ] i350-T4 (WAN)
  - [ ] i350-T8 (LAN 1GbE)
  - [ ] X550-T2 (10GbE)
  - [ ] i225 Quad-Port (LAN 2.5GbE)
- [ ] Verify all NICs detected and functional
- [ ] Install LSI mpt3sas driver
- [ ] Flash LSI HBAs to IT mode
- [ ] Verify storage shelves detected
- [ ] Install Intel QAT drivers (qatlib)
- [ ] Install OpenSSL QAT engine
- [ ] Verify QAT acceleration working
### Driver Verification
- [ ] Run driver verification script
- [ ] Test all network ports
- [ ] Test storage connectivity
- [ ] Test QAT acceleration
- [ ] Document any issues
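
The checklist refers to a driver verification script without showing it. The following is a minimal sketch of one possible check, assuming the Router server runs Linux (Option B). The module names are assumptions to confirm against the driver matrix: `igb` for the i350 cards, `ixgbe` for the X550-T2, `igc` for the i225, `mpt3sas` for the LSI HBAs, and `qat_c62x` for the QAT 8970. The helper name `missing_modules` is hypothetical.

```shell
#!/bin/sh
# Kernel modules assumed for this hardware on Linux (verify against the driver matrix).
EXPECTED_MODULES="igb ixgbe igc mpt3sas qat_c62x"

# Hypothetical helper: read a module list on stdin (the format of /proc/modules
# or lsmod, where the first field is the module name) and print each expected
# module that is not loaded.
missing_modules() (
    loaded="$(awk '{print $1}')"
    for mod in $EXPECTED_MODULES; do
        # -x matches whole lines only, so "igb" does not match "igbvf".
        if ! printf '%s\n' "$loaded" | grep -qx "$mod"; then
            printf '%s\n' "$mod"
        fi
    done
)
```

On the Router server, `missing_modules < /proc/modules` prints nothing when all expected modules are loaded. A loaded module only shows that the driver is present, so the port, storage, and QAT tests above are still needed.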
## Phase 4: Network Configuration
### OpenWrt VM Setup
- [ ] Create OpenWrt VM on Router server
- [ ] Configure OpenWrt network interfaces
- [ ] Configure VLANs (10, 20, 30, 40, 50, 60, 99)
- [ ] Configure mwan3 for 4× Spectrum WAN
- [ ] Configure firewall zones
- [ ] Test multi-WAN failover
- [ ] Configure inter-VLAN routing
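
A simple way to exercise the multi-WAN failover test is to probe each uplink separately while unplugging one at a time. The following is a minimal sketch, assuming a `ping` that accepts `-c`, `-W`, and `-I`; the helper name `wan_report` and the interface names in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: ping a target through each listed interface and
# report whether that uplink currently passes traffic.
# Arguments: target address, then one or more interface names.
wan_report() (
    target="$1"; shift
    for iface in "$@"; do
        # -I binds the probe to one interface so the default route is not used.
        if ping -c 3 -W 2 -I "$iface" "$target" >/dev/null 2>&1; then
            printf '%s up\n' "$iface"
        else
            printf '%s DOWN\n' "$iface"
        fi
    done
)
```

For example, `wan_report 1.1.1.1 eth1 eth2 eth3 eth4` prints one line per uplink. On OpenWrt, `mwan3 status` shows how mwan3 itself currently rates each interface.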
### Proxmox VLAN Configuration
- [ ] Configure VLAN bridges on ML110
- [ ] Configure VLAN bridges on R630
- [ ] Test VLAN connectivity
- [ ] Verify VM network isolation
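
On Proxmox, a VLAN-aware bridge is one common way to carry several VLANs to VMs over a single port. The following is a minimal sketch that prints such a stanza for `/etc/network/interfaces`; the helper name `vlan_bridge_stanza`, the bridge name, and the port name are hypothetical, and the VLAN IDs come from the OpenWrt section above.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/network/interfaces stanza for a
# VLAN-aware Linux bridge.
# Arguments: bridge name, physical port, then the VLAN IDs to allow.
vlan_bridge_stanza() (
    bridge="$1"; port="$2"; shift 2
    printf 'auto %s\n' "$bridge"
    printf 'iface %s inet manual\n' "$bridge"
    printf '    bridge-ports %s\n' "$port"
    printf '    bridge-stp off\n'
    printf '    bridge-fd 0\n'
    printf '    bridge-vlan-aware yes\n'
    # "$*" joins the remaining arguments with single spaces.
    printf '    bridge-vids %s\n' "$*"
)
```

For example, `vlan_bridge_stanza vmbr1 eno2 10 20 30 40 50 60 99` prints a stanza to review before adding it to the host configuration. Each VM NIC then selects its VLAN through the tag set on that NIC.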
### IP Address Configuration
- [ ] Configure IP addresses per VLAN schema
- [ ] Configure DNS settings
- [ ] Test network connectivity
- [ ] Verify routing between VLANs
## Phase 5: Storage Configuration
### Storage Spaces Direct Setup
- [ ] Verify all shelves detected
- [ ] Create Storage Spaces Direct pools
- [ ] Create volumes for VMs
- [ ] Create volumes for applications
- [ ] Configure storage exports (NFS/iSCSI)
### Proxmox Storage Mounts
- [ ] Configure NFS mounts on ML110
- [ ] Configure NFS mounts on R630
- [ ] Test storage connectivity
- [ ] Verify VM storage access
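
The NFS mounts can be registered from the command line with `pvesm`. The following is a minimal sketch, assuming the export holds VM disk images; the helper name `add_nfs_storage` is hypothetical, and the storage ID, server address, and export path in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: register an NFS export as Proxmox storage for VM disk images.
# Arguments: storage ID, NFS server address, export path.
# Set DRY_RUN=1 to print the command instead of running it.
add_nfs_storage() (
    storage_id="$1"; server="$2"; export_path="$3"
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} pvesm add nfs "$storage_id" \
        --server "$server" --export "$export_path" --content images
)
```

For example, `DRY_RUN=1 add_nfs_storage vm-store 192.0.2.10 /export/vms` previews the command. After a real run, `pvesm status` lists the new storage and whether it is active.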
## Phase 6: Azure Arc Onboarding
### Arc Agent Installation
- [ ] Install Azure Arc agent on Router server (if Linux)
- [ ] Install Azure Arc agent on ML110
- [ ] Install Azure Arc agent on R630
- [ ] Install Azure Arc agent on Windows management VM (if applicable)
### Arc Onboarding
- [ ] Load environment variables from `.env`: `set -a; . ./.env; set +a` (exports every assignment, including quoted values that contain spaces)
- [ ] Configure Azure subscription and resource group (from `.env`)
- [ ] Onboard Router server to Azure Arc
- [ ] Onboard ML110 to Azure Arc
- [ ] Onboard R630 to Azure Arc
- [ ] Verify all resources visible in Azure Portal
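
Onboarding each machine comes down to one `azcmagent connect` call built from the Azure variables in `.env`. The following is a minimal sketch, assuming those variables are already loaded into the shell and the Arc agent is installed; the helper name `arc_connect` is hypothetical. Without service principal flags, the agent is expected to prompt for an interactive sign-in.

```shell
#!/bin/sh
# Hypothetical helper: onboard the local machine to Azure Arc using the
# Azure variables from .env. Set DRY_RUN=1 to print the command instead.
arc_connect() (
    # Abort with an error if any required variable is unset or empty,
    # so a partial command is never run.
    : "${AZURE_SUBSCRIPTION_ID:?not set}" "${AZURE_TENANT_ID:?not set}"
    : "${AZURE_RESOURCE_GROUP:?not set}" "${AZURE_LOCATION:?not set}"
    ${DRY_RUN:+echo} azcmagent connect \
        --subscription-id "$AZURE_SUBSCRIPTION_ID" \
        --tenant-id "$AZURE_TENANT_ID" \
        --resource-group "$AZURE_RESOURCE_GROUP" \
        --location "$AZURE_LOCATION"
)
```

Run `DRY_RUN=1 arc_connect` first to review the command on each host. After onboarding, `azcmagent show` reports the agent status locally, which complements the Azure Portal check.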
### Arc Governance
- [ ] Configure Azure Policy
- [ ] Enable Azure Monitor
- [ ] Enable Microsoft Defender for Cloud (formerly Azure Defender)
- [ ] Configure Update Management
- [ ] Test policy enforcement
## Phase 7: Cloudflare Integration
### Cloudflare Tunnel Setup
- [ ] Create Cloudflare account (if one does not already exist)
- [ ] Create Zero Trust organization
- [ ] Configure Cloudflare API token in `.env` file
- [ ] Install cloudflared on Ubuntu VM
- [ ] Authenticate cloudflared (interactive or using API token from `.env`)
- [ ] Configure Tunnel for WAC
- [ ] Configure Tunnel for Proxmox UI
- [ ] Configure Tunnel for dashboards
- [ ] Configure Tunnel for Git/CI services
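
The per-service Tunnel items above map to ingress rules in the cloudflared configuration file. The following is a minimal sketch that renders such a file from a list of hostname and service pairs; the helper name `render_tunnel_config` is hypothetical, and the hostnames, origin URLs, and file paths used with it are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: render a cloudflared config.yml from
# "hostname service" pairs read on stdin, one pair per line.
# Arguments: tunnel UUID, path to the tunnel credentials JSON file.
render_tunnel_config() (
    printf 'tunnel: %s\n' "$1"
    printf 'credentials-file: %s\n' "$2"
    printf 'ingress:\n'
    while read -r hostname service; do
        [ -n "$hostname" ] || continue   # skip blank lines
        printf '  - hostname: %s\n    service: %s\n' "$hostname" "$service"
    done
    # The last rule must be a catch-all with no hostname.
    printf '  - service: http_status:404\n'
)
```

For example, piping a line such as `pve.example.com https://localhost:8006` into `render_tunnel_config <tunnel-uuid> <credentials-path>` produces one ingress rule for the Proxmox UI plus the catch-all. The result can be checked with `cloudflared tunnel ingress validate` before starting the service.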
### Zero Trust Policies
- [ ] Configure SSO (Microsoft Entra ID, formerly Azure AD, or Okta)
- [ ] Configure MFA requirements
- [ ] Configure device posture checks
- [ ] Configure access policies
- [ ] Test external access
### WAF Configuration
- [ ] Configure WAF rules
- [ ] Test WAF protection
- [ ] Verify that no inbound ports need to be open (Tunnel connections are outbound-only)
## Phase 8: Service VM Deployment
### Ubuntu VM Templates
- [ ] Create Ubuntu LTS template on Proxmox
- [ ] Install Azure Arc agent in template
- [ ] Configure base packages
- [ ] Create VM snapshots
### Service VM Deployment
- [ ] Deploy Cloudflare Tunnel VM (VLAN 99)
- [ ] Deploy Reverse Proxy VM (VLAN 30/99)
- [ ] Deploy Observability VM (VLAN 40)
- [ ] Deploy CI/CD VM (VLAN 50)
- [ ] Install Azure Arc agents on all VMs
### Service Configuration
- [ ] Configure Cloudflare Tunnel
- [ ] Configure reverse proxy (NGINX/Traefik)
- [ ] Configure observability stack (Prometheus/Grafana)
- [ ] Configure CI/CD (GitLab Runner/Jenkins)
## Phase 9: Verification and Testing
### Network Testing
- [ ] Test all WAN connections
- [ ] Test multi-WAN failover
- [ ] Test VLAN isolation
- [ ] Test inter-VLAN routing
- [ ] Test firewall rules
### Storage Testing
- [ ] Test storage read/write performance
- [ ] Test storage redundancy
- [ ] Test VM storage access
- [ ] Test storage exports
### Service Testing
- [ ] Test Cloudflare Tunnel access
- [ ] Test Azure Arc connectivity
- [ ] Test observability dashboards
- [ ] Test CI/CD pipelines
### Performance Testing
- [ ] Test QAT acceleration
- [ ] Test network throughput
- [ ] Test storage I/O
- [ ] Document performance metrics
## Phase 10: Documentation and Handoff
### Documentation
- [ ] Document all IP addresses
- [ ] Verify `.env` file contains all credentials (stored securely, not in version control)
- [ ] Document cable mappings
- [ ] Document VLAN configurations
- [ ] Document storage allocations
- [ ] Create network diagrams
- [ ] Create runbooks
- [ ] Verify `.env` is in `.gitignore` and not committed to repository
### Monitoring Setup
- [ ] Configure Grafana dashboards
- [ ] Configure Prometheus alerts
- [ ] Configure Azure Monitor alerts
- [ ] Test alerting
### Security Hardening
- [ ] Review firewall rules
- [ ] Review access policies
- [ ] Create RBAC accounts for Proxmox (replace root usage)
  - [ ] Create service accounts for automation
  - [ ] Create operator accounts with appropriate roles
  - [ ] Generate API tokens for service accounts
  - [ ] Document RBAC account usage (see `docs/security/proxmox-rbac.md`)
- [ ] Review secret management
- [ ] Perform security scan
## Post-Installation Tasks
### Ongoing Maintenance
- [ ] Schedule regular backups
- [ ] Schedule firmware updates
- [ ] Schedule driver updates
- [ ] Schedule OS updates
- [ ] Schedule security patches
### Monitoring
- [ ] Review monitoring dashboards daily
- [ ] Review Azure Arc status
- [ ] Review Cloudflare Tunnel status
- [ ] Review storage health
- [ ] Review network performance
## Troubleshooting Reference
### Common Issues
**Issue:** NIC not detected
- Check PCIe slot connection
- Check BIOS settings
- Update driver

**Issue:** Storage shelves not detected
- Check cable connections
- Check HBA firmware
- Check shelf power

**Issue:** Azure Arc not connecting
- Check network connectivity
- Check proxy settings
- Check Azure credentials

**Issue:** Cloudflare Tunnel not working
- Check cloudflared service
- Check Tunnel configuration
- Check Zero Trust policies
## Related Documentation
- [Complete Architecture](complete-architecture.md) - Full architecture overview
- [Hardware BOM](hardware-bom.md) - Complete bill of materials
- [PCIe Allocation](pcie-allocation.md) - Slot allocation map
- [Network Topology](network-topology.md) - VLAN/IP schema
- [Driver Matrix](driver-matrix.md) - Driver versions