Initial commit: loc_az_hci (smom-dbis-138 excluded via .gitignore)

Co-authored-by: Cursor <cursoragent@cursor.com>
defiQUG
2026-02-08 09:04:46 -08:00
commit c39465c2bd
386 changed files with 50649 additions and 0 deletions

@@ -0,0 +1,377 @@
# Bring-Up Checklist
## Day-One Installation Guide
This checklist provides a step-by-step guide for bringing up the complete Azure Stack HCI environment on installation day.
## Pre-Installation Preparation
### Hardware Verification
- [ ] Router server chassis received and inspected
- [ ] All PCIe cards received (NICs, HBAs, QAT)
- [ ] Memory modules received (8× 4GB DDR4 ECC RDIMM)
- [ ] Storage SSD received (256GB)
- [ ] All cables received (Ethernet, Mini-SAS HD)
- [ ] Storage shelves received and inspected
- [ ] Proxmox hosts (ML110, R630) verified operational
### Documentation Review
- [ ] Complete architecture reviewed
- [ ] PCIe slot allocation map reviewed
- [ ] Network topology and VLAN schema reviewed
- [ ] Driver matrix reviewed
- [ ] All configuration files prepared
### Environment Configuration
- [ ] Copy `.env.example` to `.env`
- [ ] Configure Azure credentials in `.env`:
  - [ ] `AZURE_SUBSCRIPTION_ID`
  - [ ] `AZURE_TENANT_ID`
  - [ ] `AZURE_RESOURCE_GROUP`
  - [ ] `AZURE_LOCATION`
- [ ] Configure Cloudflare credentials in `.env`:
  - [ ] `CLOUDFLARE_API_TOKEN`
  - [ ] `CLOUDFLARE_ACCOUNT_EMAIL`
- [ ] Configure Proxmox credentials in `.env`:
  - [ ] `PVE_ROOT_PASS` (shared root password for all instances)
  - [ ] `PROXMOX_ML110_URL`
  - [ ] `PROXMOX_R630_URL`
  - Note: Username `root@pam` is implied and should not be stored
  - [ ] For production: Create RBAC accounts and use API tokens instead of root
- [ ] Verify `.env` file is in `.gitignore` (should not be committed)
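
The variable list above can be checked mechanically before installation day. The following is a minimal sketch, not part of the repository: the function name `missing_env_keys` is hypothetical, and it assumes `.env` uses plain `KEY=value` lines without an `export` prefix.

```shell
# Hypothetical helper: print each required key that is missing or empty.
# The key names come from the checklist; everything else is an assumption.
missing_env_keys() {
  local env_file="${1:-.env}" key
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 2; }
  for key in AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_RESOURCE_GROUP \
             AZURE_LOCATION CLOUDFLARE_API_TOKEN CLOUDFLARE_ACCOUNT_EMAIL \
             PVE_ROOT_PASS PROXMOX_ML110_URL PROXMOX_R630_URL; do
    # A key counts as set only when a line "KEY=<non-empty value>" exists.
    grep -Eq "^${key}=.+" "$env_file" || echo "$key"
  done
}
```

Running `missing_env_keys .env` prints nothing when every key has a value; any printed name still needs to be filled in.
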
## Phase 1: Hardware Installation
### Router Server Assembly
- [ ] Install CPU and memory (8× 4GB DDR4 ECC RDIMM)
- [ ] Install boot SSD (256GB)
- [ ] Install Intel QAT 8970 in x16_1 slot
- [ ] Install Intel X550-T2 in x8_1 slot
- [ ] Install LSI 9207-8e #1 in x8_2 slot
- [ ] Install LSI 9207-8e #2 in x8_3 slot
- [ ] Install Intel i350-T4 in x4_1 slot
- [ ] Install Intel i350-T8 in x4_2 slot
- [ ] Install Intel i225 Quad-Port in x4_3 slot
- [ ] Verify all cards seated properly
- [ ] Connect power and verify POST
### BIOS/UEFI Configuration
- [ ] Enter BIOS/UEFI setup
- [ ] Verify all PCIe cards detected
- [ ] Configure boot order (SSD first)
- [ ] Enable virtualization (Intel VT-x, VT-d)
- [ ] Configure memory settings (ECC enabled)
- [ ] Set date/time
- [ ] Save and exit BIOS
### Storage Shelf Cabling
- [ ] Connect SFF-8644 cables from LSI HBA #1 to shelves 1-2
- [ ] Connect SFF-8644 cables from LSI HBA #2 to shelves 3-4
- [ ] Power on storage shelves
- [ ] Verify shelf power and status LEDs
- [ ] Label all cables
### Network Cabling
- [ ] Connect 4× Cat6 cables from i350-T4 to Spectrum modems/ONTs (WAN1-4)
- [ ] Connect 2× Cat6a cables to X550-T2 (reserved for future)
- [ ] Connect 4× Cat6 cables from i225 Quad to ML110, R630, and key services
- [ ] Connect 8× Cat6 cables from i350-T8 to remaining servers/appliances
- [ ] Label all cables at both ends
- [ ] Document cable mapping
## Phase 2: Operating System Installation
### Router Server OS
**Option A: Windows Server Core**
- [ ] Boot from Windows Server installation media
- [ ] Install Windows Server Core
- [ ] Configure initial administrator password
- [ ] Install Windows Updates
- [ ] Configure static IP on management interface
- [ ] Enable Remote Desktop (if needed)
- [ ] Install Windows Admin Center
**Option B: Proxmox VE**
- [ ] Boot from Proxmox VE installation media
- [ ] Install Proxmox VE
- [ ] Configure initial root password
- [ ] Configure network (management interface)
- [ ] Update Proxmox packages
- [ ] Verify Proxmox web interface accessible
### Proxmox Hosts (ML110, R630)
- [ ] Verify Proxmox VE installed and updated
- [ ] Configure network interfaces
- [ ] Verify cluster status (if clustered)
- [ ] Test VM creation
## Phase 3: Driver Installation
### Router Server Drivers
- [ ] Install NIC drivers for all ports (Intel PROSet on Windows Server; the in-kernel `igb`, `ixgbe`, and `igc` modules on Proxmox VE)
  - [ ] i350-T4 (WAN)
  - [ ] i350-T8 (LAN 1GbE)
  - [ ] X550-T2 (10GbE)
  - [ ] i225 Quad-Port (LAN 2.5GbE)
- [ ] Verify all NICs detected and functional
- [ ] Install LSI mpt3sas driver
- [ ] Flash LSI HBAs to IT mode
- [ ] Verify storage shelves detected
- [ ] Install Intel QAT drivers (qatlib)
- [ ] Install OpenSSL QAT engine
- [ ] Verify QAT acceleration working
### Driver Verification
- [ ] Run driver verification script
- [ ] Test all network ports
- [ ] Test storage connectivity
- [ ] Test QAT acceleration
- [ ] Document any issues
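
The driver verification script itself is not shown in this checklist. As a rough sketch of one check it might perform on a Linux host (Option B), the hypothetical function below reads `lsmod` output on standard input and reports which expected kernel modules are not loaded. The module names in the usage line (`igb` for i350, `ixgbe` for X550, `igc` for i225, `mpt3sas` for the LSI HBAs) are assumptions based on the upstream Linux drivers for those parts.

```shell
# Hypothetical helper: report expected kernel modules absent from lsmod output.
# Reads `lsmod`-style text on stdin; module names are passed as arguments.
missing_modules() {
  local loaded mod
  loaded=$(awk 'NR > 1 { print $1 }')   # skip the header row, keep module names
  for mod in "$@"; do
    printf '%s\n' "$loaded" | grep -qxF "$mod" || echo "$mod"
  done
}

# Usage on the router server:
#   lsmod | missing_modules igb ixgbe igc mpt3sas
```

An empty result means every listed module is loaded; it does not prove the ports pass traffic, so the port tests above still apply.
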
## Phase 4: Network Configuration
### OpenWrt VM Setup
- [ ] Create OpenWrt VM on Router server
- [ ] Configure OpenWrt network interfaces
- [ ] Configure VLANs (10, 20, 30, 40, 50, 60, 99)
- [ ] Configure mwan3 for 4× Spectrum WAN
- [ ] Configure firewall zones
- [ ] Test multi-WAN failover
- [ ] Configure inter-VLAN routing
### Proxmox VLAN Configuration
- [ ] Configure VLAN bridges on ML110
- [ ] Configure VLAN bridges on R630
- [ ] Test VLAN connectivity
- [ ] Verify VM network isolation
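
One common way to carry the VLANs above on a Proxmox host is a single VLAN-aware bridge. The sketch below prints an `/etc/network/interfaces` stanza for such a bridge; the function name, the bridge name `vmbr0`, and the port name `eno1` in the usage line are placeholders, not values taken from this repository.

```shell
# Hypothetical helper: print an ifupdown2 stanza for a VLAN-aware bridge.
# Arguments: bridge name, physical port, then the VLAN IDs to allow.
render_vlan_bridge() {
  local bridge="$1" port="$2"
  shift 2
  cat <<EOF
auto ${bridge}
iface ${bridge} inet manual
    bridge-ports ${port}
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids $*
EOF
}

# Usage (names are placeholders):
#   render_vlan_bridge vmbr0 eno1 10 20 30 40 50 60 99
```

Review the output against the VLAN schema before merging it into the host's network configuration; VM isolation then depends on the VLAN tag set on each VM's network device.
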
### IP Address Configuration
- [ ] Configure IP addresses per VLAN schema
- [ ] Configure DNS settings
- [ ] Test network connectivity
- [ ] Verify routing between VLANs
## Phase 5: Storage Configuration
### Storage Spaces Direct Setup
- [ ] Verify all shelves detected
- [ ] Create Storage Spaces Direct pools
- [ ] Create volumes for VMs
- [ ] Create volumes for applications
- [ ] Configure storage exports (NFS/iSCSI)
### Proxmox Storage Mounts
- [ ] Configure NFS mounts on ML110
- [ ] Configure NFS mounts on R630
- [ ] Test storage connectivity
- [ ] Verify VM storage access
## Phase 6: Azure Arc Onboarding
### Arc Agent Installation
- [ ] Install Azure Arc agent on Router server (if Linux)
- [ ] Install Azure Arc agent on ML110
- [ ] Install Azure Arc agent on R630
- [ ] Install Azure Arc agent on Windows management VM (if applicable)
### Arc Onboarding
- [ ] Load environment variables from `.env`: `export $(cat .env | grep -v '^#' | xargs)`
- [ ] Configure Azure subscription and resource group (from `.env`)
- [ ] Onboard Router server to Azure Arc
- [ ] Onboard ML110 to Azure Arc
- [ ] Onboard R630 to Azure Arc
- [ ] Verify all resources visible in Azure Portal
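
The `export $(... | xargs)` one-liner in this section splits on whitespace, so it breaks if a value contains spaces. A possible alternative, assuming `.env` is a trusted file with shell-compatible `KEY=value` lines, is to source it with `allexport` enabled. The function name `load_env` is hypothetical.

```shell
# Hypothetical helper: export every assignment in a trusted env file.
load_env() {
  local env_file="${1:-.env}"
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 1; }
  # Prefix bare names with ./ so `.` does not search PATH for the file.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  set -a            # auto-export every variable assigned while sourcing
  . "$env_file"
  set +a
}

# Usage before onboarding (flags per the Azure Connected Machine agent CLI):
#   load_env .env && azcmagent connect \
#     --subscription-id "$AZURE_SUBSCRIPTION_ID" --tenant-id "$AZURE_TENANT_ID" \
#     --resource-group "$AZURE_RESOURCE_GROUP" --location "$AZURE_LOCATION"
```

Because the file is executed as shell, this approach is only appropriate when `.env` is under your control; quoted values such as `NAME="two words"` are then handled correctly.
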
### Arc Governance
- [ ] Configure Azure Policy
- [ ] Enable Azure Monitor
- [ ] Enable Microsoft Defender for Cloud (formerly Azure Defender)
- [ ] Configure Update Management
- [ ] Test policy enforcement
## Phase 7: Cloudflare Integration
### Cloudflare Tunnel Setup
- [ ] Create Cloudflare account (if one does not already exist)
- [ ] Create Zero Trust organization
- [ ] Configure Cloudflare API token in `.env` file
- [ ] Install cloudflared on Ubuntu VM
- [ ] Authenticate cloudflared (interactive or using API token from `.env`)
- [ ] Configure Tunnel for WAC
- [ ] Configure Tunnel for Proxmox UI
- [ ] Configure Tunnel for dashboards
- [ ] Configure Tunnel for Git/CI services
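
Each "Configure Tunnel for ..." item above corresponds to one ingress rule in the `cloudflared` configuration file. The sketch below generates the `ingress:` section from `hostname=origin` pairs; the function name and the hostname and address in the usage line are illustrative placeholders. The only values assumed from the products themselves are Proxmox's default web UI port 8006 and `cloudflared`'s requirement that the last rule be a catch-all.

```shell
# Hypothetical helper: print a cloudflared ingress section.
# Each argument is a "hostname=origin-url" pair.
render_ingress() {
  local pair
  printf 'ingress:\n'
  for pair in "$@"; do
    printf '  - hostname: %s\n' "${pair%%=*}"   # text before the first "="
    printf '    service: %s\n' "${pair#*=}"     # text after the first "="
  done
  printf '  - service: http_status:404\n'       # catch-all rule must come last
}

# Usage (hostname and address are placeholders):
#   render_ingress pve.example.com=https://192.0.2.10:8006
```

The output still needs the `tunnel:` and `credentials-file:` entries for your tunnel, which are specific to the account and are not generated here.
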
### Zero Trust Policies
- [ ] Configure SSO (Microsoft Entra ID, formerly Azure AD, or Okta)
- [ ] Configure MFA requirements
- [ ] Configure device posture checks
- [ ] Configure access policies
- [ ] Test external access
### WAF Configuration
- [ ] Configure WAF rules
- [ ] Test WAF protection
- [ ] Verify that no inbound ports are required
## Phase 8: Service VM Deployment
### Ubuntu VM Templates
- [ ] Create Ubuntu LTS template on Proxmox
- [ ] Install Azure Arc agent in template
- [ ] Configure base packages
- [ ] Create VM snapshots
### Service VM Deployment
- [ ] Deploy Cloudflare Tunnel VM (VLAN 99)
- [ ] Deploy Reverse Proxy VM (VLAN 30/99)
- [ ] Deploy Observability VM (VLAN 40)
- [ ] Deploy CI/CD VM (VLAN 50)
- [ ] Install Azure Arc agents on all VMs
### Service Configuration
- [ ] Configure Cloudflare Tunnel
- [ ] Configure reverse proxy (NGINX/Traefik)
- [ ] Configure observability stack (Prometheus/Grafana)
- [ ] Configure CI/CD (GitLab Runner/Jenkins)
## Phase 9: Verification and Testing
### Network Testing
- [ ] Test all WAN connections
- [ ] Test multi-WAN failover
- [ ] Test VLAN isolation
- [ ] Test inter-VLAN routing
- [ ] Test firewall rules
### Storage Testing
- [ ] Test storage read/write performance
- [ ] Test storage redundancy
- [ ] Test VM storage access
- [ ] Test storage exports
### Service Testing
- [ ] Test Cloudflare Tunnel access
- [ ] Test Azure Arc connectivity
- [ ] Test observability dashboards
- [ ] Test CI/CD pipelines
### Performance Testing
- [ ] Test QAT acceleration
- [ ] Test network throughput
- [ ] Test storage I/O
- [ ] Document performance metrics
## Phase 10: Documentation and Handoff
### Documentation
- [ ] Document all IP addresses
- [ ] Verify `.env` file contains all credentials (stored securely, not in version control)
- [ ] Document cable mappings
- [ ] Document VLAN configurations
- [ ] Document storage allocations
- [ ] Create network diagrams
- [ ] Create runbooks
- [ ] Verify `.env` is in `.gitignore` and not committed to repository
### Monitoring Setup
- [ ] Configure Grafana dashboards
- [ ] Configure Prometheus alerts
- [ ] Configure Azure Monitor alerts
- [ ] Test alerting
### Security Hardening
- [ ] Review firewall rules
- [ ] Review access policies
- [ ] Create RBAC accounts for Proxmox (replace root usage)
  - [ ] Create service accounts for automation
  - [ ] Create operator accounts with appropriate roles
  - [ ] Generate API tokens for service accounts
  - [ ] Document RBAC account usage (see docs/security/proxmox-rbac.md)
- [ ] Review secret management
- [ ] Perform security scan
## Post-Installation Tasks
### Ongoing Maintenance
- [ ] Schedule regular backups
- [ ] Schedule firmware updates
- [ ] Schedule driver updates
- [ ] Schedule OS updates
- [ ] Schedule security patches
### Monitoring
- [ ] Review monitoring dashboards daily
- [ ] Review Azure Arc status
- [ ] Review Cloudflare Tunnel status
- [ ] Review storage health
- [ ] Review network performance
## Troubleshooting Reference
### Common Issues
**Issue:** NIC not detected
- Check PCIe slot connection
- Check BIOS settings
- Update driver
**Issue:** Storage shelves not detected
- Check cable connections
- Check HBA firmware
- Check shelf power
**Issue:** Azure Arc not connecting
- Check network connectivity
- Check proxy settings
- Check Azure credentials
**Issue:** Cloudflare Tunnel not working
- Check cloudflared service
- Check Tunnel configuration
- Check Zero Trust policies
## Related Documentation
- [Complete Architecture](complete-architecture.md) - Full architecture overview
- [Hardware BOM](hardware-bom.md) - Complete bill of materials
- [PCIe Allocation](pcie-allocation.md) - Slot allocation map
- [Network Topology](network-topology.md) - VLAN/IP schema
- [Driver Matrix](driver-matrix.md) - Driver versions