Files
smom-dbis-138/docs/archive/status-reports/operations-legacy/THE_FIX.md

3.1 KiB

The Fix - Deployment Issue Resolution

Problem

  • 7 Failed Clusters: Stopped during Terraform updates (cannot be updated, must be deleted)
  • 16 Canceled Clusters: Deployment interrupted (exist in Azure but not in Terraform state)

Solution

Delete all problematic clusters and recreate with Terraform

Quick Fix (Automated)

Run this single command:

./scripts/azure/fix-deployment-issues.sh

This script will:

  1. Delete all 7 failed clusters
  2. Delete all 16 canceled clusters
  3. Clean Terraform state
  4. Re-run Terraform deployment (recreates all clusters)
  5. Verify deployment status

Estimated Time: 20-40 minutes

Manual Fix (Step-by-Step)

Step 1: Delete Failed Clusters

# List failed clusters
az aks list --subscription fc08d829-4f14-413d-ab27-ce024425db0b \
  --query "[?contains(name, 'az-p-') && provisioningState == 'Failed'].{name:name, rg:resourceGroup}" -o table

# Delete each failed cluster
az aks delete --resource-group <rg> --name <name> --subscription fc08d829-4f14-413d-ab27-ce024425db0b --yes

Step 2: Delete Canceled Clusters

# List canceled clusters
az aks list --subscription fc08d829-4f14-413d-ab27-ce024425db0b \
  --query "[?contains(name, 'az-p-') && provisioningState == 'Canceled'].{name:name, rg:resourceGroup}" -o table

# Delete each canceled cluster
az aks delete --resource-group <rg> --name <name> --subscription fc08d829-4f14-413d-ab27-ce024425db0b --yes

Step 3: Re-run Terraform

cd terraform/well-architected/cloud-sovereignty
terraform init -upgrade
terraform apply -parallelism=128 -auto-approve

Why This Works

  1. Failed Clusters: Are in terminal "stopped state" - cannot be updated, must be deleted
  2. Canceled Clusters: Cause state mismatch - deleting ensures clean recreation
  3. Recreation: Terraform will create all clusters fresh with correct configuration
  4. Clean State: No import complexity, consistent configuration

Prevention

After fix, add these safeguards:

  1. Check Power State Before Updates:

    az aks show --resource-group <rg> --name <name> --query powerState
    
  2. Prevent Manual Stops: Lock resource groups or use policies

  3. State Management: Use remote state backend (Azure Storage)

  4. Monitoring: Watch for stopped clusters during deployment

Expected Outcome

After fix:

  • All 24 clusters created successfully
  • All clusters in "Succeeded" state
  • Terraform state matches Azure reality
  • Ready for next deployment steps (Kubernetes, Besu, contracts)

Verification

After fix completes, verify:

# Check cluster status
az aks list --subscription fc08d829-4f14-413d-ab27-ce024425db0b \
  --query "[?contains(name, 'az-p-')].{name:name, state:provisioningState, power:powerState.code}" -o table

# Expected: All should show "Succeeded" state and "Running" power

Next Steps After Fix

Once all clusters are ready:

./scripts/deployment/wait-and-run-all-next-steps.sh

This will:

  1. Configure Kubernetes
  2. Deploy Besu network
  3. Deploy smart contracts
  4. Set up monitoring