Update documentation structure and enhance .gitignore
- Added generated index files and report directories to .gitignore to prevent unnecessary tracking of transient files.
- Updated README links to reflect new documentation paths for better navigation.
- Improved documentation organization by ensuring all links point to the correct locations, enhancing user experience and accessibility.
221
docs/deployment/DEPLOYMENT_NEXT_STEPS.md
Normal file
@@ -0,0 +1,221 @@
# Deployment Next Steps

**Date**: 2025-12-09
**Status**: ⚠️ **LOCK ISSUE - MANUAL RESOLUTION REQUIRED**

---

## Current Situation

### ✅ Completed
1. **Provider Configuration**: ✅ Verified and working
2. **VM Resource Created**: ✅ basic-vm-001 (VMID 100)
3. **Deployment Initiated**: ✅ VM created in Proxmox

### ⚠️ Blocking Issue
**VM Lock Timeout**: Configuration update blocked by Proxmox lock file

**Error**: `can't lock file '/var/lock/qemu-server/lock-100.conf' - got timeout`

---

## Immediate Action Required

### Step 1: Resolve Lock on Proxmox Node

**Access the Proxmox node and clear the lock:**

```bash
# Connect to Proxmox node (replace with actual IP/hostname)
ssh root@<proxmox-node-ip>

# Check VM status
qm status 100

# Unlock the VM
qm unlock 100

# If unlock doesn't work, remove the lock file
# (first make sure no other task is still running against VM 100)
rm -f /var/lock/qemu-server/lock-100.conf

# Verify the lock is cleared (expect "No such file or directory")
ls -la /var/lock/qemu-server/lock-100.conf
```

**Note**: If you don't have direct SSH access, you may need to:
- Use the Proxmox web UI
- Access the node via console
- Use another method to access the node
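
The manual steps above can be wrapped in a small helper so the fallback only runs when `qm unlock` fails. This is a minimal sketch, not part of the provider: `clear_vm_lock` and the `LOCK_DIR` override are hypothetical names introduced here for illustration, and the function is meant to be sourced and called as `clear_vm_lock 100` by root on the Proxmox node.

```bash
# Sketch only: clear_vm_lock is a hypothetical helper, meant to run as root on the Proxmox node.
# LOCK_DIR is an assumed override for dry runs; it defaults to the directory from the error message.
clear_vm_lock() {
  local vmid="$1"
  local lock_file="${LOCK_DIR:-/var/lock/qemu-server}/lock-${vmid}.conf"

  # Preferred path: ask Proxmox to release the lock.
  if qm unlock "$vmid"; then
    echo "unlocked VM ${vmid} via qm unlock"
    return 0
  fi

  # Fallback from the manual steps: remove the lock file if it is still present.
  if [ -e "$lock_file" ]; then
    rm -f "$lock_file" && echo "removed ${lock_file}"
  else
    echo "qm unlock failed and no lock file found at ${lock_file}" >&2
    return 1
  fi
}
```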

### Step 2: Verify Image Availability

**While on the Proxmox node, verify the image exists:**

```bash
# Check for image
find /var/lib/vz/template/iso -name "ubuntu-22.04-cloud.img"
pvesm list local-lvm | grep ubuntu-22.04-cloud

# If missing, download it
cd /var/lib/vz/template/iso
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
mv jammy-server-cloudimg-amd64.img ubuntu-22.04-cloud.img
```
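
A truncated download can look like a valid image file, so it may be worth checking the file's SHA-256 before relying on it. The sketch below assumes `sha256sum` is available on the node; `verify_sha256` is a hypothetical helper name, and the expected hash has to be copied by hand from the `SHA256SUMS` file published next to the image. A call would look like `verify_sha256 ubuntu-22.04-cloud.img <expected-hash>`.

```bash
# verify_sha256 FILE EXPECTED_HEX: succeed only if FILE's SHA-256 digest equals EXPECTED_HEX.
# Hypothetical helper for illustration; EXPECTED_HEX must come from the publisher's SHA256SUMS file.
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<hex digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}
```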

### Step 3: Monitor Automatic Retry

**After clearing the lock, the provider will automatically retry:**

```bash
# Watch VM status
kubectl get proxmoxvm basic-vm-001 -w

# Watch provider logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50 -f
```
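
Watching by hand works for a single VM; for scripted runs, a bounded poll is easier to reason about. The following is a rough sketch under two assumptions: that the `proxmoxvm` resource exposes a standard `status.conditions` entry of type `Ready`, and that 300 seconds is a sensible ceiling. `wait_for` and `vm_is_ready` are hypothetical helper names; a call might look like `wait_for 300 10 vm_is_ready basic-vm-001`.

```bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...: poll COMMAND until it succeeds or time runs out.
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# vm_is_ready NAME: true when the resource's Ready condition reports "True".
# Assumes the resource publishes status.conditions with a Ready entry.
vm_is_ready() {
  local status
  status="$(kubectl get proxmoxvm "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```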

**Expected Timeline**: 1-5 minutes after lock is cleared

---

## After Lock Resolution

### Expected Sequence

1. **Provider retries** configuration update (automatic)
2. **VM configuration** completes successfully
3. **Image import** (if needed) completes
4. **Boot order** set correctly
5. **Cloud-init** configured
6. **VM boots** successfully
7. **VM reaches "running" state**
8. **IP address assigned**
9. **Ready condition becomes "True"**

### Verification Steps

Once the VM is running:

```bash
# Get VM IP
IP=$(kubectl get proxmoxvm basic-vm-001 -o jsonpath='{.status.networkInterfaces[0].ipAddress}')

# Check cloud-init logs (last 50 lines)
ssh "admin@$IP" "tail -n 50 /var/log/cloud-init-output.log"

# Verify services
ssh "admin@$IP" "systemctl status qemu-guest-agent chrony unattended-upgrades"

# Test SSH access
ssh "admin@$IP" "hostname && uptime"
```
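
The `jsonpath` query prints an empty string until an address has been reported, in which case the `ssh` commands fail with a confusing error. A small guard can catch that first. This sketch assumes the VM receives an IPv4 address; `is_ipv4` is a hypothetical helper, and it could be used as `is_ipv4 "$IP" || echo "no IP assigned yet" >&2` before any `ssh` call.

```bash
# is_ipv4 STRING: true if STRING is a dotted-quad IPv4 address with each octet in 0-255.
# Hypothetical helper for illustration; it does not handle IPv6.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  # Split on dots and range-check each octet (10# forces base 10 for values like 08).
  local IFS=.
  for octet in $ip; do
    [ "$((10#$octet))" -le 255 ] || return 1
  done
}
```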

---

## If Lock Resolution Fails

### Alternative: Delete and Redeploy

If the lock cannot be cleared:

```bash
# 1. Delete Kubernetes resource
kubectl delete proxmoxvm basic-vm-001

# 2. On Proxmox node, force delete VM
ssh root@<proxmox-node> "qm destroy 100 --purge --skiplock"

# 3. Clean up locks
ssh root@<proxmox-node> "rm -f /var/lock/qemu-server/lock-100.conf"

# 4. Wait for cleanup
sleep 10

# 5. Redeploy
kubectl apply -f examples/production/basic-vm.yaml
```
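
A fixed `sleep 10` in step 4 may be too short or needlessly long. One alternative is to poll until the VM is really gone. The sketch below is meant to run on the Proxmox node and assumes that `qm status <vmid>` exits non-zero once the VM no longer exists; `wait_until_vm_gone` is a hypothetical helper, called for example as `wait_until_vm_gone 100`.

```bash
# wait_until_vm_gone VMID [TRIES] [DELAY_SECONDS]: poll until `qm status` fails for VMID.
# Assumption: `qm status` returns a non-zero exit code for a VMID that does not exist.
wait_until_vm_gone() {
  local vmid="$1" tries="${2:-12}" delay="${3:-5}" i
  for (( i = 1; i <= tries; i++ )); do
    if ! qm status "$vmid" >/dev/null 2>&1; then
      return 0   # VM no longer known to Proxmox
    fi
    sleep "$delay"
  done
  return 1       # still present after all attempts
}
```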

---

## Long-term Solutions

### 1. Code Enhancement

**Add lock handling to provider code:**

- Detect lock errors in `UpdateVM`
- Automatically call `qm unlock` before retry
- Increase timeout for lock operations
- Add exponential backoff for lock retries

**File**: `crossplane-provider-proxmox/pkg/proxmox/client.go`
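
The detection and backoff logic listed above can be prototyped in shell before it is ported to the provider. This is only a sketch of the control flow, not the provider's Go code: `is_lock_error`, `backoff_delay`, `retry_on_lock`, and the `BACKOFF_BASE` variable are hypothetical names, and the error match is based solely on the message quoted in this document. Usage would be along the lines of `retry_on_lock 5 some_update_command`.

```bash
# is_lock_error MESSAGE: true if MESSAGE looks like the lock-timeout error quoted in this document.
is_lock_error() {
  [[ "$1" == *"can't lock file"*"got timeout"* ]]
}

# backoff_delay ATTEMPT [BASE] [MAX]: print BASE * 2^(ATTEMPT-1) seconds, capped at MAX.
backoff_delay() {
  local attempt="$1" base="${2:-2}" max="${3:-60}"
  local delay=$(( base * (1 << (attempt - 1)) ))
  (( delay > max )) && delay="$max"
  echo "$delay"
}

# retry_on_lock MAX_ATTEMPTS COMMAND...: rerun COMMAND while it fails with a lock error.
retry_on_lock() {
  local max_attempts="$1" attempt output
  shift
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if output="$("$@" 2>&1)"; then
      echo "$output"
      return 0
    fi
    # Failures that are not lock errors are surfaced immediately, without retrying.
    is_lock_error "$output" || { echo "$output" >&2; return 1; }
    (( attempt < max_attempts )) && sleep "$(backoff_delay "$attempt" "${BACKOFF_BASE:-2}")"
  done
  echo "$output" >&2
  return 1
}
```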

### 2. Pre-deployment Checks

**Add validation before VM creation:**

- Check for existing locks on target node
- Verify no conflicting operations
- Ensure Proxmox node is healthy
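
One of these checks can be sketched with standard tools. The snippet below looks for a `lock:` entry in the output of `qm config`, which is the kind of lock that `qm unlock` clears; it does not detect a held file lock under `/var/lock/qemu-server`. `config_lock_reason` is a hypothetical helper that reads the config text on stdin, for example `qm config 100 | config_lock_reason`.

```bash
# config_lock_reason: read `qm config` output on stdin and print the lock value, if any.
# Exits 0 when a "lock:" entry is present, 1 otherwise. Hypothetical helper for illustration.
config_lock_reason() {
  awk -F': *' '$1 == "lock" { print $2; found = 1 } END { exit !found }'
}
```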

### 3. Deployment Strategy

**For full deployment:**

- Deploy VMs sequentially (not in parallel)
- Add delays between deployments (30-60 seconds)
- Monitor each deployment before proceeding
- Implement retry logic with lock handling
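
A sequential rollout with a pause between VMs could look roughly like this. It is a sketch under assumptions: `deploy_sequentially` and the `DEPLOY_DELAY` variable are hypothetical names, the 45-second default is simply the midpoint of the range above, and monitoring of each VM is left out. It would be called with manifest paths as arguments, for example `deploy_sequentially first.yaml second.yaml`, where the file names are placeholders.

```bash
# deploy_sequentially MANIFEST...: apply manifests one at a time, pausing between them.
# DEPLOY_DELAY (seconds) is an assumed tuning knob; it defaults to 45.
deploy_sequentially() {
  local delay="${DEPLOY_DELAY:-45}" manifest remaining=$#
  for manifest in "$@"; do
    echo "applying ${manifest}"
    # Stop at the first failure so later VMs are not stacked on a broken state.
    kubectl apply -f "$manifest" || { echo "failed on ${manifest}, stopping" >&2; return 1; }
    remaining=$(( remaining - 1 ))
    # Pause between deployments, but not after the last one.
    [ "$remaining" -gt 0 ] && sleep "$delay"
  done
  return 0
}
```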

---

## Full Deployment Plan (After Test Success)

### Phase 1: Infrastructure (2 VMs)
- VM 1: nginx-proxy-vm.yaml
- VM 2: cloudflare-tunnel-vm.yaml

### Phase 2: SMOM-DBIS-138 Core (8 VMs)
- VMs 3-6: validator-01 through validator-04
- VMs 7-10: sentry-01 through sentry-04

### Phase 3: SMOM-DBIS-138 Services (8 VMs)
- VMs 11-14: rpc-node-01 through rpc-node-04
- VM 15: services.yaml
- VM 16: blockscout.yaml
- VM 17: monitoring.yaml
- VM 18: management.yaml

### Phase 4: Phoenix VMs (8 VMs)
- VMs 19-26: All Phoenix VMs

### Phase 5: Template VMs (2 VMs - Optional)
- VM 27: medium-vm.yaml
- VM 28: large-vm.yaml

**Total**: 28 additional VMs after test VM

---

## Summary

### Current Status
- ✅ Provider: Working
- ✅ VM Created: Yes (VMID 100)
- ⚠️ Configuration: Blocked by lock
- ⚠️ State: Stopped

### Required Action
**Manual lock resolution on Proxmox node**

### After Resolution
- Provider will automatically retry
- VM should complete configuration
- VM should boot successfully
- Full deployment can proceed

---

**Last Updated**: 2025-12-09
**Status**: ⚠️ **WAITING FOR MANUAL LOCK RESOLUTION**