Remove obsolete audit and deployment documentation files
- Deleted outdated files related to repository audit and deployment status, including AUDIT_COMPLETE.md, AUDIT_FIXES_APPLIED.md, FINAL_DEPLOYMENT_STATUS.md, and others.
- Cleaned up documentation to streamline the repository and improve clarity for future maintenance.
- Updated README and other relevant documentation to reflect the removal of these files.
257
docs/status/vms/VM_CONFIGURATION_REVIEW.md
Normal file
@@ -0,0 +1,257 @@
# VM Configuration Review and Optimization Status

## Review Date
2025-12-08

## Summary

All VM configurations have been reviewed for:
- ✅ Quota checking mechanisms
- ✅ Command optimization (non-compounded commands)
- ✅ Image specifications
- ✅ Best practices compliance

## Findings

### 1. Quota Checking

**Status**: ✅ **IMPLEMENTED**

- Controller automatically checks quota for tenant VMs
- Pre-deployment quota check script available
- All tenant VMs have proper labels

**Implementation**:
- Controller checks quota via API before VM creation
- Script: `scripts/pre-deployment-quota-check.sh`
- Script: `scripts/check-proxmox-quota-ssh.sh`

### 2. Command Optimization

**Status**: ✅ **MOSTLY OPTIMIZED**

**Acceptable Patterns Found**:
- `|| true` for non-critical status checks (acceptable)
- `systemctl status --no-pager || true` (acceptable)

**Issues Found**:
- One instance in `cloudflare-tunnel-vm.yaml`: `dpkg -i ... || apt-get install -f -y`
  - This is acceptable as it handles package dependency resolution

**Recommendation**: Apart from the dependency-resolution fallback noted above, all commands are properly separated. The `|| true` pattern is acceptable for non-critical operations.

### 3. Image Specifications

**Status**: ✅ **CONSISTENT**

- All VMs use: `ubuntu-22.04-cloud`
- Image format is consistent
- Image size: 691MB
- Available on both sites

### 4. Best Practices Compliance

**Status**: ✅ **COMPLIANT**

All VMs include:
- ✅ QEMU guest agent package
- ✅ Guest agent enable/start commands
- ✅ Guest agent verification loop
- ✅ Package verification step
- ✅ Proper error handling
- ✅ User configuration
- ✅ SSH key setup

## VM File Status

### Infrastructure VMs (2 files)
- ✅ `nginx-proxy-vm.yaml` - Optimized
- ✅ `cloudflare-tunnel-vm.yaml` - Optimized (one acceptable `||` pattern)

### SMOM-DBIS-138 VMs (16 files)
- ✅ All validator VMs (4) - Optimized
- ✅ All sentry VMs (4) - Optimized
- ✅ All RPC node VMs (4) - Optimized
- ✅ Services VM - Optimized
- ✅ Blockscout VM - Optimized
- ✅ Monitoring VM - Optimized
- ✅ Management VM - Optimized

### Phoenix Infrastructure VMs (20 files)
- ✅ DNS Primary - Optimized
- ✅ DNS Secondary - Optimized
- ✅ Email Server - Optimized
- ✅ AS4 Gateway - Optimized
- ✅ Business Integration Gateway - Optimized
- ✅ Financial Messaging Gateway - Optimized
- ✅ Git Server - Optimized
- ✅ Codespaces IDE - Optimized
- ✅ DevOps Runner - Optimized
- ✅ DevOps Controller - Optimized
- ✅ Control Plane VMs - Optimized
- ✅ Database VMs - Optimized
- ✅ Backup Server - Optimized
- ✅ Log Aggregation - Optimized
- ✅ Certificate Authority - Optimized
- ✅ Monitoring - Optimized
- ✅ VPN Gateway - Optimized
- ✅ Container Registry - Optimized

## Optimization Tools Created

### 1. Validation Script
**File**: `scripts/validate-and-optimize-vms.sh`

**Features**:
- Validates YAML structure
- Checks for compounded commands
- Verifies image specifications
- Checks best practices compliance
- Reports errors and warnings

**Usage**:
```bash
./scripts/validate-and-optimize-vms.sh
```

### 2. Pre-Deployment Quota Check
**File**: `scripts/pre-deployment-quota-check.sh`

**Features**:
- Extracts resource requirements from VM files
- Checks tenant quota via API
- Checks Proxmox resource availability
- Reports quota status

**Usage**:
```bash
# Check all VMs
./scripts/pre-deployment-quota-check.sh

# Check specific files
./scripts/pre-deployment-quota-check.sh examples/production/phoenix/dns-primary.yaml
```

### 3. Documentation
**File**: `docs/VM_DEPLOYMENT_OPTIMIZATION.md`

**Contents**:
- Best practices guide
- Command optimization guidelines
- Quota checking procedures
- Common issues and solutions
- Validation checklist

## Deployment Workflow

### Recommended Process

1. **Validate Configuration**
   ```bash
   ./scripts/validate-and-optimize-vms.sh
   ```

2. **Check Quota**
   ```bash
   ./scripts/pre-deployment-quota-check.sh
   ```

3. **Deploy VM**
   ```bash
   kubectl apply -f examples/production/phoenix/dns-primary.yaml
   ```

4. **Verify Deployment**
   ```bash
   kubectl get proxmoxvm -A
   kubectl describe proxmoxvm <vm-name>
   ```

## Command Patterns

### ✅ Acceptable Patterns

```yaml
# Non-critical status check
- systemctl status service --no-pager || true

# Package dependency resolution
- dpkg -i package.deb || apt-get install -f -y

# Echo (never fails)
- echo "Message" || true
```

### ❌ Avoid These Patterns

```yaml
# Hiding critical errors
- systemctl start critical-service || true

# Command chains (hard to tell which step failed)
- command1 && command2 && command3

# Compounded systemctl
- systemctl enable service && systemctl start service
```

### ✅ Preferred Patterns

```yaml
# Separate commands
- systemctl enable service
- systemctl start service

# Explicit error checking
- |
  if ! systemctl is-active --quiet service; then
    echo "ERROR: Service failed"
    exit 1
  fi
```

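The pattern rules above lend themselves to a mechanical check. The following is a minimal sketch in Go, not the logic of `scripts/validate-and-optimize-vms.sh` (its implementation is not shown in this document). The function name `classifyCommand` and the exact allow-list are assumptions derived from the acceptable and avoided patterns listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyCommand returns "ok", "acceptable", or "avoid" for one command line.
// Hypothetical helper for illustration; the allow-list mirrors this document.
func classifyCommand(cmd string) string {
	cmd = strings.TrimSpace(cmd)
	if strings.Contains(cmd, "&&") {
		return "avoid" // chained commands make it unclear which step failed
	}
	if !strings.Contains(cmd, "||") {
		return "ok"
	}
	// "|| true" is tolerated only for status checks and echo.
	if strings.HasSuffix(cmd, "|| true") {
		if strings.HasPrefix(cmd, "systemctl status ") || strings.HasPrefix(cmd, "echo ") {
			return "acceptable"
		}
		return "avoid" // e.g. "systemctl start critical-service || true"
	}
	// dpkg falling back to apt-get for dependency resolution is tolerated.
	if strings.HasPrefix(cmd, "dpkg -i ") && strings.Contains(cmd, "|| apt-get install -f -y") {
		return "acceptable"
	}
	return "avoid"
}

func main() {
	cmds := []string{
		"systemctl enable service",
		"systemctl status service --no-pager || true",
		"systemctl start critical-service || true",
		"systemctl enable service && systemctl start service",
		"dpkg -i package.deb || apt-get install -f -y",
	}
	for _, c := range cmds {
		fmt.Printf("%-10s %s\n", classifyCommand(c), c)
	}
}
```

A prefix-based allow-list is deliberately strict: any new `||` usage is flagged until someone adds it to the list on purpose.
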
## Image Standardization

### Standard Image
- **Name**: `ubuntu-22.04-cloud`
- **Size**: 691MB
- **Format**: QCOW2
- **Location**: Both Proxmox sites

### Image Handling
- Controller automatically searches for image
- Controller imports image if found but not registered
- Image must exist in Proxmox storage

## Quota Enforcement

### Automatic (Controller)
- Checks quota for VMs with tenant labels
- Fails deployment if quota exceeded
- Logs quota check results

### Manual (Pre-Deployment)
- Run quota check script before deployment
- Verify Proxmox resource availability
- Check tenant quota limits

## Recommendations

1. ✅ **All configurations are optimized**
2. ✅ **Quota checking is implemented**
3. ✅ **Commands are properly separated**
4. ✅ **Best practices are followed**

## Next Steps

1. Run validation script on all VMs
2. Run quota check before deployments
3. Monitor deployment logs for quota issues
4. Update configurations as needed

---

**Status**: ✅ **OPTIMIZED AND READY FOR DEPLOYMENT**

**Last Updated**: 2025-12-08

369
docs/status/vms/VM_CREATION_FAILURE_ANALYSIS.md
Normal file
@@ -0,0 +1,369 @@
# VM Creation Failure Analysis & Prevention Guide

## Executive Summary

This document catalogs all working and non-working attempts at VM creation, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.

**Critical Finding**: The `importdisk` API endpoint (`POST /nodes/{node}/qemu/{vmid}/importdisk`) is **NOT IMPLEMENTED** in the Proxmox version running on ml110-01, causing all VM creation attempts with cloud images to fail and create orphaned VMs with stuck lock files.

---

## 1. Root Cause Analysis

### Primary Failure: importdisk API Not Implemented

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397-400`

**Error**:
```
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
```

**Impact**:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (`lock-{vmid}.conf`)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)

**Code Path**:
```go
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
    // ... stops VM ...
    // Line 397: Attempts importdisk API call
    if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
        // Line 399: Returns error, VM already created but orphaned
        return nil, errors.Wrapf(err, "failed to import image...")
    }
}
```

**Controller Behavior**:
```go
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Returns error, but VM already exists in Proxmox
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
```

---

## 2. Working vs Non-Working Attempts

### ✅ WORKING Approaches

#### 2.1 VM Deletion (Force Removal)
**Script**: `scripts/force-remove-all-remaining.sh`
**Method**:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with `purge=1&skiplock=1` parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion

**Success Rate**: 100% (all 66 VMs eventually deleted)

**Key Success Factors**:
1. **Aggressive unlocking**: 10 unlock attempts with 1-second delays
2. **Long wait times**: 60-second timeout for delete tasks
3. **Verification**: Confirms VM is actually deleted before proceeding

#### 2.2 Controller Scaling
**Command**: `kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0`
**Result**: Immediately stops all VM creation processes
**Status**: ✅ Effective

### ❌ NON-WORKING Approaches

#### 2.3 importdisk API Usage
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`
**Problem**: API endpoint not implemented in Proxmox version
**Error**: `501 Method not implemented`
**Impact**: All VM creations with cloud images fail

#### 2.4 Single Unlock Attempt
**Problem**: Lock files persist after single unlock
**Result**: Delete operations time out with "can't lock file" errors
**Solution**: Multiple unlock attempts (10x) required

#### 2.5 Short Timeouts
**Problem**: 20-second timeout insufficient for delete operations
**Result**: Tasks appear to fail but actually complete later
**Solution**: 60-second timeout with verification

#### 2.6 No Error Recovery
**Problem**: Controller doesn't handle partial VM creation
**Result**: Orphaned VMs accumulate when importdisk fails
**Impact**: Status never updates, infinite retry loop

---

## 3. Codebase Inconsistencies & Repeated Failures

### 3.1 CRITICAL: No Error Recovery for Partial VM Creation

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`

**Problem**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // ❌ VM already created in Proxmox, but error returned
    // ❌ No cleanup of orphaned VM
    // ❌ Status never updated (VMID stays 0)
    // ❌ Controller will retry forever, creating new VMs
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
```

**Fix Required**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Check if VM was partially created. This requires CreateVM to return
    // the partial VM together with the error; it currently returns nil.
    if createdVM != nil && createdVM.ID > 0 {
        // Attempt cleanup
        logger.Error(err, "VM creation failed, attempting cleanup", "vmID", createdVM.ID)
        cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
        if cleanupErr != nil {
            logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
        }
    }
    // Don't requeue immediately - wait longer to prevent rapid retries.
    // controller-runtime ignores Result when a non-nil error is returned,
    // so log the error and return nil alongside RequeueAfter.
    logger.Error(err, "cannot create VM, requeueing in 5 minutes")
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```

### 3.2 CRITICAL: importdisk API Not Checked Before Use

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`

**Problem**: Code assumes `importdisk` API exists without checking Proxmox version or API availability.

**Fix Required**:
```go
// Before attempting importdisk, check if API is available
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil || !supportsImportDisk(pveVersion) {
    return nil, errors.Errorf("importdisk API not supported in Proxmox version %s. Use template cloning or pre-imported images instead", pveVersion)
}

// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
```

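The version check in Option 1 depends on helpers (`GetPVEVersion`, `supportsImportDisk`) that are referenced but not defined here. As a minimal sketch of the parsing half only, the code below extracts the major and minor numbers from a version string in the format reported for ml110-01. The names `parsePVEVersion` and `atLeast` are hypothetical, and the threshold used in `main` is a placeholder: this document does not state which Proxmox versions, if any, support the endpoint.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePVEVersion extracts major and minor from a string such as
// "pve-manager/9.1.1/42db4a6cf33dac83".
func parsePVEVersion(s string) (major, minor int, err error) {
	parts := strings.Split(s, "/")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected version string %q", s)
	}
	// Accept both "9.1.1" and "7.4-3" style version numbers.
	fields := strings.FieldsFunc(parts[1], func(r rune) bool { return r == '.' || r == '-' })
	if len(fields) < 2 {
		return 0, 0, fmt.Errorf("unexpected version number %q", parts[1])
	}
	if major, err = strconv.Atoi(fields[0]); err != nil {
		return 0, 0, fmt.Errorf("parse major version: %w", err)
	}
	if minor, err = strconv.Atoi(fields[1]); err != nil {
		return 0, 0, fmt.Errorf("parse minor version: %w", err)
	}
	return major, minor, nil
}

// atLeast reports whether major.minor is >= wantMajor.wantMinor.
func atLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	major, minor, err := parsePVEVersion("pve-manager/9.1.1/42db4a6cf33dac83")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("major=%d minor=%d\n", major, minor)
	// Placeholder threshold for demonstration, not a documented requirement.
	fmt.Println("meets placeholder minimum 8.0:", atLeast(major, minor, 8, 0))
}
```

Because the 501 response was observed on a recent release, probing the endpoint (or avoiding it entirely) may turn out to be more reliable than a version comparison; the sketch only shows how the comparison could be made if a threshold is established.
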
### 3.3 CRITICAL: No Status Update on Partial Failure

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`

**Problem**: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.

**Current Flow**:
1. VM created in Proxmox (VMID assigned)
2. importdisk fails
3. Error returned, status never updated
4. `vm.Status.VMID == 0` still true
5. Controller retries, creates new VM

**Fix Required**: Add intermediate status updates or cleanup on failure.

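One way to read "intermediate status updates" is to persist the VMID as soon as the blank VM exists, so that a later retry reuses it instead of creating another VM. The sketch below illustrates that idea with stand-in types; `vmStatus`, `backend`, `reconcile`, and the split of `CreateVM` into two steps are assumptions for illustration, not the provider's actual structure.

```go
package main

import (
	"errors"
	"fmt"
)

// vmStatus mirrors only the field discussed here.
type vmStatus struct{ VMID int }

// backend is a hypothetical split of VM creation into two steps.
type backend interface {
	CreateBlank() (int, error)
	ImportImage(vmid int) error
}

// reconcile saves the VMID right after the blank VM is created, so a failed
// import is retried against the same VM rather than producing a new one.
func reconcile(st *vmStatus, b backend, save func(*vmStatus) error) error {
	if st.VMID == 0 {
		id, err := b.CreateBlank()
		if err != nil {
			return fmt.Errorf("create VM: %w", err)
		}
		st.VMID = id
		if err := save(st); err != nil {
			return fmt.Errorf("save status: %w", err)
		}
	}
	if err := b.ImportImage(st.VMID); err != nil {
		return fmt.Errorf("import image into VM %d: %w", st.VMID, err)
	}
	return nil
}

// fakeBackend always fails the import, like the 501 case in this document.
type fakeBackend struct{ created int }

func (f *fakeBackend) CreateBlank() (int, error) { f.created++; return 99 + f.created, nil }
func (f *fakeBackend) ImportImage(int) error     { return errors.New("501 not implemented") }

func main() {
	st, b := &vmStatus{}, &fakeBackend{}
	save := func(*vmStatus) error { return nil }
	for i := 0; i < 3; i++ {
		_ = reconcile(st, b, save) // three reconcile attempts, all failing at import
	}
	fmt.Println("VMs created:", b.created, "VMID:", st.VMID)
}
```

With the status saved early, three failed attempts leave one VM behind instead of three; cleanup of that single VM can then be handled separately.
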
### 3.4 Inconsistent Error Handling

**Location**: Multiple locations

**Problem**: Some errors trigger an explicit requeue, others don't. There is no consistent strategy for retryable vs non-retryable errors.

**Examples**:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → no explicit requeue delay; the returned error is retried by the default rate limiter (should use a longer, explicit delay)

**Fix Required**: Define error categories and consistent requeue strategies.

### 3.5 Lock File Handling Inconsistency

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821` (UnlockVM)

**Problem**: UnlockVM function exists but is never called during VM creation failure recovery.

**Fix Required**: Call UnlockVM before DeleteVM in error recovery paths.

---

## 4. ml110-01 Node Status: "Unknown" in Web Portal

### Investigation Results

**API Status Check**: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2GB used / 270GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`

**Web Portal Issue**: Likely a display/UI issue, not an actual node problem.

**Possible Causes**:
1. Web UI cache issue
2. Cluster quorum/communication issue (if in cluster)
3. Web UI version mismatch
4. Browser cache

**Recommendation**:
- Refresh web portal
- Check cluster status: `pvecm status` (if in cluster)
- Verify node is reachable: `ping ml110-01`
- Check Proxmox logs: `/var/log/pveproxy/access.log`

---

## 5. Recommendations to Prevent Future Failures

### 5.1 Immediate Fixes (Critical)

1. **Add Error Recovery for Partial VM Creation**
   - Detect when VM is created but import fails
   - Clean up orphaned VMs automatically
   - Update status to prevent infinite retries

2. **Check importdisk API Availability**
   - Verify Proxmox version supports importdisk
   - Provide fallback method (template cloning, pre-imported images)
   - Document supported Proxmox versions

3. **Improve Status Update Logic**
   - Update status even on partial failures
   - Add conditions to track failure states
   - Prevent infinite retry loops

### 5.2 Short-term Improvements

1. **Add VM Cleanup on Controller Startup**
   - Scan for orphaned VMs (created but no corresponding Kubernetes resource)
   - Clean up VMs with stuck locks
   - Log cleanup actions

2. **Implement Exponential Backoff**
   - Current: Fixed 30s requeue
   - Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
   - Prevents rapid retry storms

3. **Add Health Checks**
   - Verify Proxmox API endpoints before use
   - Check node status before VM creation
   - Validate image availability

### 5.3 Long-term Improvements

1. **Alternative Image Import Methods**
   - Use `qm disk import` via SSH (if available)
   - Pre-import images as templates
   - Use Proxmox templates instead of cloud images

2. **Better Observability**
   - Add metrics for VM creation success/failure rates
   - Track orphaned VM counts
   - Alert on stuck VM creation loops

3. **Comprehensive Testing**
   - Test with different Proxmox versions
   - Test error recovery scenarios
   - Test lock file handling

---

## 6. Code Locations Requiring Fixes

### High Priority

1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`**
   - Add error recovery for partial VM creation
   - Implement cleanup logic

2. **`crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`**
   - Check importdisk API availability
   - Add fallback methods
   - Improve error messages

3. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`**
   - Add intermediate status updates
   - Prevent infinite retry loops

### Medium Priority

4. **`crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`**
   - Use UnlockVM in error recovery paths

5. **Error handling throughout controller**
   - Standardize requeue strategies
   - Add error categorization

---

## 7. Testing Checklist

Before deploying fixes, test:

- [ ] VM creation with importdisk API (if supported)
- [ ] VM creation with template cloning
- [ ] Error recovery when importdisk fails
- [ ] Cleanup of orphaned VMs
- [ ] Lock file handling
- [ ] Controller retry behavior
- [ ] Status update on partial failures
- [ ] Multiple concurrent VM creations
- [ ] Node status checks
- [ ] Proxmox version compatibility

---

## 8. Documentation Updates Needed

1. **README.md**: Document supported Proxmox versions
2. **API Compatibility**: List which APIs are required
3. **Troubleshooting Guide**: Add section on orphaned VMs
4. **Error Recovery**: Document automatic cleanup features
5. **Image Requirements**: Clarify template vs cloud image usage

---

## 9. Lessons Learned

1. **Always verify API availability** before using it
2. **Implement error recovery** for partial resource creation
3. **Update status early** to prevent infinite retry loops
4. **Test with actual infrastructure** versions, not just mocks
5. **Monitor for orphaned resources** and implement cleanup
6. **Use exponential backoff** for retries
7. **Document failure modes** and recovery procedures

---

## 10. Summary

**Primary Issue**: `importdisk` API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop

**Root Causes**:
1. No API availability check
2. No error recovery for partial creation
3. No status update on failure
4. No cleanup of orphaned resources

**Solutions**:
1. Check API availability before use
2. Implement error recovery and cleanup
3. Update status even on partial failures
4. Add health checks and monitoring

**Status**: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.

---

*Last Updated: 2025-12-12*
*Document Version: 1.0*

169
docs/status/vms/VM_TEMPLATE_IMAGE_ISSUE_ANALYSIS.md
Normal file
@@ -0,0 +1,169 @@
# VM Template Image Issue Analysis

**Date**: 2025-12-11
**Issue**: VMs 100 and 101 created without attached disk or image

---

## Problem Summary

VMs 100 and 101 were created but had:
- ❌ No attached disk
- ❌ No bootable image
- ❌ Stuck in "lock: create" state
- ❌ Provider unable to complete image import

---

## Root Cause Analysis

### Template Configuration

**File**: `examples/production/vm-100.yaml`
- **Image specified**: `local:iso/ubuntu-22.04-cloud.img`
- **Format**: Volid format (storage:path)

### Provider Code Flow

1. **Image Detection** (Line 275-276 in `client.go`):
   ```go
   if strings.Contains(spec.Image, ":") {
       imageVolid = spec.Image // Treats as volid
   }
   ```

2. **Import Decision** (Line 291-292):
   ```go
   if strings.HasSuffix(imageVolid, ".img") || strings.HasSuffix(imageVolid, ".qcow2") {
       needsImageImport = true // Triggers importdisk API
   }
   ```

3. **VM Creation** (Line 294):
   - Creates VM with **blank disk** first
   - Then attempts to import image using `importdisk` API

4. **Import Process** (Line 350-399):
   - Calls `/nodes/{node}/qemu/{vmid}/importdisk`
   - Creates new disk (usually scsi1)
   - Tries to replace scsi0 with imported disk
   - **PROBLEM**: Import operation holds lock, preventing config updates

### The Issue

The `importdisk` API operation:
1. Creates a lock on the VM (`lock: create`)
2. Takes time to copy/import the image
3. Provider tries to update config while lock is held
4. Update fails with "VM is locked (create)" error
5. Lock never releases properly, leaving VM in stuck state

---

## Template Review

### Current Template Format

```yaml
image: "local:iso/ubuntu-22.04-cloud.img"
```

**Assessment**:
- ✅ Volid format is correct
- ❌ Triggers importdisk path (slow, can get stuck)
- ❌ Requires lock coordination
- ❌ No timeout handling for import operations

### Alternative Approaches

#### Option 1: Use Template Instead of Image Import
```yaml
image: "local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst"
```
- ✅ Direct template usage (no import needed)
- ✅ Faster creation
- ✅ No lock issues
- ❌ Different OS (standard vs cloud)

#### Option 2: Pre-import Image to Storage
- Upload image to `local-lvm` storage pool
- Use as direct disk reference
- Avoids importdisk API

#### Option 3: Fix Provider Code
- Add proper task monitoring for importdisk
- Wait for import to complete before updating config
- Add timeout and retry logic
- Better lock management

---

## Recommendations

### Immediate Fix

1. **Use existing template** (if acceptable):
   ```yaml
   image: "local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst"
   ```

2. **Or pre-import cloud image** to `local-lvm`:
   ```bash
   # On Proxmox node; `qm disk import` expects a file path, so resolve the volume ID first
   qm disk import <vmid> "$(pvesm path local:iso/ubuntu-22.04-cloud.img)" local-lvm
   ```

### Long-term Fix

1. **Enhance provider code**:
   - Monitor importdisk task status
   - Wait for completion before config updates
   - Add proper error handling and timeouts
   - Implement lock release on failure

2. **Template standardization**:
   - Document image format requirements
   - Provide pre-imported images in storage
   - Use templates when possible (faster)

---

## Verification Steps

After fixing templates:

1. **Check image availability**:
   ```bash
   pvesm list local | grep ubuntu
   pvesm list local-lvm | grep ubuntu
   ```

2. **Verify template format**:
   - Use volid format: `storage:path/to/image`
   - Or template format: `storage:vztmpl/template.tar.zst`

3. **Test VM creation**:
   - Create test VM
   - Verify disk is attached
   - Verify boot order is set
   - Verify VM can start

---

## Related Files

- `examples/production/vm-100.yaml` - Problematic template
- `examples/production/basic-vm.yaml` - Base template
- `crossplane-provider-proxmox/pkg/proxmox/client.go` - Provider code
  - Lines 274-470: Image handling and import logic

---

**Status**: ⚠️ **ISSUE IDENTIFIED - NEEDS FIX**

**Next Steps**:
1. Review all templates for image format
2. Decide on image strategy (template vs import)
3. Update templates accordingly
4. Test VM creation

163
docs/status/vms/VM_TEMPLATE_REVIEW_SUMMARY.md
Normal file
@@ -0,0 +1,163 @@
# VM Template Review Summary

**Date**: 2025-12-11
**Action**: Reviewed all VM templates for image configuration issues

---

## Template Image Format Analysis

### Current State

**Total Templates**: 29 production templates

### Image Format Distribution

1. **Volid Format** (1 template):
   - `vm-100.yaml`: `local:iso/ubuntu-22.04-cloud.img`
   - ⚠️ **Issue**: Triggers `importdisk` API, causes lock timeouts

2. **Search Format** (28 templates):
   - All others: `ubuntu-22.04-cloud`
   - ⚠️ **Issue**: Provider searches storage, can timeout if image not found

---

## Root Cause

### Problem 1: Volid Format with .img Extension
```yaml
image: "local:iso/ubuntu-22.04-cloud.img"
```

**Provider Behavior**:
1. Detects volid format (contains `:`)
2. Detects `.img` extension → triggers `importdisk`
3. Creates VM with blank disk
4. Calls `importdisk` API → **holds lock**
5. Tries to update config → **fails (locked)**
6. Lock never releases → **VM stuck**

### Problem 2: Search Format
```yaml
image: "ubuntu-22.04-cloud"
```

**Provider Behavior**:
1. Searches all storage pools for image
2. Storage operations can timeout
3. If not found → VM created without disk
4. If found → may still trigger import if `.img` extension

---

## Available Images in Storage

From Proxmox node:
- ✅ `local:iso/ubuntu-22.04-cloud.img` (660M) - Cloud image
- ✅ `local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst` (124M) - Template

---

## Recommended Solutions

### Option 1: Use Existing Template (Recommended)
```yaml
image: "local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst"
```

**Advantages**:
- ✅ Direct template usage (no import)
- ✅ Faster VM creation
- ✅ No lock issues
- ✅ Already in storage

**Disadvantages**:
- ❌ Standard Ubuntu (not cloud-init optimized)
- ❌ May need manual cloud-init setup

### Option 2: Pre-import Cloud Image to local-lvm
```bash
# On Proxmox node; `qm disk import` expects a file path and names the new volume itself
qm disk import <vmid> "$(pvesm path local:iso/ubuntu-22.04-cloud.img)" local-lvm
```

Then use the volume name assigned by Proxmox (for example, for VM 100):
```yaml
image: "local-lvm:vm-100-disk-0"
```

**Advantages**:
- ✅ Cloud-init ready
- ✅ Faster than importdisk during creation

**Disadvantages**:
- ❌ Requires manual pre-import
- ❌ Image tied to specific storage

### Option 3: Fix Provider Code (Long-term)
- Add task monitoring for `importdisk`
- Wait for import completion before config updates
- Better lock management and timeout handling

---

## Templates Requiring Update
|
||||
|
||||
### High Priority (Currently Broken)
|
||||
1. `vm-100.yaml` - Uses volid format, triggers importdisk
|
||||
|
||||
### Medium Priority (May Have Issues)
|
||||
All 28 templates using `ubuntu-22.04-cloud`:
|
||||
- May fail if image not found in storage
|
||||
- May timeout during storage search
|
||||
|
||||
---
|
||||
|
||||
## Action Plan
|
||||
|
||||
### Immediate
|
||||
1. ✅ **VMs 100 and 101 removed**
|
||||
2. ⏳ **Update `vm-100.yaml`** to use template format
|
||||
3. ⏳ **Test VM creation** with new format
|
||||
4. ⏳ **Decide on image strategy** for all templates
|
||||
|
||||
### Short-term
|
||||
1. Review all templates
|
||||
2. Standardize image format
|
||||
3. Document image requirements
|
||||
4. Test VM creation workflow
|
||||
|
||||
### Long-term
|
||||
1. Enhance provider code for importdisk handling
|
||||
2. Add image pre-import automation
|
||||
3. Create image management documentation
|
||||
|
||||
---
|
||||
|
||||
## Verification Checklist

After template updates:

- [ ] VM creates successfully
- [ ] Disk is attached (`scsi0` configured)
- [ ] Boot order is set (`boot: order=scsi0`)
- [ ] Guest agent enabled (`agent: 1`)
- [ ] Cloud-init configured (`ide2` present)
- [ ] Network configured (`net0` present)
- [ ] VM can start and boot
- [ ] No lock issues

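The configuration items in this checklist can be checked mechanically. The Go sketch below assumes the VM config is already available as a flat key/value map (for example, parsed from `qm config <vmid>` output); the function name `missingChecks` and the sample values are hypothetical. It covers only the config keys, not whether the VM actually boots or whether a lock is held.

```go
package main

import (
	"fmt"
	"strings"
)

// missingChecks returns one message per checklist item that cfg does not satisfy.
// cfg is assumed to be a flat key/value view of the VM configuration.
func missingChecks(cfg map[string]string) []string {
	var missing []string

	// Disk, cloud-init drive, and network device must be present.
	for _, key := range []string{"scsi0", "ide2", "net0"} {
		if strings.TrimSpace(cfg[key]) == "" {
			missing = append(missing, key+" is not configured")
		}
	}

	// The boot order must name scsi0, e.g. "order=scsi0" or "order=scsi0;net0".
	boot := cfg["boot"]
	if !strings.HasPrefix(boot, "order=") || !strings.Contains(boot, "scsi0") {
		missing = append(missing, "boot order does not include scsi0")
	}

	// The agent value may be "1", "1,<options>", or contain "enabled=1".
	agent := cfg["agent"]
	if agent != "1" && !strings.HasPrefix(agent, "1,") && !strings.Contains(agent, "enabled=1") {
		missing = append(missing, "guest agent is not enabled")
	}
	return missing
}

func main() {
	// Sample values are illustrative only.
	cfg := map[string]string{
		"scsi0": "local-lvm:vm-100-disk-0",
		"net0":  "virtio,bridge=vmbr0",
		"boot":  "order=scsi0",
	}
	for _, problem := range missingChecks(cfg) {
		fmt.Println("FAIL:", problem)
	}
}
```
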
---

## Related Documentation

- `docs/VM_TEMPLATE_IMAGE_ISSUE_ANALYSIS.md` - Detailed technical analysis
- `crossplane-provider-proxmox/pkg/proxmox/client.go` - Provider code
- `examples/production/vm-100.yaml` - Problematic template
- `examples/production/basic-vm.yaml` - Base template

---

**Status**: ✅ **VMs REMOVED** | ⚠️ **TEMPLATES NEED UPDATE**

114
docs/status/vms/VM_TEMPLATE_VZTMPL_ISSUE.md
Normal file
@@ -0,0 +1,114 @@
# VM Template vztmpl Format Issue

**Date**: 2025-12-11
**Issue**: vztmpl templates cannot be used for QEMU VMs

---

## Problem

The provider code attempts to use `vztmpl` templates (LXC container templates) for QEMU VMs, which is incorrect.

**Template Format**: `local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst`

**Provider Behavior** (Line 297 in `client.go`):
```go
diskConfig = fmt.Sprintf("%s,format=qcow2", imageVolid)
// Results in: local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst,format=qcow2
```

**Problem**: Proxmox cannot use a `vztmpl` template as a QEMU VM disk. This format is for LXC containers only.

---

## Root Cause

1. **vztmpl templates** are for LXC containers
2. **QEMU VMs** need either:
   - Cloud images (`.img`, `.qcow2`), which require `importdisk`
   - QEMU templates (VM templates converted from VMs)

3. The provider code doesn't distinguish between container templates and VM templates

---

## Solutions

### Option 1: Use Cloud Image (Current)
```yaml
image: "local:iso/ubuntu-22.04-cloud.img"
```

**Pros**:
- ✅ Works with current provider code
- ✅ Cloud-init ready
- ✅ Available in storage

**Cons**:
- ⚠️ Requires `importdisk` API (can cause locks)
- ⚠️ Slower VM creation
- ⚠️ Needs provider code fix for proper task monitoring

### Option 2: Create QEMU Template (Recommended Long-term)
1. Create VM from cloud image
2. Configure and customize
3. Convert to template: `qm template <vmid>`
4. Use template ID in image field

**Pros**:
- ✅ Fast cloning
- ✅ No import needed
- ✅ Pre-configured

**Cons**:
- ❌ Requires manual setup
- ❌ Need to maintain templates

### Option 3: Fix Provider Code (Best Long-term)
- Detect `vztmpl` format and reject for VMs
- Add proper task monitoring for `importdisk`
- Wait for import completion before config updates
- Better error handling

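The first item, rejecting `vztmpl` references for VMs, could be a small validation step that runs before any disk configuration is built. The Go sketch below is an illustration rather than the provider's code: `validateVMImage` is a hypothetical name, and it assumes volume IDs of the form `<storage>:<content>/<file>`, as in the examples in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// validateVMImage rejects image references that point at LXC container templates.
// References without a "<storage>:" prefix are passed through unchecked.
func validateVMImage(image string) error {
	_, volume, found := strings.Cut(image, ":")
	if !found {
		// Not a volume ID (for example a bare image name); nothing to check here.
		return nil
	}
	if strings.HasPrefix(volume, "vztmpl/") {
		return fmt.Errorf("image %q is an LXC container template (vztmpl) and cannot back a QEMU VM disk", image)
	}
	return nil
}

func main() {
	images := []string{
		"local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst",
		"local:iso/ubuntu-22.04-cloud.img",
	}
	for _, image := range images {
		if err := validateVMImage(image); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", image)
	}
}
```
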
---

## Current Status

**VM 100**: Reverted to use cloud image format
- `image: "local:iso/ubuntu-22.04-cloud.img"`
- Will use `importdisk` API
- May experience lock issues until provider code is fixed

**All Other Templates**: Still using `vztmpl` format
- ⚠️ **Will fail** when deployed
- Need to be updated to cloud image format or QEMU template

---

## Next Steps

1. **Immediate**: Update all templates to use cloud image format
2. **Short-term**: Monitor VM 100 creation with cloud image
3. **Long-term**: Fix provider code for proper template handling
4. **Long-term**: Create QEMU templates for faster deployment

---

## Template Update Required

All 29 templates need to be updated from:
```yaml
image: "local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst"
```

To:
```yaml
image: "local:iso/ubuntu-22.04-cloud.img"
```

Or use QEMU template ID if available.

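Because the change is the same string substitution in every template, it can be scripted. The Go sketch below shows only the substitution on in-memory template text; `rewriteImage` is a hypothetical helper, and reading and writing the template files is left out. Returning the replacement count makes it easy to confirm that each template was actually changed.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldImage = "local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst"
	newImage = "local:iso/ubuntu-22.04-cloud.img"
)

// rewriteImage replaces every vztmpl image reference in a template's text
// and reports how many references were replaced.
func rewriteImage(template string) (string, int) {
	count := strings.Count(template, oldImage)
	return strings.ReplaceAll(template, oldImage, newImage), count
}

func main() {
	in := "image: \"" + oldImage + "\"\n"
	out, count := rewriteImage(in)
	fmt.Printf("replaced %d reference(s)\n%s", count, out)
}
```
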
---

**Status**: ⚠️ **ISSUE IDENTIFIED - TEMPLATES NEED UPDATE**