Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
369
docs/VM_CREATION_FAILURE_ANALYSIS.md
Normal file
@@ -0,0 +1,369 @@
# VM Creation Failure Analysis & Prevention Guide

## Executive Summary

This document catalogs all working and non-working attempts at VM creation, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.

**Critical Finding**: The `importdisk` API endpoint (`POST /nodes/{node}/qemu/{vmid}/importdisk`) is **NOT IMPLEMENTED** in the Proxmox version running on ml110-01, causing all VM creation attempts with cloud images to fail and create orphaned VMs with stuck lock files.

---

## 1. Root Cause Analysis

### Primary Failure: importdisk API Not Implemented

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397-400`

**Error**:
```
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
```

**Impact**:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (`lock-{vmid}.conf`)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)

**Code Path**:
```go
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
	// ... stops VM ...
	// Line 397: Attempts importdisk API call
	if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
		// Line 399: Returns error, VM already created but orphaned
		return nil, errors.Wrapf(err, "failed to import image...")
	}
}
```

**Controller Behavior**:
```go
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Returns error, but VM already exists in Proxmox
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
```

---

## 2. Working vs Non-Working Attempts

### ✅ WORKING Approaches

#### 2.1 VM Deletion (Force Removal)
**Script**: `scripts/force-remove-all-remaining.sh`
**Method**:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with `purge=1&skiplock=1` parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion

**Success Rate**: 100% (all 66 VMs eventually deleted)

**Key Success Factors**:
1. **Aggressive unlocking**: 10 unlock attempts with 1-second delays
2. **Long wait times**: 60-second timeout for delete tasks
3. **Verification**: Confirms VM is actually deleted before proceeding

#### 2.2 Controller Scaling
**Command**: `kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0`
**Result**: Immediately stops all VM creation processes
**Status**: ✅ Effective

### ❌ NON-WORKING Approaches

#### 2.3 importdisk API Usage
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`
**Problem**: API endpoint not implemented in Proxmox version
**Error**: `501 Method not implemented`
**Impact**: All VM creations with cloud images fail

#### 2.4 Single Unlock Attempt
**Problem**: Lock files persist after a single unlock
**Result**: Delete operations time out with "can't lock file" errors
**Solution**: Multiple unlock attempts (10x) required

#### 2.5 Short Timeouts
**Problem**: 20-second timeout insufficient for delete operations
**Result**: Tasks appear to fail but actually complete later
**Solution**: 60-second timeout with verification

#### 2.6 No Error Recovery
**Problem**: Controller doesn't handle partial VM creation
**Result**: Orphaned VMs accumulate when importdisk fails
**Impact**: Status never updates, infinite retry loop

---

## 3. Codebase Inconsistencies & Repeated Failures

### 3.1 CRITICAL: No Error Recovery for Partial VM Creation

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`

**Problem**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// ❌ VM already created in Proxmox, but error returned
	// ❌ No cleanup of orphaned VM
	// ❌ Status never updated (VMID stays 0)
	// ❌ Controller will retry forever, creating new VMs
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
```

**Fix Required**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	logger.Error(err, "VM creation failed")
	// Check if VM was partially created. This requires CreateVM to return the
	// partially created VM together with the error; today it returns nil.
	if createdVM != nil && createdVM.ID > 0 {
		// Attempt cleanup
		logger.Info("Attempting cleanup of partially created VM", "vmID", createdVM.ID)
		cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
		if cleanupErr != nil {
			logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
		}
	}
	// Don't requeue immediately - wait longer to prevent rapid retries.
	// controller-runtime ignores Result when a non-nil error is returned, so
	// the error is logged above and nil is returned to honor RequeueAfter.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```

### 3.2 CRITICAL: importdisk API Not Checked Before Use

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`

**Problem**: Code assumes `importdisk` API exists without checking Proxmox version or API availability.

**Fix Required**:
```go
// Before attempting importdisk, check if API is available.
// Run this check before the blank VM is created so a failure leaves nothing behind.
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
	return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
	return nil, errors.Errorf("importdisk API not supported in Proxmox version %s; use template cloning or pre-imported images instead", pveVersion)
}

// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
```
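
A version gate like Option 1 first has to parse the version string the node reports. The sketch below parses the format seen on ml110-01 (`pve-manager/9.1.1/42db4a6cf33dac83`) and compares it against a minimum. The names `pveVersion`, `parsePVEVersion`, and `atLeast` are hypothetical, and the minimum used in `main` is a placeholder: this analysis does not establish which Proxmox versions, if any, expose the endpoint, so probing the endpoint directly may turn out to be the more reliable check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pveVersion holds the numeric part of a pve-manager version string.
type pveVersion struct{ Major, Minor, Patch int }

// parsePVEVersion parses strings of the form "pve-manager/<x.y.z>/<hash>".
// Other formats are rejected rather than guessed at.
func parsePVEVersion(s string) (pveVersion, error) {
	parts := strings.Split(s, "/")
	if len(parts) < 2 {
		return pveVersion{}, fmt.Errorf("unexpected version string %q", s)
	}
	nums := strings.Split(parts[1], ".")
	if len(nums) != 3 {
		return pveVersion{}, fmt.Errorf("unexpected version number %q", parts[1])
	}
	var v [3]int
	for i, n := range nums {
		x, err := strconv.Atoi(n)
		if err != nil {
			return pveVersion{}, fmt.Errorf("version component %q: %w", n, err)
		}
		v[i] = x
	}
	return pveVersion{Major: v[0], Minor: v[1], Patch: v[2]}, nil
}

// atLeast reports whether v is greater than or equal to other.
func (v pveVersion) atLeast(other pveVersion) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	if v.Minor != other.Minor {
		return v.Minor > other.Minor
	}
	return v.Patch >= other.Patch
}

func main() {
	v, err := parsePVEVersion("pve-manager/9.1.1/42db4a6cf33dac83")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Placeholder threshold for illustration only.
	minimum := pveVersion{Major: 8, Minor: 0, Patch: 0}
	fmt.Printf("parsed %+v, at least %+v: %v\n", v, minimum, v.atLeast(minimum))
}
```
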

### 3.3 CRITICAL: No Status Update on Partial Failure

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`

**Problem**: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.

**Current Flow**:
1. VM created in Proxmox (VMID assigned)
2. importdisk fails
3. Error returned, status never updated
4. `vm.Status.VMID == 0` still true
5. Controller retries, creates new VM

**Fix Required**: Add intermediate status updates or cleanup on failure.

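
One way to read "intermediate status updates" is to split creation into two steps and record the VMID as soon as the blank VM exists, so a failed import can never trigger a second create. The sketch below illustrates that idea only; `vmStatus`, its `Phase` values, the `provisioner` interface, and `failingImport` are hypothetical and do not reflect the provider's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// vmStatus is a hypothetical, simplified version of the resource status.
type vmStatus struct {
	VMID  int
	Phase string // "", "Created", "ImportFailed", or "Ready"
}

// provisioner is a hypothetical split of CreateVM into two steps.
type provisioner interface {
	CreateBlank() (int, error)
	ImportImage(vmid int) error
}

// reconcile creates the VM at most once and records the VMID before importing.
func reconcile(st *vmStatus, p provisioner) error {
	if st.VMID == 0 {
		id, err := p.CreateBlank()
		if err != nil {
			return fmt.Errorf("create VM: %w", err)
		}
		// Record the VMID immediately so later failures cannot cause a second create.
		st.VMID, st.Phase = id, "Created"
	}
	if st.Phase == "Ready" {
		return nil
	}
	if err := p.ImportImage(st.VMID); err != nil {
		st.Phase = "ImportFailed"
		return fmt.Errorf("import image into VM %d: %w", st.VMID, err)
	}
	st.Phase = "Ready"
	return nil
}

// failingImport simulates a node where the import step always fails.
type failingImport struct{ created int }

func (f *failingImport) CreateBlank() (int, error) {
	f.created++
	return 100 + f.created, nil
}

func (f *failingImport) ImportImage(vmid int) error {
	return errors.New("501 Method not implemented")
}

func main() {
	st := &vmStatus{}
	p := &failingImport{}
	for i := 0; i < 3; i++ {
		_ = reconcile(st, p) // three reconcile passes, all failing at import
	}
	fmt.Println("VMs created:", p.created, "VMID:", st.VMID, "phase:", st.Phase)
	// prints: VMs created: 1 VMID: 101 phase: ImportFailed
}
```

In this sketch the import is still retried on every pass; combining it with an error category that stops retries for unsupported endpoints would avoid that.
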

### 3.4 Inconsistent Error Handling

**Location**: Multiple locations

**Problem**: Errors are requeued in different ways, with no consistent strategy for retryable vs non-retryable errors.

**Examples**:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → returned as an error with no explicit delay, so controller-runtime requeues it with its default rate-limited backoff (a longer delay is needed)

**Fix Required**: Define error categories and consistent requeue strategies.


### 3.5 Lock File Handling Inconsistency

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821` (UnlockVM)

**Problem**: UnlockVM function exists but is never called during VM creation failure recovery.

**Fix Required**: Call UnlockVM before DeleteVM in error recovery paths.

---

## 4. ml110-01 Node Status: "Unknown" in Web Portal

### Investigation Results

**API Status Check**: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2 GB used / 270 GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`

**Web Portal Issue**: Likely a display/UI issue, not an actual node problem.

**Possible Causes**:
1. Web UI cache issue
2. Cluster quorum/communication issue (if in cluster)
3. Web UI version mismatch
4. Browser cache

**Recommendation**:
- Refresh web portal
- Check cluster status: `pvecm status` (if in cluster)
- Verify node is reachable: `ping ml110-01`
- Check Proxmox logs: `/var/log/pveproxy/access.log`

---

## 5. Recommendations to Prevent Future Failures

### 5.1 Immediate Fixes (Critical)

1. **Add Error Recovery for Partial VM Creation**
   - Detect when VM is created but import fails
   - Clean up orphaned VMs automatically
   - Update status to prevent infinite retries

2. **Check importdisk API Availability**
   - Verify Proxmox version supports importdisk
   - Provide fallback method (template cloning, pre-imported images)
   - Document supported Proxmox versions

3. **Improve Status Update Logic**
   - Update status even on partial failures
   - Add conditions to track failure states
   - Prevent infinite retry loops

### 5.2 Short-term Improvements

1. **Add VM Cleanup on Controller Startup**
   - Scan for orphaned VMs (created but no corresponding Kubernetes resource)
   - Clean up VMs with stuck locks
   - Log cleanup actions

2. **Implement Exponential Backoff**
   - Current: Fixed 30s requeue
   - Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
   - Prevents rapid retry storms

3. **Add Health Checks**
   - Verify Proxmox API endpoints before use
   - Check node status before VM creation
   - Validate image availability


### 5.3 Long-term Improvements

1. **Alternative Image Import Methods**
   - Use `qm disk import` via SSH (if available)
   - Pre-import images as templates
   - Use Proxmox templates instead of cloud images

2. **Better Observability**
   - Add metrics for VM creation success/failure rates
   - Track orphaned VM counts
   - Alert on stuck VM creation loops

3. **Comprehensive Testing**
   - Test with different Proxmox versions
   - Test error recovery scenarios
   - Test lock file handling

---

## 6. Code Locations Requiring Fixes

### High Priority

1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`**
   - Add error recovery for partial VM creation
   - Implement cleanup logic

2. **`crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`**
   - Check importdisk API availability
   - Add fallback methods
   - Improve error messages

3. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`**
   - Add intermediate status updates
   - Prevent infinite retry loops

### Medium Priority

4. **`crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`**
   - Use UnlockVM in error recovery paths

5. **Error handling throughout controller**
   - Standardize requeue strategies
   - Add error categorization

---

## 7. Testing Checklist

Before deploying fixes, test:

- [ ] VM creation with importdisk API (if supported)
- [ ] VM creation with template cloning
- [ ] Error recovery when importdisk fails
- [ ] Cleanup of orphaned VMs
- [ ] Lock file handling
- [ ] Controller retry behavior
- [ ] Status update on partial failures
- [ ] Multiple concurrent VM creations
- [ ] Node status checks
- [ ] Proxmox version compatibility

---

## 8. Documentation Updates Needed

1. **README.md**: Document supported Proxmox versions
2. **API Compatibility**: List which APIs are required
3. **Troubleshooting Guide**: Add section on orphaned VMs
4. **Error Recovery**: Document automatic cleanup features
5. **Image Requirements**: Clarify template vs cloud image usage

---

## 9. Lessons Learned

1. **Always verify API availability** before using it
2. **Implement error recovery** for partial resource creation
3. **Update status early** to prevent infinite retry loops
4. **Test with actual infrastructure** versions, not just mocks
5. **Monitor for orphaned resources** and implement cleanup
6. **Use exponential backoff** for retries
7. **Document failure modes** and recovery procedures

---

## 10. Summary

**Primary Issue**: `importdisk` API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop

**Root Causes**:
1. No API availability check
2. No error recovery for partial creation
3. No status update on failure
4. No cleanup of orphaned resources

**Solutions**:
1. Check API availability before use
2. Implement error recovery and cleanup
3. Update status even on partial failures
4. Add health checks and monitoring

**Status**: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.

---

*Last Updated: 2025-12-12*
*Document Version: 1.0*