Proxmox CSI Storage

This guide explains how storage works in the homelab using the Proxmox CSI (Container Storage Interface) plugin for dynamic volume provisioning.

Overview

The homelab uses the Proxmox CSI Plugin (csi.proxmox.sinextra.dev) as the primary storage provisioner for new Kubernetes workloads. This provides dynamic volume provisioning directly from Proxmox datastores without requiring additional storage layers.

Current Storage Classes:

proxmox-csi — Primary storage class (Retain policy, Immediate binding, expandable)
longhorn — Legacy storage class for existing workloads (being phased out)
longhorn-static — Legacy static provisioning

The Proxmox CSI plugin allows applications to automatically request and receive persistent storage without manual intervention, with volumes created directly on the Proxmox Nvme1 ZFS datastore.

How Dynamic Provisioning Works

The Proxmox CSI plugin provides fully automatic storage provisioning. You don't need to pre-create volumes, manually attach disks, or configure storage backends. Just create a PVC and the CSI plugin handles everything.

The Process (Completely Automatic)

You create a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce  # Required field; Proxmox CSI supports RWO only
  storageClassName: proxmox-csi  # References the StorageClass
  resources:
    requests:
      storage: 10Gi

The CSI plugin sees the PVC and automatically:
- Calls the Proxmox API to create a new virtual disk: vm-XXXX-pvc-<uuid>
- Creates a PersistentVolume (PV) in Kubernetes
- Binds the PVC to the PV
- Attaches the disk to the VM of the appropriate node once a pod using the PVC is scheduled
- Formats the disk with ext4 (or the specified filesystem) on first mount

Done! Your pod can now mount the volume. The entire process is automatic; no manual intervention is needed.
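
To watch this sequence from the command line, a small polling helper can wait until the claim reports the Bound phase. This is a minimal sketch rather than anything shipped in the homelab repo: the function name `wait_for_pvc_bound` and the `default` namespace in the usage comment are assumptions made for illustration.

```shell
# Hypothetical helper: poll a PVC until its phase is "Bound" or a timeout expires.
# Usage: wait_for_pvc_bound <namespace> <pvc-name> [timeout-seconds]
#   e.g. wait_for_pvc_bound default my-app-data 120
wait_for_pvc_bound() {
  ns="$1"
  name="$2"
  timeout="${3:-60}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # .status.phase is Pending until the provisioner has created and bound a PV
    phase="$(kubectl get pvc "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Bound" ]; then
      echo "PVC $ns/$name is Bound"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "PVC $ns/$name not Bound after ${timeout}s (last phase: ${phase:-unknown})" >&2
  return 1
}
```

If the helper times out, `kubectl describe pvc` usually shows the provisioning event that explains why.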

Key Benefits

Zero manual steps: No need to SSH into Proxmox or run pvesm commands
Automatic placement: The plugin decides where to create the volume; with Immediate binding this happens as soon as the PVC exists, without waiting for a pod to be scheduled
Direct ZFS access: Volumes are ZFS volumes (zvols) on Nvme1, providing high performance
Volume expansion: Resize PVCs dynamically without recreating them
Safe lifecycle: When you delete a PVC, the volume is retained (Retain policy) for data safety

Why Not Pre-Provision Volumes?

Unlike older storage systems, you should never pre-create volumes manually. The CSI plugin is designed for dynamic provisioning - it creates volumes on-demand as applications request them.

The bootstrap/volumes Terraform module exists only for migrating pre-existing Proxmox volumes into Kubernetes, not for creating new storage.

Bootstrap Configuration

The storage bootstrap is managed through OpenTofu in the tofu/bootstrap.tf file.

Proxmox CSI Plugin Setup

The proxmox-csi-plugin module in tofu/bootstrap.tf automatically configures:

Proxmox User & Role: Creates a kubernetes-csi@pve user with minimal CSI permissions
API Token: Generates an API token with privileges_separation = false (see Security Configuration for the reasoning)
Kubernetes Resources:
- Creates csi-proxmox namespace with PodSecurity privileged labels
- Stores Proxmox credentials in a Kubernetes secret

Command to deploy:

cd tofu
tofu apply

Terraform Module Reference:

# From tofu/bootstrap.tf
module "proxmox-csi-plugin" {
  source = "./bootstrap/proxmox-csi-plugin"

  proxmox = {
    cluster_name = var.proxmox_cluster
    endpoint     = var.proxmox.endpoint
    insecure     = var.proxmox.insecure
  }
}

Security Configuration

The CSI plugin uses a least-privilege security model:

| Setting | Value | Purpose |
| --- | --- | --- |
| Role Privileges | `Sys.Audit`, `VM.Audit`, `VM.Config.Disk`, `Datastore.*` | Minimal required for CSI operations |
| Token | `privileges_separation = false` | Token inherits full user privileges, enabling storage/volume access |
| Namespace | `pod-security.kubernetes.io/enforce: privileged` | Required for CSI node plugins |

Why privileges_separation = false?

Token needs full access to Proxmox resources (storage allocation, VM disk operations)
With privileges_separation = true, the token is restricted to CSI role only, causing "not authorized" errors
Full user privileges are required for the CSI plugin to manage volumes across nodes

Using Storage in Applications

Dynamic Provisioning Example

Most applications should use dynamic provisioning:

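apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: proxmox-csi
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: default  # Must be the same namespace as the PVC
spec:
  replicas: 1  # RWO volume: only one node can mount it at a time
  selector:
    matchLabels:
      app: postgres  # Required; must match the pod template labels
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-data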

That's It!

Notice what you didn't have to do:

No manual volume creation in Proxmox
No SSH into Proxmox nodes
No pvesm alloc commands
No manual PV creation
No volume attachment configuration

The CSI plugin handles all of this automatically when you create the PVC. This is the power of dynamic provisioning!

StorageClass Configuration

The Proxmox CSI plugin typically provides a default StorageClass. You can create additional StorageClasses for different storage backends or performance tiers:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: proxmox-ssd
provisioner: csi.proxmox.sinextra.dev
parameters:
  storage: local-zfs
  cache: writethrough
  ssd: "true"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true

Volume Binding Mode Decision

Why We Use `Immediate` Instead of `WaitForFirstConsumer`

The homelab's proxmox-csi StorageClass uses volumeBindingMode: Immediate. This section documents why this decision was made and the trade-offs involved.
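
To confirm which binding mode is actually live in the cluster, the StorageClass can be queried directly. The sketch below assumes the class is named proxmox-csi as described here; `check_binding_mode` is a hypothetical helper name, not something from the repo.

```shell
# Hypothetical helper: compare a StorageClass's binding mode with the expected one.
# Usage: check_binding_mode [storageclass] [expected-mode]
check_binding_mode() {
  sc="${1:-proxmox-csi}"
  expected="${2:-Immediate}"
  # volumeBindingMode is a top-level field of the StorageClass object
  actual="$(kubectl get storageclass "$sc" -o jsonpath='{.volumeBindingMode}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $sc uses $actual"
  else
    echo "MISMATCH: $sc uses ${actual:-<unset>}, expected $expected" >&2
    return 1
  fi
}
```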

Context: Single-Zone Cluster

This homelab runs a single-zone Kubernetes cluster where all worker nodes are in the same physical location (same Proxmox cluster, same datacenter, same network). There are no multiple availability zones, regions, or geographically distributed nodes.

The Problem with WaitForFirstConsumer

What is WaitForFirstConsumer?

PVC creation does NOT immediately provision a PV
Volume provisioning is delayed until a pod that uses the PVC is scheduled
The volume is then created on the same node/zone where the pod is scheduled
Purpose: Ensures volume locality in multi-zone clusters (volume created in same zone as pod)

Why it caused problems in our setup:

- Velero Restore Deadlock (Primary Issue)
  - During disaster recovery, Velero restores PVCs and Pods simultaneously
  - PVCs stay Pending (waiting for pod to be scheduled)
  - Pods stay Pending (waiting for PVC to be bound)
  - Result: Chicken-and-egg deadlock - nothing progresses without manual intervention
  - Manual fix required: annotating each PVC with volume.kubernetes.io/selected-node=<node> to break deadlock
- Unbalanced Pod Distribution
  - After Velero restore with manual node annotations, all pods scheduled on same node
  - Created single point of failure (57% of pods on one node after migration)
  - Kubernetes scheduler couldn't rebalance because PVCs were already bound to specific node
- No Topology Benefit in Single-Zone
  - In single-zone clusters, all nodes can access all storage equally
  - Topology awareness provides zero benefit
  - WaitForFirstConsumer only adds complexity without any advantage
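
For reference, the manual fix mentioned above (annotating each Pending PVC with volume.kubernetes.io/selected-node) can be scripted roughly as follows. This is a sketch of the workaround under WaitForFirstConsumer, not a procedure taken from the repo; the function name and the `worker-1` node in the usage comment are hypothetical.

```shell
# Hypothetical helper: annotate every Pending PVC in a namespace with a chosen node,
# which lets the provisioner proceed when binding is stuck under WaitForFirstConsumer.
# Usage: annotate_pending_pvcs <namespace> <node-name>
#   e.g. annotate_pending_pvcs default worker-1
annotate_pending_pvcs() {
  ns="$1"
  node="$2"
  kubectl get pvc -n "$ns" \
    -o jsonpath='{range .items[?(@.status.phase=="Pending")]}{.metadata.name}{"\n"}{end}' |
  while IFS= read -r pvc; do
    [ -n "$pvc" ] || continue
    # stdin is redirected so the command cannot consume the list being read
    kubectl annotate pvc "$pvc" -n "$ns" \
      "volume.kubernetes.io/selected-node=$node" --overwrite </dev/null
  done
}
```

Pinning every claim to one node is exactly what produced the unbalanced distribution described above, so spreading the node argument across several calls is worth considering.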

The Solution: Immediate Binding

What is Immediate?

PV is provisioned as soon as PVC is created
Volume is created immediately, no waiting for pod scheduling
In single-zone: volume created on any available node (same outcome as WaitForFirstConsumer)

Why we chose Immediate:

✅ Fixes Velero Restore Issues

PVCs bind immediately upon creation during restore
No chicken-and-egg deadlock
Disaster recovery "just works" without manual intervention

✅ Kubernetes Scheduler Handles Pod Distribution

Scheduler's built-in spreading logic distributes pods across nodes
No manual topology constraints needed
Pods naturally balance across worker nodes over time

✅ Simpler Operations

No special handling required for restores
No manual node annotations needed
Fewer moving parts = fewer failure modes

✅ Same Outcome in Single-Zone

Volume is reachable from whichever node runs the pod (shared storage pool)
No performance difference
No locality benefit lost (there was none to begin with)

Technical Analysis: Immediate vs WaitForFirstConsumer

When WaitForFirstConsumer is Essential:

Multi-Zone Topology (NOT our setup)

Nodes in different availability zones (us-east-1a, us-east-1b)
Volumes must be created in same zone as pod (cross-zone attachment often impossible)
Cloud providers charge for cross-zone traffic ($0.01-0.02/GB)
Latency penalty for cross-zone access (5-10ms+ added latency)
This is the primary use case for WaitForFirstConsumer

Heterogeneous Storage (NOT our setup)

Different nodes have different storage types (local NVMe vs network SAN)
Need to ensure volume created on node with correct backend
We have shared ZFS storage - all nodes access same datastore

When Immediate is Correct:

Single-Zone Clusters (our setup)

All nodes in same physical location, same storage pool
No cross-zone penalties to avoid
No topology constraints to enforce
WaitForFirstConsumer provides ZERO benefit, only operational complexity

Shared Storage Architecture (our setup)

Proxmox ZFS datastore accessible from all worker nodes
Volume location is irrelevant - any node can attach any volume
No performance or cost difference based on volume placement

Resource Usage Analysis

Claim: "Immediate wastes resources by provisioning unused volumes"

Reality Check:

- ZFS volumes on this datastore are thin-provisioned, so they only consume space for data actually written
  - Creating a 100Gi PVC allocates almost nothing until data is written (only filesystem metadata)
  - No resource waste from "pre-provisioning"
- PVCs are created on demand; we don't create unused PVCs
  - StatefulSets create PVCs when pods are created
  - Manual PVCs are only created when needed
  - Theoretical problem with no real-world occurrence
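
The thin-provisioning claim can be checked on a Proxmox node by comparing each volume's provisioned size with the space it actually uses. The sketch below assumes the pool is named Nvme1 as in this guide and that it is run where the `zfs` CLI is available; `zvol_usage` is a hypothetical helper name.

```shell
# Hypothetical helper: show provisioned size vs. used space for every zvol in a pool.
# Usage (on a Proxmox node): zvol_usage [pool]
zvol_usage() {
  pool="${1:-Nvme1}"
  # -H: no header, tab-separated; -p: exact byte values; -r -t volume: all zvols below the pool
  zfs list -H -p -r -t volume -o name,volsize,used "$pool" |
  awk -F '\t' '{
    pct = ($2 > 0) ? 100 * $3 / $2 : 0
    printf "%s\tprovisioned=%s\tused=%s\t(%.1f%%)\n", $1, $2, $3, pct
  }'
}
```

A thin-provisioned volume shows a used value far below its provisioned size; if the two are roughly equal for an empty volume, the datastore is reserving space up front.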

Measured Impact: NONE

Immediate binding has identical resource usage to WaitForFirstConsumer in practice
Both modes result in same number of volumes, same data stored
No measurable difference in storage consumption, API calls, or performance

What We Actually Gave Up: Nothing

WaitForFirstConsumer Benefits:

✅ Topology-aware placement → Not applicable (single-zone)
✅ Deferred provisioning → Not useful (thin-provisioned storage)
✅ Guaranteed co-location → Not beneficial (shared storage pool)

WaitForFirstConsumer Costs:

❌ Velero restore failures (chicken-and-egg deadlock)
❌ Manual intervention required for disaster recovery
❌ Unbalanced pod distribution after restores
❌ Increased operational complexity
❌ Harder to troubleshoot PVC binding issues

Net Result: WaitForFirstConsumer has ZERO benefits and significant costs in single-zone clusters with shared storage.

Decision Matrix

| Cluster Architecture | Correct Binding Mode | Reason |
| --- | --- | --- |
| Single-zone cluster | `Immediate` | No topology constraints, no cross-zone penalties, simpler DR |
| Multi-zone cluster | `WaitForFirstConsumer` | Essential for zone-aware placement, avoids cross-zone costs |
| Heterogeneous storage | `WaitForFirstConsumer` | Ensures volume created on node with correct storage backend |
| Shared storage pool | `Immediate` | Volume location irrelevant, all nodes access same storage |

Our Setup: Single-zone cluster + shared ZFS storage = Immediate is objectively correct

Migration Path to Multi-Zone

If expanding to multi-zone architecture:

Change StorageClass to WaitForFirstConsumer

volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values: [zone-a, zone-b, zone-c]

- Update Velero backup strategy
  - Document manual PVC node annotation procedure for restores
  - Or accept unbalanced distribution and rely on descheduler for rebalancing
  - Or use CSI snapshots instead of filesystem backups (if Proxmox CSI supports it)
- Test disaster recovery procedure
  - Verify that restores complete despite the WaitForFirstConsumer deadlock
  - Document manual intervention steps for production runbooks

Implementation Notes

The Proxmox CSI Helm chart hardcodes volumeBindingMode: WaitForFirstConsumer in the StorageClass template. To override this:

Modified Chart Template (charts/proxmox-csi-plugin/templates/storageclass.yaml):

volumeBindingMode: {{ default "WaitForFirstConsumer" $storage.volumeBindingMode }}

Values Override (k8s/infrastructure/storage/proxmox-csi/values.yaml):

storageClass:
  - name: proxmox-csi
    volumeBindingMode: Immediate  # Override hardcoded value
    # ... other settings

This allows configuring the binding mode while maintaining chart upgrade compatibility.
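
One way to confirm the override took effect before applying it is to render the chart locally and inspect the resulting binding modes. This sketch uses the chart and values paths given above; the release name `proxmox-csi` and the helper name `rendered_binding_modes` are assumptions.

```shell
# Hypothetical check: render the chart locally and print every volumeBindingMode it produces.
# Usage: rendered_binding_modes [chart-dir] [values-file]
rendered_binding_modes() {
  chart="${1:-charts/proxmox-csi-plugin}"
  values="${2:-k8s/infrastructure/storage/proxmox-csi/values.yaml}"
  # helm template renders manifests without contacting the cluster
  helm template proxmox-csi "$chart" -f "$values" |
    grep 'volumeBindingMode:' |
    sed 's/^[[:space:]]*//'
}
```

With the values override in place, the output should contain a line reading volumeBindingMode: Immediate for the proxmox-csi class.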

Related Issues

Longhorn to Proxmox CSI Migration: Velero restore deadlock was discovered during storage migration (see Migration Guide)
Pod Distribution: Without topology constraints, Kubernetes scheduler naturally spreads pods across nodes based on resource availability
Future Multi-Zone Support: If expanding to multi-zone cluster, change to WaitForFirstConsumer and add allowedTopologies to StorageClass

Volume Management

Listing Volumes

View all persistent volumes:

kubectl get pv
kubectl get pvc -A

Expanding Volumes

If the StorageClass allows expansion (allowVolumeExpansion: true), you can resize volumes:

kubectl patch pvc my-app-data -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
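
After patching, the requested size and the size the volume actually reports can differ for a while. The helper below is a minimal sketch for comparing the two; `pvc_resize_status` is a hypothetical name, and the comparison is a plain string match, so it assumes both values use the same unit.

```shell
# Hypothetical helper: compare the requested size of a PVC with its reported capacity.
# Usage: pvc_resize_status <namespace> <pvc-name>
pvc_resize_status() {
  ns="$1"
  name="$2"
  requested="$(kubectl get pvc "$name" -n "$ns" -o jsonpath='{.spec.resources.requests.storage}')"
  actual="$(kubectl get pvc "$name" -n "$ns" -o jsonpath='{.status.capacity.storage}')"
  # String comparison only: "20Gi" and "20480Mi" would be reported as different
  if [ "$requested" = "$actual" ]; then
    echo "resize complete: $actual"
  else
    echo "resize pending: requested $requested, current ${actual:-unknown}"
  fi
}
```

If the status stays pending, the filesystem resize may only finish once a pod mounts the volume; `kubectl describe pvc` shows the relevant conditions.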

Deleting Volumes

The reclaim policy determines what happens when a PVC is deleted:

Delete: Volume is automatically deleted from Proxmox
Retain: Volume is kept in Proxmox for manual recovery

kubectl delete pvc my-app-data
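
With the Retain policy, deleting a PVC leaves its PV in the Released phase and the disk in Proxmox untouched. A sketch for finding those leftovers is shown below; `list_released_pvs` is a hypothetical helper name, and the volume handle column assumes the PVs were provisioned through CSI.

```shell
# Hypothetical helper: list PVs whose claim was deleted but whose volume was retained.
# Output columns: PV name, former claim (namespace/name), CSI volume handle
list_released_pvs() {
  kubectl get pv \
    -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\t"}{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'
}
```

Each listed PV still references a disk that has to be recovered or removed by hand.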

Access Mode Limitations

ReadWriteMany (RWX) Not Supported

Proxmox CSI only supports ReadWriteOnce (RWO) access mode. The plugin does not support ReadWriteMany (RWX) or ReadOnlyMany (ROX) access modes.

Why RWX doesn't work:

Proxmox CSI creates dedicated virtual disks on ZFS datastores
Each disk can only be attached to one VM/node at a time
There is no shared filesystem backend (like NFS or CephFS) to support multi-node access

If your application requires RWX:

Verify actual need: Many applications claim RWX but work fine with RWO when pods are scheduled on the same node
Use RWO with pod scheduling: Deploy pods using podAffinity rules (topology key kubernetes.io/hostname) to ensure all pods requiring shared storage run on the same node
Deploy NFS storage: For true multi-writer workloads, deploy a separate NFS-based StorageClass (e.g., from a dedicated NAS or cloud NFS service)
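
Co-locating pods on one node is done with a pod affinity rule keyed on the node hostname. The sketch below prints a patch fragment for a Deployment; the function name and the `app`/`media` label in the usage comment are hypothetical, so substitute whatever label the sharing pods actually carry.

```shell
# Hypothetical helper: print a Deployment patch that forces pods onto the same node
# as other pods carrying the given label.
# Usage: render_pod_affinity_patch <label-key> <label-value>
#   e.g. render_pod_affinity_patch app media
render_pod_affinity_patch() {
  key="$1"
  value="$2"
  cat <<EOF
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  ${key}: ${value}
              topologyKey: kubernetes.io/hostname
EOF
}
```

The output can be saved to a file and merged into the Deployment manifest, or passed to `kubectl patch deployment <name> --patch`.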

Migrating from Longhorn RWX PVCs:

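When migrating workloads from Longhorn (which supported RWX), PVCs must end up with RWO. Kubernetes normally treats spec.accessModes as immutable on an existing PVC, so if the patch below is rejected, recreate the claim with ReadWriteOnce instead: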

# Find RWX PVCs
kubectl get pvc -A -o jsonpath='{range .items[?(@.spec.accessModes[0]=="ReadWriteMany")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

# Patch each RWX PVC to RWO
kubectl patch pvc <pvc-name> -n <namespace> -p '{"spec":{"accessModes":["ReadWriteOnce"]}}'

Important: Patch RWX PVCs before creating Velero backups for migration. The storage class mapping only handles storage class transformation, not access mode changes.

Troubleshooting

Access Mode Limitations

Proxmox CSI only supports ReadWriteOnce (RWO) access mode. See the Access Mode Limitations section for details.

CSI Plugin Not Provisioning Volumes

Check the CSI plugin is running:
```
kubectl get pods -n csi-proxmox
```

Check CSI controller logs:

kubectl logs -n csi-proxmox -l app=proxmox-csi-controller

Verify the Proxmox credentials secret:

kubectl get secret -n csi-proxmox proxmox-csi-plugin -o yaml

Volume Stuck in Pending

Check PVC events:

kubectl describe pvc <pvc-name>

Common issues:

Insufficient storage on Proxmox datastore
Network connectivity between Kubernetes and Proxmox
Invalid storage backend name
CSI plugin not running
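
To narrow down which of these applies, the events recorded against the claim are usually the quickest signal. A minimal sketch, with `pvc_events` as a hypothetical helper name:

```shell
# Hypothetical helper: show events for a single PVC, oldest first.
# Usage: pvc_events <namespace> <pvc-name>
pvc_events() {
  ns="$1"
  name="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=$name" \
    --sort-by=.lastTimestamp
}
```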

Permission Errors

Verify the CSI user has correct permissions in Proxmox:

pveum user list | grep kubernetes-csi
pveum acl list | grep kubernetes-csi
