Scenario 5: Total Site Loss

Symptoms

  • Entire home/building destroyed (fire, flood, natural disaster)
  • All local infrastructure completely lost
  • All physical equipment destroyed
  • No access to original location
  • Local network infrastructure gone
  • All local backups destroyed (TrueNAS, MinIO)
  • Only survivors: GitHub repository and Backblaze B2 backups

Impact Assessment

  • Recovery Time Objective (RTO): 8-24 hours (hardware dependent)
  • Recovery Point Objective (RPO): Up to 1 week (last weekly B2 backup)
  • Data Loss Risk: High - up to one week of changes since the last weekly B2 backup
  • Service Availability: Complete outage until infrastructure is rebuilt
  • Personal Impact: High - home disaster, potential displacement
  • Financial Impact: Significant - hardware replacement, potential insurance claim
  • Emotional Impact: High - prioritize personal safety and well-being first

Prerequisites

CRITICAL: What You Need Access To

This scenario assumes you have:

  1. Personal Safety:

    • You and your family are safe
    • You have access to temporary housing/workspace
    • You have a computer/laptop to work from
  2. Account Access (stored separately from your homelab):

    • Bitwarden master password (memorized or stored separately)
    • GitHub account access (2FA codes, recovery codes)
    • Backblaze B2 account access (2FA codes if enabled)
    • Email access (for password resets if needed)
  3. Documentation Access:

    • This disaster recovery documentation (ideally saved offline or printed)
    • Network diagrams (if stored separately)
    • Hardware configurations (if documented elsewhere)
  4. Financial Resources:

    • Budget for new hardware
    • Credit card or funds for purchases
    • Insurance policy information (if applicable)

Required Software (Install on your workstation)

# Install required CLIs
# macOS (using Homebrew)
brew install opentofu kubectl velero argocd git
brew install siderolabs/tap/talosctl

# Linux
# Download binaries from official releases:
# - OpenTofu: https://github.com/opentofu/opentofu/releases
# - kubectl: https://kubernetes.io/docs/tasks/tools/
# - talosctl: https://github.com/siderolabs/talos/releases
# - velero: https://github.com/vmware-tanzu/velero/releases
# - argocd: https://github.com/argoproj/argo-cd/releases

# Windows
# Use WSL2 with Linux instructions, or install individual binaries
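
Once everything is installed, a quick check confirms each CLI is actually on your PATH before you depend on it mid-recovery. A minimal sketch (the aws CLI is included because the B2 checks later in this runbook use it):

# Confirm every required CLI is available before starting recovery
for tool in tofu kubectl talosctl velero argocd git aws; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool -> $(command -v "$tool")"
  else
    echo "MISSING: $tool"
  fi
done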

Decision: New Hardware Location

Option A: Rebuild at Original Location

If your home is being rebuilt:

  • Same network configuration possible
  • Insurance may cover hardware replacement
  • Can use same IP addressing scheme
  • Longer timeline (waiting for home reconstruction)

Option B: Temporary/New Location

If rebuilding elsewhere:

  • May need different network configuration
  • Faster deployment possible
  • Consider cloud hosting as interim solution
  • Need to update DNS and firewall rules

Option C: Cloud Migration

Consider cloud-based recovery:

  • AWS EKS, Azure AKS, or Google GKE
  • Faster initial recovery
  • Higher ongoing costs
  • May convert to permanent solution

Recovery Procedure

Phase 1: Emergency Preparation

Step 1: Ensure Personal Safety and Stability

Before attempting technical recovery:

[ ] You and family are safe and in stable housing
[ ] Insurance claim filed (if applicable)
[ ] Essential documentation retrieved/replaced
[ ] Stable internet connection available
[ ] Working computer/laptop available
[ ] Financial resources for hardware purchases confirmed

Technical recovery can wait. Your safety comes first.

Step 2: Verify Access to Critical Accounts

# Verify Bitwarden access
# Login to: https://vault.bitwarden.com
# Retrieve all necessary credentials

# Verify GitHub access
git clone git@github.com:theepicsaxguy/homelab.git
# If SSH key lost, use HTTPS with PAT or create new SSH key

# Verify B2 access
# Login to: https://www.backblaze.com/b2/sign-in.html
# Verify buckets exist:
# - homelab-velero-b2
# - homelab-cnpg-b2
# - homelab-terraform-state

Step 3: Inventory What Survived

Document what you have access to:

# Create recovery checklist
cat > ~/recovery-checklist.md <<EOF
# Total Site Loss Recovery Checklist

## Access Verified
- [ ] Bitwarden account
- [ ] GitHub repository
- [ ] Backblaze B2 buckets
- [ ] Email accounts
- [ ] Domain registrar

## Last Known Good State
- Last B2 backup date: <check Velero backups>
- Last code commit: <check GitHub>
- Last OpenTofu state: <check B2 state bucket>

## Hardware Decisions
- Location: <original/temporary/cloud>
- Timeline: <immediate/weeks/months>
- Budget: <amount available>

## Network Planning
- Keep original IPs (10.25.150.x): YES / NO
- VLANs: <same/different>
- Internet provider: <same/different>
- Domain names: <keep/change>
EOF
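
To fill in the "Last Known Good State" section before any cluster exists, you can list the backup objects directly in B2 with the aws CLI (credentials from Bitwarden, same as Step 8). A sketch, assuming Velero's default backups/ layout and no custom bucket prefix:

# B2 credentials from Bitwarden
export AWS_ACCESS_KEY_ID="<B2_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<B2_APPLICATION_KEY>"
B2_ENDPOINT="https://s3.us-west-002.backblazeb2.com"

# Velero backups (assumes the default "backups/" prefix)
aws s3 ls s3://homelab-velero-b2/backups/ --endpoint-url="$B2_ENDPOINT"

# CNPG base backups and WAL archives (most recent last)
aws s3 ls s3://homelab-cnpg-b2/ --recursive --endpoint-url="$B2_ENDPOINT" | tail -20

# Timestamp on the OpenTofu state object
aws s3 ls s3://homelab-terraform-state/proxmox/ --endpoint-url="$B2_ENDPOINT"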

Phase 2: Acquire Infrastructure

Step 4: Procure New Hardware

Minimum Hardware Requirements:

Shopping List:
[ ] Server/Workstation
    CPU: 8+ cores (16+ recommended)
    RAM: 64GB minimum (128GB+ recommended)
    Storage: 2x 500GB+ NVMe SSDs
    Network: Gigabit Ethernet

[ ] Network Equipment
    Router with VLAN support
    Managed switch (optional)
    Network cables, power strips

[ ] Proxmox VE Installation Media
    USB drive (8GB+)
    Download: https://www.proxmox.com/en/downloads

Hardware Options:

# Option 1: Home Server Hardware
# - Dell PowerEdge (R720, R730)
# - HP ProLiant (DL380, DL360)
# - Custom build (ASUS, Supermicro boards)

# Option 2: Workstation Conversion
# - High-end workstation repurposed
# - Must support virtualization (VT-x/AMD-V)

# Option 3: Cloud Provider (Temporary)
# - Hetzner dedicated servers
# - OVH dedicated servers
# - DigitalOcean (for testing recovery procedure)

Step 5: Set Up Network Infrastructure

Configure your network:

If using same IP scheme (10.25.150.0/24):

# Configure router/switch for VLAN 150
# Assign gateway: 10.25.150.1
# Reserve IPs:
# 10.25.150.3 - Proxmox host
# 10.25.150.5-6 - Load balancers
# 10.25.150.9 - API LB VIP
# 10.25.150.10 - Cluster VIP
# 10.25.150.11-13 - Control planes
# 10.25.150.21-23 - Workers

If using different network:

You'll need to update OpenTofu configurations (see Troubleshooting section).
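
Once the new network is up (whichever addressing you choose), it is worth confirming the gateway answers and that the planned static IPs are actually free before assigning them. A rough check from a workstation on VLAN 150, using the reserved addresses above (Linux ping flags assumed):

# Gateway must be reachable before anything else
ping -c 3 10.25.150.1

# Flag any planned static IP that already answers
for ip in 10.25.150.3 10.25.150.5 10.25.150.6 10.25.150.9 10.25.150.10 \
          10.25.150.11 10.25.150.12 10.25.150.13 10.25.150.21 10.25.150.22 10.25.150.23; do
  ping -c 1 -W 1 "$ip" >/dev/null 2>&1 && echo "IN USE: $ip" || echo "free:   $ip"
done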

Phase 3: Install Base Infrastructure

Step 6: Install Proxmox VE

Follow Proxmox installation:

1. Boot from installation USB
2. Select "Install Proxmox VE"
3. Accept license
4. Select target disk (NVMe)
5. Configure:
   Country: <your country>
   Timezone: <your timezone>
   Keyboard: <your layout>
6. Set root password (SAVE IN BITWARDEN!)
7. Network configuration:
   Hostname: host3.peekoff.com
   IP: 10.25.150.3
   Netmask: 255.255.255.0
   Gateway: 10.25.150.1
   DNS: 10.25.150.1
8. Confirm and install
9. Reboot

Post-installation setup:

# SSH to Proxmox
ssh root@10.25.150.3

# Update system
apt update && apt dist-upgrade -y

# Configure storage
# For ZFS (recommended):
zpool create -f Nvme1 /dev/nvme0n1
zpool create -f Nvme2 /dev/nvme1n1

# Or use LVM/Directory storage
pvesm add dir local --path /var/lib/vz --content vztmpl,iso,backup

Configure network for VLAN 150:

# Edit network interfaces
nano /etc/network/interfaces

# Add VLAN-aware bridge:
auto vmbr0
iface vmbr0 inet static
    address 10.25.150.3/24
    gateway 10.25.150.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes

# Apply
ifreload -a
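
After reloading, confirm the bridge came up with the expected address, VLAN filtering is active, and the host can reach the gateway and the internet:

# vmbr0 should show 10.25.150.3/24 and state UP
ip -br addr show vmbr0

# Should print 1 when the bridge is VLAN-aware
cat /sys/class/net/vmbr0/bridge/vlan_filtering

# Gateway and outbound connectivity
ping -c 3 10.25.150.1
ping -c 3 1.1.1.1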

Phase 4: Restore Infrastructure as Code

Step 7: Clone GitHub Repository

# On your workstation
git clone git@github.com:theepicsaxguy/homelab.git
cd homelab

# Verify repository integrity
git log --oneline -10
git status

Step 8: Configure B2 Backend Access

# Get B2 credentials from Bitwarden
# Item: "backblaze-b2-velero-offsite" or "terraform-state-b2"

export AWS_ACCESS_KEY_ID="<B2_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<B2_APPLICATION_KEY>"

# Test B2 access
aws s3 ls s3://homelab-terraform-state \
--endpoint-url=https://s3.us-west-002.backblazeb2.com

# Should show: proxmox/terraform.tfstate

Step 9: Enable OpenTofu Remote Backend

Edit tofu/backend.tf and uncomment:

terraform {
  backend "s3" {
    bucket                      = "homelab-terraform-state"
    key                         = "proxmox/terraform.tfstate"
    region                      = "us-west-000"
    endpoint                    = "https://s3.us-west-002.backblazeb2.com"
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    use_path_style              = false
  }
}

Step 10: Initialize OpenTofu with Remote State

cd homelab/tofu

# Initialize and pull state from B2
tofu init

# Verify state
tofu show | head -50

# You should see your previous infrastructure configuration

Phase 5: Deploy Infrastructure

Step 11: Configure Proxmox Credentials

# Create API token in Proxmox UI:
# Datacenter → Permissions → API Tokens → Add
# Token ID: root@pam!tofu
# Copy the secret (shown only once!)

# Create terraform.auto.tfvars
cat > terraform.auto.tfvars <<EOF
proxmox = {
  name         = "host3"
  cluster_name = "host3"
  endpoint     = "https://10.25.150.3:8006"
  insecure     = true
  username     = "root@pam"
  api_token    = "<PROXMOX_API_TOKEN>"
}
EOF

chmod 600 terraform.auto.tfvars
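
Before running OpenTofu, it helps to confirm the token actually authenticates against the Proxmox API. A minimal check (the -k matches insecure = true above; substitute the token secret you copied):

# Expect a JSON response containing the Proxmox VE version
curl -ks https://10.25.150.3:8006/api2/json/version \
  -H "Authorization: PVEAPIToken=root@pam!tofu=<TOKEN_SECRET>"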

Step 12: Review and Apply Infrastructure

# Review what will be created
tofu plan

# Apply infrastructure
tofu apply

# Type 'yes' when prompted

This recreates:

  • All Talos Linux VMs
  • Control plane nodes (10.25.150.11-13)
  • Worker nodes (10.25.150.21-23)
  • Load balancers (10.25.150.5-6)
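
A quick way to confirm the apply produced the expected VMs is to list them on the Proxmox host (names follow whatever the node configuration in OpenTofu defines):

# Every Talos node and load balancer should appear and eventually reach 'running'
ssh root@10.25.150.3 qm list

# Show only VMs that are not yet running
ssh root@10.25.150.3 'qm list | grep -v running'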

Phase 6: Bootstrap Kubernetes

Step 13: Bootstrap Talos Cluster

cd homelab/tofu

# Export talosconfig
export TALOSCONFIG=$(pwd)/outputs/talosconfig

# Bootstrap first control plane
talosctl bootstrap -n 10.25.150.11

# Wait for bootstrap (5-10 minutes)
talosctl -n 10.25.150.11 health --wait-timeout 10m

# Generate kubeconfig
talosctl -n 10.25.150.11 kubeconfig ~/.kube/config --force

# Verify cluster
kubectl config use-context talos
kubectl get nodes -w

Step 14: Deploy Core Infrastructure via OpenTofu

All Kubernetes infrastructure is now deployed automatically by OpenTofu during the cluster bootstrap process. After Talos bootstrap completes:

cd homelab/tofu

# If Bitwarden token is not in terraform.tfvars, create the secret manually
kubectl create secret generic bitwarden-access-token \
--namespace external-secrets \
--from-literal=token="<BITWARDEN_ACCESS_TOKEN>"

# Wait for OpenTofu bootstrap module to complete
# This installs Cert Manager, External Secrets Operator, ArgoCD, and ApplicationSets

# Verify ArgoCD is running
kubectl -n argocd get pods

# Verify ApplicationSets are created
kubectl get applicationsets -n argocd

# Wait for infrastructure ApplicationSet to sync
kubectl wait --for=jsonpath='{.status.sync.status}'=Synced application/infrastructure -n argocd --timeout=600s

# Verify core services are ready
kubectl get pods -n cert-manager
kubectl get pods -n external-secrets
kubectl get pods -n argocd

Phase 7: Restore Data from B2

Step 15: Deploy Velero and Verify Backups

# Verify Velero is running
kubectl -n velero get pods

# Check B2 backup location
kubectl -n velero get backupstoragelocations backblaze-b2

# List available backups
velero backup get --storage-location backblaze-b2

# Find latest backup
LATEST_BACKUP=$(velero backup get --storage-location backblaze-b2 \
  --selector backup-type=weekly-offsite \
  -o json | jq -r '.items | sort_by(.metadata.creationTimestamp) | .[-1].metadata.name')

echo "Latest backup: $LATEST_BACKUP"

# Check backup age
velero backup describe $LATEST_BACKUP | grep "Created:"

Step 16: Restore Applications and Data

# Restore from latest B2 backup
velero restore create site-loss-restore-$(date +%Y%m%d-%H%M%S) \
  --from-backup $LATEST_BACKUP \
  --storage-location backblaze-b2 \
  --exclude-namespaces velero,cert-manager,external-secrets,longhorn-system,kube-system,argocd

# Monitor restore
velero restore get
velero restore logs site-loss-restore-<timestamp> -f

# Watch pods
kubectl get pods -A -w

Step 17: Restore PostgreSQL Databases

For each CNPG cluster, create restore configuration:

Example template:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <cluster-name>
  namespace: <namespace>
spec:
  instances: 2

  bootstrap:
    recovery:
      source: b2-backup
      recoveryTarget:
        targetImmediate: true

  externalClusters:
    - name: b2-backup
      barmanObjectStore:
        destinationPath: s3://homelab-cnpg-b2/<namespace>/<cluster-name>
        endpointURL: https://s3.us-west-002.backblazeb2.com
        s3Credentials:
          accessKeyId:
            name: b2-cnpg-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: b2-cnpg-credentials
            key: AWS_SECRET_ACCESS_KEY
        wal:
          compression: gzip
          encryption: AES256

  storage:
    size: 20Gi
    storageClass: longhorn

Apply for each database:

kubectl apply -f restore-auth-db.yaml
kubectl apply -f restore-media-db.yaml
# ... etc
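
Recovery progress for each cluster can be followed through its status and the recovery pods' logs. A sketch, using the same placeholders as the template above:

# Overall phase per cluster ("Cluster in healthy state" when recovery finishes)
kubectl get clusters.postgresql.cnpg.io -A

# Phase of a single cluster
kubectl -n <namespace> get cluster <cluster-name> -o jsonpath='{.status.phase}{"\n"}'

# Follow the recovery while WALs are replayed from B2
kubectl -n <namespace> logs -l cnpg.io/cluster=<cluster-name> -f --tail=100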

Validation

Infrastructure Health Check

# All nodes Ready
kubectl get nodes

# All infrastructure pods Running
kubectl get pods -A | grep -v Running | grep -v Completed

# Longhorn healthy
kubectl -n longhorn-system get volumes
# Access Longhorn UI and verify all volumes healthy

# All PVCs Bound
kubectl get pvc -A | grep -v Bound

Application Validation

# List all applications
kubectl get pods -A

# Check critical services
kubectl -n auth get pods
kubectl -n media get pods

# Verify databases
kubectl get clusters -A

# Test database connectivity
kubectl -n auth exec -it <postgres-pod> -- psql -U postgres -c "SELECT version();"

Data Integrity Check

Check recovery point:

# For each database, check latest data timestamp
kubectl -n <namespace> exec -it <postgres-pod> -- psql -U postgres <<EOF
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
EOF

# Check for any timestamped data
# SELECT MAX(created_at), MAX(updated_at) FROM <critical_table>;
# Compare to incident timestamp to understand data loss

External Access Validation

# Check ingress
kubectl get ingress -A

# Verify external DNS (if configured)
# Update DNS records if IP changed

# Test application access
curl -k https://<your-domain>

# Verify SSL certificates
kubectl get certificates -A

Post-Recovery Tasks

1. Comprehensive Incident Documentation

cat > ~/total-site-loss-incident.md <<EOF
# Total Site Loss Disaster Recovery Report

**Incident Date**: <date of disaster>
**Recovery Start**: <date recovery started>
**Recovery Complete**: <date services restored>
**Total RTO**: <hours from incident to full recovery>
**RPO (Data Loss)**: <days since last B2 backup>

## Incident Details
- Type: <fire/flood/natural disaster>
- Location: <address>
- Personal impact: <family status, housing>
- Equipment lost: <full inventory>

## What Survived
✓ GitHub repository (theepicsaxguy/homelab)
✓ Backblaze B2 backups
  - Last Velero backup: <date>
  - Last CNPG backup: <date>
  - OpenTofu state: <date>
✓ Bitwarden vault access
✓ Domain names and DNS

## What Was Lost
✗ All physical hardware
✗ Local MinIO backups
✗ TrueNAS and local storage
✗ Data created after: <last backup date>

## Recovery Timeline
<detailed hour-by-hour timeline>

## Financial Impact
- Hardware replacement: $<amount>
- Insurance coverage: $<amount>
- Out-of-pocket: $<amount>
- Cloud costs (temporary): $<amount>

## Data Loss Assessment
<detailed analysis of what data was lost>

## What Worked Well
- B2 backups were intact and restorable
- GitHub repository had all infrastructure code
- Bitwarden had all credentials
- Documentation was accessible
- OpenTofu state in B2 was critical

## What Could Be Improved
- More frequent B2 backups (weekly → daily)
- Printed emergency documentation off-site
- Spare hardware at alternate location
- Cloud-based DR environment ready to go
- Better documentation of manual steps

## Lessons Learned
<key takeaways>

## Follow-up Actions
- [ ] File insurance claim
- [ ] Update backup frequency
- [ ] Create printed DR runbook (store off-site)
- [ ] Set up cloud-based DR environment
- [ ] Document new hardware configuration
- [ ] Update monitoring and alerting
- [ ] Schedule quarterly DR drills
EOF

2. Implement Immediate Improvements

Increase backup frequency:

# Change Velero B2 backups from weekly to daily
kubectl -n velero edit schedule weekly-offsite-schedule

# Update the cron expression:
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM instead of weekly

# Either rename the schedule to match its new cadence,
# or create a separate daily schedule alongside it (see the sketch below)
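
Creating a separate daily schedule with the Velero CLI avoids hand-editing the weekly one; a sketch (the 30-day TTL is an assumption, adjust to your retention policy):

# Create a daily off-site schedule next to the existing weekly one
velero schedule create daily-offsite-schedule \
  --schedule="0 3 * * *" \
  --storage-location backblaze-b2 \
  --ttl 720h

# Confirm both schedules exist
velero schedule get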

Add backup monitoring:

# Create alerts for:
# - Backup failures
# - Backup age > 48 hours
# - B2 bucket access issues
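
Until real alerting is in place, a small script run from cron can flag stale backups. A sketch using the Velero CLI and jq (48-hour threshold to match the alert above; GNU date assumed):

#!/usr/bin/env bash
# Warn and exit non-zero if the newest B2 backup is older than 48 hours
set -euo pipefail

latest=$(velero backup get --storage-location backblaze-b2 -o json \
  | jq -r '.items | sort_by(.metadata.creationTimestamp) | .[-1].metadata.creationTimestamp')

age_hours=$(( ( $(date +%s) - $(date -d "$latest" +%s) ) / 3600 ))

if [ "$age_hours" -gt 48 ]; then
  echo "WARNING: newest B2 backup is ${age_hours}h old (created $latest)"
  exit 1
fi
echo "OK: newest B2 backup is ${age_hours}h old"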

3. Create Off-Site Emergency Kit

Physical emergency kit (store at safe location):

[ ] Printed copy of this disaster recovery documentation
[ ] Network diagrams and IP addressing
[ ] Hardware configuration notes
[ ] Bitwarden emergency access instructions
[ ] GitHub account recovery codes
[ ] B2 account information
[ ] Domain registrar contact info
[ ] Insurance policy numbers
[ ] Emergency contact list
[ ] USB drive with:
    - Proxmox ISO
    - Talos Image
    - CLI tools (kubectl, tofu, etc.)
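
Staging the USB contents can be scripted so the kit stays current. A rough sketch, assuming the drive is mounted at /mnt/usb and the CLIs are already installed on your workstation:

# Copy current CLI binaries plus an offline copy of the repo and this runbook
KIT=/mnt/usb/dr-kit
mkdir -p "$KIT/bin" "$KIT/docs"

for tool in kubectl talosctl velero argocd tofu; do
  cp "$(command -v "$tool")" "$KIT/bin/" && echo "copied $tool"
done

# Offline copy of the infrastructure repo (includes this documentation)
git clone --depth 1 https://github.com/theepicsaxguy/homelab.git "$KIT/docs/homelab"

# The Proxmox ISO and Talos image still need to be downloaded manually:
#   https://www.proxmox.com/en/downloads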

4. Consider Permanent DR Infrastructure

Options to prevent future total loss:

  1. Cloud-based standby environment:

    # Maintain dormant cluster in cloud
    # Use cheap instances (can upscale when needed)
    # Regular restore tests to cloud environment
  2. Co-location or friend's house:

    # Store spare server at alternate location
    # Can be brought online quickly
    # Shared homelab with trusted friend
  3. Geographic replication:

    # Run two clusters in different locations
    # Primary + DR site with replication
    # More complex but near-zero RTO

5. Update Financial Planning

# Budget for:
# - Spare hardware fund
# - Increased cloud costs (B2 storage, compute)
# - Insurance coverage review
# - Emergency hardware purchase capacity

6. Schedule Regular DR Drills

# Quarterly: Test restore from B2 to cloud environment
# Semi-annually: Full recovery drill with actual hardware
# Annually: Complete site loss scenario with new hardware

# Document each drill
# Update procedures based on findings
# Rotate responsibilities (if family member might need to recover)

Troubleshooting

Different Network Configuration Required

If you can't use 10.25.150.0/24:

# Update tofu/config.auto.tfvars
network = {
  gateway     = "192.168.1.1"    # New gateway
  vip         = "192.168.1.10"   # New VIP
  api_lb_vip  = "192.168.1.9"    # New API VIP
  cidr_prefix = 24
  dns_servers = ["192.168.1.1"]
  bridge      = "vmbr0"
  vlan_id     = 0                # Disable VLAN if not supported
}

# Update nodes_config with new IPs
# Then apply with tofu

B2 Credentials Lost

If you can't access B2:

# Login to B2 web console
https://www.backblaze.com/b2/sign-in.html

# If credentials lost:
# 1. Use email recovery
# 2. Answer security questions
# 3. Contact B2 support with account verification

# Generate new application keys
# Update all secrets that use B2

GitHub Repository Inaccessible

If you can't clone GitHub repo:

# Option 1: Use HTTPS instead of SSH
git clone https://github.com/theepicsaxguy/homelab.git

# Option 2: Generate new SSH key
ssh-keygen -t ed25519 -C "your_email@example.com"
cat ~/.ssh/id_ed25519.pub
# Add to GitHub: Settings → SSH Keys

# Option 3: Use GitHub web interface
# Download repository as ZIP from GitHub.com

Bitwarden Access Lost

This is critical - Bitwarden contains all credentials:

# Try recovery:
# 1. Email recovery (if configured)
# 2. Emergency access (if configured)
# 3. Recovery codes (if printed/stored)

# If all else fails:
# - Contact B2 support for account recovery
# - Reset GitHub password via email
# - Create new Proxmox passwords
# - Manually recreate all secrets in cluster

Hardware Insufficient for Full Cluster

If you can only afford partial hardware:

# Option 1: Deploy smaller cluster
# Update tofu config to 1 control plane, 1 worker
# Reduce resource allocations

# Option 2: Cloud migration (temporary)
# Deploy to Hetzner/OVH/DigitalOcean
# Restore services
# Migrate back to hardware when available

# Option 3: Selective restore
# Restore only critical applications
# Leave non-essential services offline

Reference

Emergency Contacts

Critical Services:

Personal Contacts:

  • Update with your emergency contacts
  • Technical friends who can help
  • Family members with Bitwarden emergency access
  • Insurance adjuster contact

Final Notes

Remember:

  1. Your safety and well-being come first. Technical recovery can wait.
  2. This is a documented, tested procedure. You have everything you need in GitHub + B2.
  3. Data loss is limited to your RPO (up to 1 week). This is acceptable for a total loss scenario.
  4. Insurance may cover hardware. Document everything for your claim.
  5. The homelab will come back. It's just infrastructure and data - you're safe, and that's what matters.

You can do this. One step at a time.