A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields.

The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish, never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
* New step `bump_blueprint` after `bump_pin`: locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
* Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict.
* Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
* New TestBootstrapKit_BlueprintVersionLockstepSweep: walks platform/* and products/*, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
* Failure messages name the file, drift direction, and the exact sed command to fix, so drift remediation is mechanical.

Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward, synced down to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one).

Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Velero
Kubernetes backup/restore for disaster recovery. Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.5) — runs on every host cluster Catalyst manages. Backups land in the velero-backups bucket on SeaweedFS, which is Catalyst's unified S3 encapsulation layer; SeaweedFS's cold-tier policy automatically transitions backup objects to the configured cloud archival backend (Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc.) so backups survive cluster failure without any direct cloud-S3 call from Velero itself.
Status: Accepted | Updated: 2026-04-28
Overview
Velero provides Kubernetes-native backup. All Velero output goes to the same single S3 endpoint — seaweedfs.storage.svc:8333, bucket velero-backups. SeaweedFS handles the rest: hot-tier in-cluster for fast restore of recent backups; cold-tier in cloud archival storage for backups beyond the configured warm-window.
```mermaid
flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        Velero[Velero]
        Apps[Applications]
        PVs[Persistent Volumes]
    end

    subgraph SW["SeaweedFS (in-cluster S3 encapsulation)"]
        Bucket[velero-backups bucket]
        TierMgr[Tier Manager]
    end

    subgraph Archival["Cloud archive backend (cold tier)"]
        R2[Cloudflare R2]
        S3[AWS S3]
        GCS[GCP GCS]
        Hetzner[Hetzner Object Storage]
        OCI[OCI Object Storage]
    end

    Apps --> Velero
    PVs --> Velero
    Velero -->|"Backup"| Bucket
    Bucket --> TierMgr
    TierMgr -->|"After warm window"| Archival
```
Why route through SeaweedFS
| Property | Direct cloud-S3 calls | Through SeaweedFS encapsulation |
|---|---|---|
| Number of S3 endpoints in Catalyst components | N (one per consumer × cloud) | 1 (seaweedfs.storage.svc:8333) |
| Hot-restore latency for recent backups | Cloud round-trip | Near-zero (in-cluster cache) |
| Audit / lifecycle / encryption boundary | Per-component | One central boundary |
| Air-gap deployment | Requires direct cloud reachability | Works with SeaweedFS-only mode (see SRE §7) |
Backups survive cluster failure because SeaweedFS's cold tier is the cloud archival backend, not the in-cluster volumes. Even if the entire host cluster is destroyed, backups beyond the warm window already live in the cold backend (R2 / Glacier / etc.) and a restoring SeaweedFS can read them through.
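The single-endpoint design described above could be expressed as one BackupStorageLocation pointing at SeaweedFS. This is a minimal sketch, not the chart's actual manifest: the endpoint and bucket come from the Overview, while the location name, the `http` scheme, the region placeholder, and the credentials Secret name are assumptions for illustration.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: seaweedfs                     # hypothetical location name
  namespace: velero
spec:
  provider: aws                       # SeaweedFS is addressed through its S3-compatible API
  default: true                       # backups without an explicit storageLocation land here
  objectStorage:
    bucket: velero-backups
  config:
    region: us-east-1                 # placeholder; assumed to be ignored by the in-cluster endpoint
    s3ForcePathStyle: "true"          # path-style addressing for a non-AWS endpoint
    s3Url: http://seaweedfs.storage.svc:8333   # scheme is an assumption
  credential:
    name: seaweedfs-credentials       # hypothetical Secret name
    key: cloud
```

With a location like this, Velero never needs cloud credentials of its own; tiering to the archival backend stays a SeaweedFS concern.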
Storage Backend Options
| Provider | Availability | Egress Fees | Notes |
|---|---|---|---|
| Cloud Provider Storage | Default | Varies | Hetzner, OCI, Huawei OBS |
| Cloudflare R2 | Always available | Free | Zero egress, multi-cloud friendly |
| AWS S3 | Available | $0.09/GB | Full featured |
| GCP GCS | Available | $0.12/GB | Full featured |
Default: Cloud provider's object storage (Hetzner Object Storage, OCI Object Storage, etc.)
Alternative: Cloudflare R2 for zero egress fees, useful for multi-cloud or egress-heavy scenarios.
Configuration
Cloudflare R2 (Zero Egress)
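```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: r2-backup
  namespace: velero
spec:
  provider: aws                 # R2 is reached through the S3-compatible AWS plugin
  objectStorage:
    bucket: <org>-backups       # the bucket is nested under objectStorage in the Velero API
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials        # Secret in the velero namespace
    key: cloud
```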
AWS S3
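```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: s3-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <org>-backups       # the bucket is nested under objectStorage in the Velero API
  config:
    region: us-east-1
  credential:
    name: aws-credentials       # Secret in the velero namespace
    key: cloud
```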
GCP GCS
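```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: gcs-backup
  namespace: velero
spec:
  provider: gcp
  objectStorage:
    bucket: <org>-backups       # the bucket is nested under objectStorage in the Velero API
  credential:
    name: gcp-credentials       # Secret in the velero namespace
    key: cloud
```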
Backup Schedule
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - velero
      - kube-system
    includedResources:
      - "*"
    excludedResources:
      - events
      - events.events.k8s.io
    storageLocation: r2-backup
    ttl: 720h  # 30 days
```
Backup Strategy
| Resource | Schedule | Retention |
|---|---|---|
| All namespaces | Daily 2 AM | 30 days |
| Databases (labels) | Hourly | 7 days |
| Secrets | Daily | 90 days |
| PVs (snapshots) | Daily | 14 days |
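The non-default rows of the table can be written as additional Schedule objects. The sketch below covers the database and Secrets rows under stated assumptions: the schedule names and the `backup-tier: database` label are hypothetical, and `r2-backup` is reused from the schedule above only for continuity.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-databases          # hypothetical name
  namespace: velero
spec:
  schedule: "0 * * * *"           # top of every hour
  template:
    labelSelector:
      matchLabels:
        backup-tier: database     # hypothetical label; use whatever label marks database workloads
    storageLocation: r2-backup
    ttl: 168h                     # 7 days
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-secrets             # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"           # daily at 2 AM
  template:
    includedNamespaces:
      - "*"
    includedResources:
      - secrets                   # back up Secret objects only
    storageLocation: r2-backup
    ttl: 2160h                    # 90 days
```

Each retention period maps directly onto `ttl`: 7 days is 168h and 90 days is 2160h.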
Multi-Region Backup
```mermaid
flowchart TB
    subgraph Region1["Region 1"]
        V1[Velero]
        K1[Kubernetes]
    end

    subgraph Region2["Region 2"]
        V2[Velero]
        K2[Kubernetes]
    end

    subgraph Archival["Archival S3"]
        Bucket[Shared Bucket<br/>or Cross-Region Replication]
    end

    V1 -->|"Backup"| Bucket
    V2 -->|"Backup"| Bucket
    Bucket -->|"Restore"| V1
    Bucket -->|"Restore"| V2
```
Both regions can:
- Back up to the same bucket (under different prefixes)
- Restore from either region's backups
- Use the shared bucket for cross-region disaster recovery
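One way to realize the shared-bucket layout is to give each cluster a read-write location under its own prefix plus a read-only location for the peer region's prefix. The following is a sketch for the Region 1 cluster; the location names and prefixes are hypothetical, and the R2 endpoint and credentials are borrowed from the configuration section purely as an example.

```yaml
# Applied on the Region 1 cluster: writes go under region-1/
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: region-1                  # hypothetical name
  namespace: velero
spec:
  provider: aws
  default: true
  objectStorage:
    bucket: <org>-backups
    prefix: region-1              # hypothetical prefix
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials
    key: cloud
---
# Also on the Region 1 cluster: read-only view of Region 2's backups
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: region-2-readonly         # hypothetical name
  namespace: velero
spec:
  provider: aws
  accessMode: ReadOnly            # restores only; Region 1 never writes or deletes here
  objectStorage:
    bucket: <org>-backups
    prefix: region-2              # hypothetical prefix
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials
    key: cloud
```

The Region 2 cluster would mirror this with the prefixes swapped. Keeping the peer location read-only avoids two Velero instances managing the same backup objects.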
Restore Procedure
```mermaid
sequenceDiagram
    participant Op as Operator
    participant Velero as Velero
    participant S3 as Archival S3
    participant K8s as Kubernetes

    Op->>Velero: velero restore create
    Velero->>S3: Fetch backup
    S3->>Velero: Return backup data
    Velero->>K8s: Restore resources
    Velero->>K8s: Restore PV data
    K8s->>Op: Restoration complete
```
Commands
```bash
# List available backups
velero backup get

# Restore entire backup
velero restore create --from-backup daily-backup-20260116

# Restore specific namespace
velero restore create --from-backup daily-backup-20260116 \
  --include-namespaces databases

# Restore to different namespace
velero restore create --from-backup daily-backup-20260116 \
  --include-namespaces databases \
  --namespace-mappings databases:databases-restored
```
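The same restore can also be declared as a Restore object, which may suit clusters where changes are applied from Git rather than from an operator's shell. This is a sketch equivalent to the last command above; the object name is hypothetical and the backup name is the illustrative one used in the commands.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: databases-restored-20260116   # hypothetical name
  namespace: velero
spec:
  backupName: daily-backup-20260116   # illustrative backup name from the commands above
  includedNamespaces:
    - databases
  namespaceMapping:
    databases: databases-restored     # source namespace: target namespace
```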
Operations
Check Backup Status
```bash
# List backups
velero backup get

# Describe specific backup
velero backup describe daily-backup-20260116

# Check backup logs
velero backup logs daily-backup-20260116
```
Verify Backup Location
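```bash
# Check backup storage locations; the PHASE column should read Available
velero backup-location get

# Verify connection for a single location by reading its reported phase
kubectl -n velero get backupstoragelocation r2-backup \
  -o jsonpath='{.status.phase}'
```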
Manual Backup
```bash
# Create manual backup
velero backup create manual-backup-$(date +%Y%m%d)

# Backup specific namespace
velero backup create db-backup-$(date +%Y%m%d) \
  --include-namespaces databases
```
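A one-off backup can likewise be written as a Backup object when a reviewable manifest is preferred over an ad hoc command. A minimal sketch of the namespace-scoped backup above; the object name, storage location, and TTL are assumptions.

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: db-backup-20260116        # hypothetical name; a manifest cannot expand $(date ...)
  namespace: velero
spec:
  includedNamespaces:
    - databases
  storageLocation: r2-backup      # assumed; omit to use the default location
  ttl: 720h                       # assumed 30-day retention, matching the daily schedule
```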
Consequences
Positive:
- K8s-native backup
- Flexible storage backends
- Zero egress with Cloudflare R2
- Cross-region restore capability
- Incremental backups
Negative:
- Requires external S3 (by design)
- PV backup requires CSI snapshots
- Large restores take time
Part of OpenOva