Velero

Kubernetes backup/restore for disaster recovery. Velero is per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.5): it runs on every host cluster Catalyst manages. Backups land in the velero-backups bucket on SeaweedFS, Catalyst's unified S3 encapsulation layer. SeaweedFS's cold-tier policy then automatically transitions backup objects to the configured cloud archival backend (Cloudflare R2, AWS S3, Hetzner Object Storage, etc.), so backups survive cluster failure without Velero itself making any direct cloud-S3 call.

Status: Accepted | Updated: 2026-04-28


Overview

Velero provides Kubernetes-native backup. All Velero output goes to a single S3 endpoint, seaweedfs.storage.svc:8333, in the bucket velero-backups. SeaweedFS handles the rest: the hot tier stays in-cluster for fast restore of recent backups, and the cold tier holds backups older than the configured warm window in cloud archival storage.

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        Velero[Velero]
        Apps[Applications]
        PVs[Persistent Volumes]
    end

    subgraph SW["SeaweedFS (in-cluster S3 encapsulation)"]
        Bucket[velero-backups bucket]
        TierMgr[Tier Manager]
    end

    subgraph Archival["Cloud archive backend (cold tier)"]
        R2[Cloudflare R2]
        S3[AWS S3]
        GCS[GCP GCS]
        Hetzner[Hetzner Object Storage]
        OCI[OCI Object Storage]
    end

    Apps --> Velero
    PVs --> Velero
    Velero -->|"Backup"| Bucket
    Bucket --> TierMgr
    TierMgr -->|"After warm window"| Archival
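
One way to express the single-endpoint setup above as a Velero resource is sketched below. The bucket and endpoint come from this README; the resource name, the Secret name, the region value, and the plain-http scheme are assumptions for illustration and are not confirmed by the chart.

```yaml
# Sketch only: name, Secret, region and URL scheme are assumed values.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: seaweedfs                # assumed resource name
  namespace: velero
spec:
  provider: aws                  # S3-compatible endpoints use the aws provider
  objectStorage:
    bucket: velero-backups       # bucket named in this README
  config:
    region: us-east-1            # placeholder value (assumption)
    s3ForcePathStyle: "true"     # path-style addressing for an in-cluster endpoint
    s3Url: http://seaweedfs.storage.svc:8333   # scheme is an assumption
  credential:
    name: seaweedfs-credentials  # assumed Secret holding the S3 keys
    key: cloud
```

With a location like this in place, Velero never needs a cloud endpoint of its own; tiering to the archival backend stays a SeaweedFS concern.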

Why route through SeaweedFS

| Property | Direct cloud-S3 calls | Through SeaweedFS encapsulation |
|----------|-----------------------|---------------------------------|
| Number of S3 endpoints in Catalyst components | N (one per consumer × cloud) | 1 (seaweedfs.storage.svc:8333) |
| Hot-restore latency for recent backups | Cloud round-trip | Near-zero (in-cluster cache) |
| Audit / lifecycle / encryption boundary | Per-component | One central boundary |
| Air-gap deployment | Requires direct cloud reachability | Works with SeaweedFS-only mode (see SRE §7) |

Backups survive cluster failure because SeaweedFS's cold tier is the cloud archival backend, not the in-cluster volumes. Even if the entire host cluster is destroyed, backups older than the warm window already live in the cold backend (R2 / Glacier / etc.), and a freshly restored SeaweedFS can read them back through that tier.


Storage Backend Options

| Provider | Availability | Egress Fees | Notes |
|----------|--------------|-------------|-------|
| Cloud Provider Storage | Default | Varies | Hetzner, OCI, Huawei OBS |
| Cloudflare R2 | Always available | Free | Zero egress, multi-cloud friendly |
| AWS S3 | Available | $0.09/GB | Full featured |
| GCP GCS | Available | $0.12/GB | Full featured |

Default: Cloud provider's object storage (Hetzner Object Storage, OCI Object Storage, etc.)

Alternative: Cloudflare R2 for zero egress fees, useful for multi-cloud or egress-heavy scenarios.


Configuration

Cloudflare R2 (Zero Egress)

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: r2-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <org>-backups
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials
    key: cloud
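
The `credential` block above points at a Secret key named `cloud`. A minimal sketch of that Secret follows, assuming the Velero AWS plugin's usual credentials-file layout; the two key values are placeholders to be replaced with an R2 API token pair.

```yaml
# Sketch only: placeholder key values, credentials-file layout assumed.
apiVersion: v1
kind: Secret
metadata:
  name: r2-credentials           # matches spec.credential.name above
  namespace: velero
type: Opaque
stringData:
  cloud: |                       # matches spec.credential.key above
    [default]
    aws_access_key_id=<r2-access-key-id>
    aws_secret_access_key=<r2-secret-access-key>
```

In practice the Secret would more likely be produced by a secrets manager than committed as plain YAML.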

AWS S3

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: s3-backup
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <org>-backups
  config:
    region: us-east-1
  credential:
    name: aws-credentials
    key: cloud

GCP GCS

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: gcs-backup
  namespace: velero
spec:
  provider: gcp
  objectStorage:
    bucket: <org>-backups
  credential:
    name: gcp-credentials
    key: cloud

Backup Schedule

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - velero
      - kube-system
    includedResources:
      - "*"
    excludedResources:
      - events
      - events.events.k8s.io
    storageLocation: r2-backup
    ttl: 720h  # 30 days

Backup Strategy

| Resource | Schedule | Retention |
|----------|----------|-----------|
| All namespaces | Daily 2 AM | 30 days |
| Databases (labels) | Hourly | 7 days |
| Secrets | Daily | 90 days |
| PVs (snapshots) | Daily | 14 days |
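
As an illustration of the "Databases (labels)" row, an hourly Schedule with a 7-day TTL could look roughly like this. The schedule name and the label key/value are hypothetical; this README does not say which label marks database workloads.

```yaml
# Sketch only: schedule name and label selector are assumed values.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-databases         # assumed name
  namespace: velero
spec:
  schedule: "0 * * * *"          # at minute 0 of every hour
  template:
    includedNamespaces:
      - "*"
    labelSelector:
      matchLabels:
        backup-tier: database    # hypothetical label on database resources
    storageLocation: r2-backup
    ttl: 168h                    # 7 days
```

The other rows follow the same pattern, varying only the cron expression, the resource filter, and the `ttl`.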

Multi-Region Backup

flowchart TB
    subgraph Region1["Region 1"]
        V1[Velero]
        K1[Kubernetes]
    end

    subgraph Region2["Region 2"]
        V2[Velero]
        K2[Kubernetes]
    end

    subgraph Archival["Archival S3"]
        Bucket[Shared Bucket<br/>or Cross-Region Replication]
    end

    V1 -->|"Backup"| Bucket
    V2 -->|"Backup"| Bucket
    Bucket -->|"Restore"| V1
    Bucket -->|"Restore"| V2

Both regions can:

  • Back up to the same bucket (using different prefixes)
  • Restore from either region's backups
  • Use the shared bucket for cross-region disaster recovery
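
A sketch of the "different prefixes" arrangement, as seen from Region 1: one read-write location for its own backups and one read-only location exposing Region 2's backups for restore. The location names and prefix values are assumptions; the bucket, endpoint, and Secret reuse the R2 example from the Configuration section.

```yaml
# Sketch only: location names and prefixes are assumed values.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: shared-region-1          # assumed name
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <org>-backups
    prefix: region-1             # assumed prefix for this region's backups
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials
    key: cloud
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: shared-region-2-readonly # assumed name
  namespace: velero
spec:
  provider: aws
  accessMode: ReadOnly           # restore-only view of the other region
  objectStorage:
    bucket: <org>-backups
    prefix: region-2             # assumed prefix for the other region
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://<account-id>.r2.cloudflarestorage.com
  credential:
    name: r2-credentials
    key: cloud
```

Marking the peer region's location read-only keeps one cluster from pruning or overwriting the other's backups while still allowing cross-region restore.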

Restore Procedure

sequenceDiagram
    participant Op as Operator
    participant Velero as Velero
    participant S3 as Archival S3
    participant K8s as Kubernetes

    Op->>Velero: velero restore create
    Velero->>S3: Fetch backup
    S3->>Velero: Return backup data
    Velero->>K8s: Restore resources
    Velero->>K8s: Restore PV data
    K8s->>Op: Restoration complete

Commands

# List available backups
velero backup get

# Restore entire backup
velero restore create --from-backup daily-backup-20260116

# Restore specific namespace
velero restore create --from-backup daily-backup-20260116 \
  --include-namespaces databases

# Restore to different namespace
velero restore create --from-backup daily-backup-20260116 \
  --include-namespaces databases \
  --namespace-mappings databases:databases-restored
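
For GitOps-style workflows, the last command above can also be written declaratively. The following Restore resource is a sketch of the equivalent; the resource name is hypothetical, and the backup name is the same illustrative one used in the commands.

```yaml
# Sketch only: resource name is an assumed value.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: databases-restored-20260116   # hypothetical name
  namespace: velero
spec:
  backupName: daily-backup-20260116
  includedNamespaces:
    - databases
  namespaceMapping:
    databases: databases-restored     # source namespace: target namespace
```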

Operations

Check Backup Status

# List backups
velero backup get

# Describe specific backup
velero backup describe daily-backup-20260116

# Check backup logs
velero backup logs daily-backup-20260116

Verify Backup Location

# Check backup storage locations
velero backup-location get

# Verify connection (the PHASE column should read Available)
velero backup-location get r2-backup

Manual Backup

# Create manual backup
velero backup create manual-backup-$(date +%Y%m%d)

# Backup specific namespace
velero backup create db-backup-$(date +%Y%m%d) \
  --include-namespaces databases

Consequences

Positive:

  • K8s-native backup
  • Flexible storage backends
  • Zero egress with Cloudflare R2
  • Cross-region restore capability
  • Incremental backups

Negative:

  • Requires external S3 (by design)
  • PV backup requires CSI snapshots
  • Large restores take time

Part of OpenOva