openova/platform/cluster-autoscaler-hcloud
e3mrah cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.
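
The lockstep assertion described above can be sketched in Go as follows. This is a minimal illustration, not the code in main_test.go: the helper names (`firstMatch`, `checkLockstep`) and the sample manifests are hypothetical. It assumes spec.version is the first `version:` key at 2-space indent, as the workflow step does.

```go
package main

import (
	"fmt"
	"regexp"
)

// Top-level `version:` key in Chart.yaml.
var chartVersionRe = regexp.MustCompile(`(?m)^version:[ \t]*["']?([^"'\s#]+)`)

// `version:` at exactly two-space indent in blueprint.yaml.
// Simplification: the first such key is treated as spec.version.
var specVersionRe = regexp.MustCompile(`(?m)^  version:[ \t]*["']?([^"'\s#]+)`)

// firstMatch returns the first captured version string, if any.
func firstMatch(re *regexp.Regexp, doc string) (string, bool) {
	m := re.FindStringSubmatch(doc)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// checkLockstep returns nil when both versions agree, or an error naming the drift.
func checkLockstep(chartYAML, blueprintYAML string) error {
	chart, ok := firstMatch(chartVersionRe, chartYAML)
	if !ok {
		return fmt.Errorf("no top-level version key in Chart.yaml")
	}
	bp, ok := firstMatch(specVersionRe, blueprintYAML)
	if !ok {
		return fmt.Errorf("no version key at 2-space indent in blueprint.yaml")
	}
	if chart != bp {
		return fmt.Errorf("drift: blueprint.yaml spec.version=%s, Chart.yaml version=%s", bp, chart)
	}
	return nil
}

func main() {
	// Hypothetical manifests for illustration only.
	chart := "apiVersion: v2\nname: bp-example\nversion: 1.2.3\n"
	blueprint := "kind: Blueprint\nmetadata:\n  name: bp-example\nspec:\n  version: 1.0.0\n"
	fmt.Println(checkLockstep(chart, blueprint))
}
```

A real sweep would parse the YAML properly and walk platform/* and products/*; the regex approach only mirrors the sed-based rewrite the workflow performs.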

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
  Chart.yaml as authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
chart fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427) 2026-05-12 06:11:30 +04:00
blueprint.yaml fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
README.md fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965) 2026-05-05 16:21:59 +04:00

bp-cluster-autoscaler-hcloud

Catalyst Blueprint umbrella chart for the Kubernetes cluster-autoscaler configured with the Hetzner Cloud cloud-provider. Adds and removes Hetzner workers in response to FailedScheduling events on a Sovereign's k3s cluster.

Why

Per issue #767, a freshly-provisioned Sovereign reaches FailedScheduling the moment the bootstrap-kit's RAM aggregate exceeds the static worker pool the operator picked in the wizard. Live evidence (otech92): two cpx32 workers couldn't fit the external-secrets-webhook Pod because the bootstrap-kit consumed the full 16 GB. The fix is two-pronged:

  1. Pre-launch: the wizard's StepReview surfaces an estimated footprint so the operator picks a worker pool that fits.
  2. Runtime: this blueprint adds cluster-autoscaler so the Sovereign scales workers up and down on demand, bounded by the min/max the operator chose at launch.
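
The arithmetic behind both prongs can be sketched as below. This is a rough illustration under stated assumptions: 8 GB per cpx32 worker is inferred from the "two cpx32 workers, 16 GB" figure above, the 17 GB footprint is hypothetical, and the function names are invented for the example. It ignores per-node allocatable overhead and Pod bin-packing, which a real estimate would need to account for.

```go
package main

import (
	"fmt"
	"math"
)

// workersNeeded returns the smallest worker count whose aggregate RAM
// covers the estimated footprint (no overhead or bin-packing modelled).
func workersNeeded(footprintGB, ramPerWorkerGB float64) int {
	return int(math.Ceil(footprintGB / ramPerWorkerGB))
}

// clamp bounds a desired worker count to the operator's min/max.
func clamp(n, lo, hi int) int {
	if n < lo {
		return lo
	}
	if n > hi {
		return hi
	}
	return n
}

func main() {
	const ramPerWorkerGB = 8.0 // assumption: 16 GB across two cpx32 workers
	footprintGB := 17.0        // hypothetical footprint just past the static pool

	desired := workersNeeded(footprintGB, ramPerWorkerGB)
	fmt.Println("desired workers:", desired)
	fmt.Println("after min/max bound:", clamp(desired, 2, 10))
}
```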

How it wires

  • Helm subchart: upstream kubernetes/autoscaler/cluster-autoscaler, the vendor-neutral, multi-cloud cluster-autoscaler. The Hetzner cloud provider ships in the same upstream container image.
  • Hetzner token: read at HelmRelease apply time from flux-system/cloud-credentials.hcloud-token (the canonical Secret cloud-init writes per ADR-0001 §11.3 — same Secret consumed by Crossplane provider-hcloud + provider-config-hcloud).
  • Node bootstrap (issue #921): cluster-autoscaler 1.32.x's Hetzner provider requires either HCLOUD_CLUSTER_CONFIG (per-pool JSON, base64) or HCLOUD_CLOUD_INIT (cloud-init.yaml, base64) and FATALs at startup without one. This chart wires both via extraEnvSecrets against the rendered cluster-autoscaler/hetzner-node-config Secret. Per-Sovereign overlays populate the clusterAutoscalerHcloud.cloudInit value via Flux valuesFrom against flux-system/cloud-credentials.hcloud-cloud-init, which cloud-init stamps at Phase 0 with the base64 of the same worker cloud-init the Phase-0 worker fleet booted with.
  • Node group: a single canonical pool keyed off the Sovereign's worker SKU + region + cloud-init template. The pool's min is the operator's chosen worker count; max defaults to 10 (overridable per-Sovereign).
  • Scale-down: 10 minutes idle (cost-saving default).
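
As a rough sketch of the wiring above, the snippet below builds a node-group spec and a base64 cloud-init payload. The `min:max:instanceType:region:name` layout is my understanding of the upstream Hetzner provider's `--nodes` flag and should be verified against the pinned release. The pool name, region, and cloud-init content are hypothetical, and the helper names are invented for the example. The chart itself renders these through Helm values, not Go.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// nodeGroupSpec formats one pool as min:max:instanceType:region:name
// (assumed layout of the Hetzner provider's --nodes value).
func nodeGroupSpec(minNodes, maxNodes int, sku, region, pool string) (string, error) {
	if minNodes < 0 || maxNodes < minNodes {
		return "", fmt.Errorf("invalid bounds: min=%d max=%d", minNodes, maxNodes)
	}
	return fmt.Sprintf("%d:%d:%s:%s:%s", minNodes, maxNodes, sku, region, pool), nil
}

// encodeCloudInit produces the base64 form expected in HCLOUD_CLOUD_INIT.
func encodeCloudInit(cloudInit string) string {
	return base64.StdEncoding.EncodeToString([]byte(cloudInit))
}

func main() {
	// min = operator's chosen worker count, max = chart default of 10.
	spec, err := nodeGroupSpec(2, 10, "cpx32", "fsn1", "workers")
	if err != nil {
		panic(err)
	}
	fmt.Println("node group:", spec)
	fmt.Println("HCLOUD_CLOUD_INIT:", encodeCloudInit("#cloud-config\n"))
}
```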

What this blueprint does NOT do

  • It does not pre-create extra nodes. Phase 0 (tofu apply) only provisions the min worker count; cluster-autoscaler creates additional workers on-demand against the same Hetzner project.
  • It does not provision the OpenTofu node-pool template. That restructuring is tracked separately (see follow-up issue) — the MVP shipped in this PR pins the node-group config in chart values and assumes the existing single-pool topology.
  • It does not autoscale workloads. KEDA (event-driven workload autoscaling) and the kubernetes-builtin HPA (horizontal pod autoscaler) are layered on top; cluster-autoscaler handles the node dimension only.

Upstream pinning

Knob | Value | Notes
Chart | cluster-autoscaler (kubernetes/autoscaler) 9.46.6 | current stable on 2026-05-04
App | cluster-autoscaler 1.32.0 | matches k3s 1.31.x (within +/-1 minor of the Sovereign apiserver)
Cloud provider | hetzner | Built into upstream image
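
The "+/-1 minor" constraint on the App pin can be checked mechanically. The sketch below is illustrative only: `minorOf` and `withinSkew` are hypothetical helpers, and the parser accepts loose strings such as `1.31.x` because only the major and minor components are compared.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minorOf extracts major and minor from strings like "1.32.0" or "1.31.x".
func minorOf(v string) (major, minor int, err error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q has no minor component", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// withinSkew reports whether two versions share a major and differ by
// at most maxSkew minor releases.
func withinSkew(a, b string, maxSkew int) (bool, error) {
	aMaj, aMin, err := minorOf(a)
	if err != nil {
		return false, err
	}
	bMaj, bMin, err := minorOf(b)
	if err != nil {
		return false, err
	}
	diff := aMin - bMin
	if diff < 0 {
		diff = -diff
	}
	return aMaj == bMaj && diff <= maxSkew, nil
}

func main() {
	ok, err := withinSkew("1.32.0", "1.31.x", 1)
	fmt.Println(ok, err)
}
```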

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every value is runtime-configurable; cluster overlays in clusters/<sovereign>/ MAY override any of them without rebuilding the OCI artifact.