openova/platform
e3mrah 71e8101363
fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989) (#1991)
* fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989)

The t37 canonical walk on the nbg1-2 / hel1-1 secondary CPs surfaced a
second stuck-HR failure mode: helm-controller completes the install and
the HR's own `.status.history[0].status` flips to "deployed", but an
apiserver flap on the slow secondary CP loses the write that flips
`.status.conditions[type=Ready]` from Unknown to True. The existing
suspend-toggle recovery (issue #925) does NOT fix this, because
helm-controller's "release in storage" short-circuit returns yes on
every subsequent reconcile, so Ready is never re-evaluated.

This PR extends the stuckHelmReleaseRecovery CronJob with a second
detection branch:

  for hr where
    .status.conditions[type=Ready].status == "Unknown"
    AND age(Unknown) > stuckThreshold (default 5m)
    AND .status.history[0].status == "deployed"
    AND metadata.annotations["stuck-hr-recovery.openova.io/auto-corrected-at"] == ""
  → kubectl annotate hr stuck-hr-recovery.openova.io/auto-corrected-at=<RFC3339>
  → kubectl patch hr --subresource=status --type=merge
       status.conditions=[{type:Ready, status:True,
                           reason:ReconciliationSucceeded,
                           message:"auto-corrected from deployed-but-unknown-Ready by stuck-hr-recovery (TBD-A66)",
                           lastTransitionTime:<RFC3339>}]
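
The detection conditions above can be sketched as a pure predicate over an HR object that has already been fetched as JSON. This is a minimal illustration, not the CronJob's actual script: the function names are hypothetical, and measuring age(Unknown) from the Ready condition's `lastTransitionTime` is an assumption.

```python
from datetime import datetime, timedelta, timezone

AUDIT_ANNOTATION = "stuck-hr-recovery.openova.io/auto-corrected-at"
STUCK_THRESHOLD = timedelta(minutes=5)  # default stuckThreshold from the description


def parse_rfc3339(ts: str) -> datetime:
    """Parse a Kubernetes-style timestamp such as 2026-05-20T00:00:00Z."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def is_deployed_but_unknown(hr: dict, now: datetime) -> bool:
    """Return True when the HR matches the second detection branch."""
    status = hr.get("status", {})
    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        None,
    )
    if ready is None or ready.get("status") != "Unknown":
        return False
    # Assumption: age(Unknown) is measured from lastTransitionTime.
    since = ready.get("lastTransitionTime")
    if not since or now - parse_rfc3339(since) <= STUCK_THRESHOLD:
        return False
    history = status.get("history") or []
    if not history or history[0].get("status") != "deployed":
        return False
    # An existing audit annotation means the HR was already corrected.
    annotations = hr.get("metadata", {}).get("annotations") or {}
    return annotations.get(AUDIT_ANNOTATION, "") == ""
```

Keeping the check free of API calls makes it easy to replay against synthetic HR objects.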

Safety / idempotency:
  - Annotation acts as both audit trail AND idempotency guard. Re-runs
    on an already-corrected HR skip immediately.
  - If the status patch fails, the annotation is rolled back so the
    next CronJob run re-attempts.
  - Guardrail unchanged: >10 acted-on HRs in a single run → exit 1 +
    operator alert.
  - The 10-HR guardrail spans BOTH branches combined.
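
The annotate, patch, roll-back sequence and the combined guardrail can be sketched as below. This is an illustration under assumptions, not the shipped script: the three callables stand in for the `kubectl` invocations, all names are hypothetical, and whether a rolled-back attempt counts toward the guardrail is left to the caller.

```python
from typing import Callable

MAX_ACTED_ON = 10  # guardrail from the description: more than 10 per run fails the job


def correct_hr(
    name: str,
    annotate: Callable[[str], None],
    patch_status: Callable[[str], None],
    remove_annotation: Callable[[str], None],
) -> bool:
    """Annotate first, then patch status; undo the annotation if the patch fails."""
    annotate(name)
    try:
        patch_status(name)
    except Exception:
        # Roll back so the next CronJob run sees no idempotency marker
        # and re-attempts the correction.
        remove_annotation(name)
        return False
    return True


def guardrail_exit_code(acted_branch_a: int, acted_branch_b: int,
                        limit: int = MAX_ACTED_ON) -> int:
    """Return 1 when both branches together acted on more than `limit` HRs."""
    return 1 if acted_branch_a + acted_branch_b > limit else 0
```

Passing the side effects in as callables keeps the ordering and rollback logic testable without a cluster.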

RBAC additions:
  - helmreleases/status with verbs [patch, update] — status subresource
    is a separate RBAC target in Kubernetes. Without this rule
    `kubectl patch --subresource=status` returns 403.
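
Why the extra rule is needed can be shown with a simplified model of RBAC rule matching. This sketch only does exact matching on `resource` or `resource/subresource` and ignores wildcards, API groups, and resource names. `BASE_RULES` is a hypothetical stand-in for the pre-existing ClusterRole, not its actual contents.

```python
def rule_allows(rule: dict, verb: str, resource: str, subresource: str = "") -> bool:
    """Exact-match check: a subresource request needs 'resource/subresource' listed."""
    target = f"{resource}/{subresource}" if subresource else resource
    return verb in rule["verbs"] and target in rule["resources"]


def allowed(rules: list, verb: str, resource: str, subresource: str = "") -> bool:
    """A request is allowed if any single rule allows it."""
    return any(rule_allows(r, verb, resource, subresource) for r in rules)


# Hypothetical pre-existing rule: covers the main resource only.
BASE_RULES = [{"resources": ["helmreleases"], "verbs": ["get", "list", "patch"]}]

# The rule described above: status subresource with patch and update.
STATUS_RULE = {"resources": ["helmreleases/status"], "verbs": ["patch", "update"]}

if __name__ == "__main__":
    # Patching the main resource does not imply patching its status.
    print(allowed(BASE_RULES, "patch", "helmreleases", "status"))
    print(allowed(BASE_RULES + [STATUS_RULE], "patch", "helmreleases", "status"))
```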

Validation:
  - tests/leader-election-and-recovery.sh: 6 → 7 cases. The existing 6
    issue #925 cases still PASS; new Case 7 covers TBD-A66 by checking
    that the script contains the history[0].status check, the
    status-subresource patch verb, the audit annotation key, the
    helmreleases/status ClusterRole verb, and the operator-greppable
    "auto-corrected from deployed-but-unknown-Ready" audit string.
  - Mock JSONPath replay against 4 synthetic HRs: branch B routes
    deployed-but-unknown to status patch, branch A still handles
    pending-install via the secret check, idempotency annotation
    correctly skips re-run, healthy Ready=True HR is no-op.

Chart bump:
  - platform/flux/chart/Chart.yaml: 1.2.2 → 1.2.3
  - clusters/_template/bootstrap-kit/03-flux.yaml: bp-flux HR pin
    1.2.2 → 1.2.3 (the existing pin for the omantel/otech live clusters
    stays at 1.1.3; those clusters are on the pre-#925 baseline).

Closure note:
  - Refs #1989 (not Closes — closure happens when the t37 canonical
    walk reaches handover successfully on a fresh prov).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-flux): bump blueprint.yaml spec.version 1.2.2 → 1.2.3 (lockstep with Chart.yaml)

Companion to the TBD-A66 / #1989 bump. The CI gate
`TestBootstrapKit_BlueprintVersionLockstepSweep` (TBD-A20, #1856)
asserts blueprint.yaml spec.version == chart/Chart.yaml version per
platform/*. The parent commit missed this because the older bp-flux
bumps (1.2.1 → 1.2.2 etc.) were made before the lockstep gate existed
and so needed no companion bump.
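
The invariant the gate enforces can be sketched as follows. This is not the `TestBootstrapKit_BlueprintVersionLockstepSweep` implementation; it is a minimal illustration that assumes both files have already been parsed into dicts, with hypothetical function names.

```python
def lockstep_ok(blueprint: dict, chart: dict) -> bool:
    """True when blueprint.yaml spec.version equals Chart.yaml version."""
    return blueprint.get("spec", {}).get("version") == chart.get("version")


def lockstep_mismatches(platform: dict) -> list:
    """Return the sorted platform/* names whose two versions differ.

    `platform` maps a directory name to a (blueprint, chart) pair of parsed dicts.
    """
    return sorted(
        name
        for name, (blueprint, chart) in platform.items()
        if not lockstep_ok(blueprint, chart)
    )
```

With the parent commit alone, a sweep of this shape would flag `flux`, since Chart.yaml had moved to 1.2.3 while blueprint.yaml still read 1.2.2.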

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: claude-bot <claude-bot@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:38:25 +04:00
alloy fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
anthropic-adapter feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288) 2026-04-30 19:37:19 +04:00
bge feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283) 2026-04-30 18:37:07 +04:00
bp-dmz-vcluster fix(charts): resolve bp-dmz-vcluster duplicate-name pseudo-drift (Closes A6c) (#1771) 2026-05-18 20:31:41 +04:00
bp-mgmt-vcluster fix(bp-mgmt-vcluster): namespace PSS baseline (was restricted) — A4 (#1535) 2026-05-16 18:06:59 +04:00
bp-rtz-vcluster fix(blueprints): vcluster charts smoke-render annotation = "default-off" (#1527) 2026-05-16 16:15:51 +04:00
bp-vcluster-helmrepo fix(bp-vcluster-helmrepo): install vclusters.vcluster.com CRD on fresh prov (Refs #1945) (#1964) 2026-05-19 19:25:34 +04:00
cert-manager fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855) 2026-05-19 00:34:48 +04:00
cert-manager-dynadot-webhook fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
cert-manager-powerdns-webhook fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
cilium fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855) 2026-05-19 00:34:48 +04:00
clickhouse docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
cluster-autoscaler-hcloud fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
cnpg fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
cnpg-pair fix(cnpg-pair tests): exclude helm-test hook resources from non-test count (#1225) 2026-05-09 23:51:08 +04:00
coraza fix(bp-coraza,bp-syft-grype): add common library subchart to satisfy hollow-chart gate (#220) 2026-04-30 06:15:28 +02:00
crossplane fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819) 2026-05-04 22:32:49 +04:00
crossplane-claims fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
debezium docs(pass-32): registry-DNS sweep — harbor.<domain> across 9 component READMEs 2026-04-27 22:36:39 +02:00
external-dns fix(bp-external-dns): apiserver Endpoints sync timeout — Cilium kube-apiserver entity required (closes #770) (#771) 2026-05-04 19:27:17 +04:00
external-secrets fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
external-secrets-stores fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
failover-controller refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171) 2026-04-29 08:51:09 +02:00
falco fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
ferretdb docs(pass-11b): retry banners on failover-controller/trivy/clickhouse/ferretdb (Edit needed Read first) 2026-04-27 21:45:56 +02:00
flink docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
flux fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989) (#1991) 2026-05-20 00:38:25 +04:00
gateway-api fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592) 2026-05-02 15:20:05 +04:00
gitea fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
grafana fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
guacamole deploy(bp-guacamole): bump bootstrap-kit pin 0.1.25 -> 0.1.26 (auto, Refs TBD-A6) 2026-05-18 22:20:35 +00:00
harbor deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.19 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 2) 2026-05-19 04:03:38 +00:00
hcloud-ccm fix(infra): hcloud-CCM + cilium DNS hardening + chart-side gitea token — qa-loop iter-12 Fix #54 (#1281) 2026-05-10 11:56:50 +04:00
hcloud-csi fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
iceberg docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
k8s-ws-proxy fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
keda docs(pass-10): banners on 7 more components + opentofu active-active drift fix 2026-04-27 21:43:45 +02:00
keycloak fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
knative feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290) 2026-04-30 19:37:38 +04:00
kserve feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290) 2026-04-30 19:37:38 +04:00
kyverno fix(bp-kyverno): install 19 compliance ClusterPolicies on fresh Sovereign (TBD-V3, Closes #1929) (#1933) 2026-05-19 12:20:34 +04:00
langfuse fix(bp-langfuse): drop apostrophe from description to clear GHCR 500 (resolves #215) (#278) 2026-04-30 17:31:51 +04:00
librechat feat(charts): bp-librechat wrapper chart (closes #275) (#287) 2026-04-30 18:56:59 +04:00
litmus feat(platform): security umbrellas (falco/kyverno/trivy/sigstore/syft-grype/reloader/coraza/litmus) (#216) 2026-04-30 06:07:38 +02:00
livekit feat(charts): bp-openmeter (CH-less) + bp-livekit + bp-matrix wrapper charts (closes #272 #273 #274) (#289) 2026-04-30 19:37:28 +04:00
llm-gateway feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288) 2026-04-30 19:37:19 +04:00
loki feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214) 2026-04-29 22:11:04 +02:00
matrix deploy(bp-matrix): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1) 2026-05-19 04:03:36 +00:00
milvus docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
mimir fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
nats-jetstream feat(bp-nats-jetstream): land Stream + KV CR templates (slice H4, #1095) (#1114) 2026-05-08 22:32:54 +04:00
nemo-guardrails feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283) 2026-04-30 18:37:07 +04:00
neo4j docs(pass-12): role-in-Catalyst banners on 11 AI/ML Application Blueprints 2026-04-27 21:47:45 +02:00
netbird fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
network-policies feat(bp-network-policies): land default-deny CCNP + system-namespace + DNS allow templates (slice H8, #1095) (#1116) 2026-05-08 22:40:30 +04:00
newapi deploy(bp-newapi): bump bootstrap-kit pin 1.4.29 -> 1.4.30 (auto, Refs TBD-A6) 2026-05-19 14:18:00 +00:00
openbao fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
openclaw fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
openmeter deploy(bp-openmeter): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1) 2026-05-19 04:03:39 +00:00
openova-flow-emitter/chart chore(deploy): bump openova-flow-adapter-flux image to 00eeff2 [skip ci] 2026-05-18 12:17:03 +00:00
openova-flow-server/chart chore(deploy): bump openova-flow-server image to fab091f [skip ci] 2026-05-18 13:23:07 +00:00
opensearch docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
opentelemetry feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214) 2026-04-29 22:11:04 +02:00
opentelemetry-operator feat(bp-opentelemetry-operator): scaffold operator + default Instrumentation CR (slice H5, #1095) (#1121) 2026-05-08 23:06:29 +04:00
opentofu refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171) 2026-04-29 08:51:09 +02:00
powerdns fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
qa-app fix(bp-qa-app): annotate no-upstream to satisfy hollow-chart guard (#1261) 2026-05-10 04:51:13 +04:00
reflector/chart fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554) 2026-05-02 12:17:51 +04:00
reloader fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (closes #715) (#716) 2026-05-04 02:04:26 +04:00
sandbox/chart deploy: bump sandbox-pty-server image to d55ed45 2026-05-19 20:38:16 +00:00
sealed-secrets fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819) 2026-05-04 22:32:49 +04:00
seaweedfs fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
self-sovereign-cutover fix(bp-gitea+bp-harbor): shorten mirror interval to 5m for post-cutover freshness (TBD-A37, Closes #1899) (#1916) 2026-05-19 10:42:11 +04:00
sigstore feat(platform): security umbrellas (falco/kyverno/trivy/sigstore/syft-grype/reloader/coraza/litmus) (#216) 2026-04-30 06:07:38 +02:00
spire fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819) 2026-05-04 22:32:49 +04:00
stalwart docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay 2026-04-28 10:23:46 +02:00
stalwart-sovereign feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931) 2026-05-05 14:20:16 +04:00
stalwart-tenant fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909) 2026-05-19 07:57:12 +04:00
strimzi docs(pass-35): completion sweep for surviving DNS placeholders (8 components) 2026-04-27 22:46:16 +02:00
stunner feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290) 2026-04-30 19:37:38 +04:00
syft-grype fix(bp-coraza,bp-syft-grype): add common library subchart to satisfy hollow-chart gate (#220) 2026-04-30 06:15:28 +02:00
tempo feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214) 2026-04-29 22:11:04 +02:00
temporal feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288) 2026-04-30 19:37:19 +04:00
trivy fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
valkey fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
velero fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
vllm feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283) 2026-04-30 18:37:07 +04:00
vpa fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858) 2026-05-19 01:04:22 +04:00
wordpress-tenant feat(catalog-seed): add bp-cnpg-pair Blueprint + wordpress-tenant active-hot-standby mode (Refs TBD-E8b, TBD-B31) (#1717) 2026-05-18 19:08:05 +04:00