* fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989)
The t37 canonical walk on the nbg1-2 / hel1-1 secondary CPs surfaced a
second stuck-HR failure mode: helm-controller completes the install and
the HR's own `.status.history[0].status` flips to "deployed", but an
apiserver flap on the slow secondary CP loses the write that flips
`.status.conditions[type=Ready]` from Unknown to True. The existing
suspend-toggle recovery (issue #925) does NOT fix this because
helm-controller's "release in storage" short-circuit returns yes on
every subsequent reconcile, so Ready is never re-evaluated.
This PR extends the stuckHelmReleaseRecovery CronJob with a second
detection branch:
  for hr where
        .status.conditions[type=Ready].status == "Unknown"
    AND age(Unknown) > stuckThreshold (default 5m)
    AND .status.history[0].status == "deployed"
    AND metadata.annotations["stuck-hr-recovery.openova.io/auto-corrected-at"] == ""
  → kubectl annotate hr stuck-hr-recovery.openova.io/auto-corrected-at=<RFC3339>
  → kubectl patch hr --subresource=status --type=merge
      status.conditions=[{type:Ready, status:True,
                          reason:ReconciliationSucceeded,
                          message:"auto-corrected from deployed-but-unknown-Ready by stuck-hr-recovery (TBD-A66)",
                          lastTransitionTime:<RFC3339>}]
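The four-part predicate above can be sketched in Python for readers who want to replay it against HR JSON. This is an illustration, not the CronJob script itself: the function name `is_deployed_but_unknown` is hypothetical, and measuring age(Unknown) from the Ready condition's `lastTransitionTime` is an assumption.

```python
from datetime import datetime, timedelta, timezone

AUDIT_KEY = "stuck-hr-recovery.openova.io/auto-corrected-at"


def is_deployed_but_unknown(hr, now, stuck_threshold=timedelta(minutes=5)):
    """Return True when an HR dict matches all four branch-B conditions."""
    status = hr.get("status") or {}
    ready = next((c for c in status.get("conditions") or []
                  if c.get("type") == "Ready"), None)
    # 1. Ready must currently be Unknown.
    if ready is None or ready.get("status") != "Unknown":
        return False
    # 2. It must have been Unknown for longer than the threshold.
    #    Assumption: age is measured from lastTransitionTime.
    stamp = ready.get("lastTransitionTime")
    if not stamp:
        return False
    since = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    if now - since <= stuck_threshold:
        return False
    # 3. The newest history entry must report a completed release.
    history = status.get("history") or []
    if not history or history[0].get("status") != "deployed":
        return False
    # 4. The audit annotation must be absent or empty (idempotency guard).
    annotations = (hr.get("metadata") or {}).get("annotations") or {}
    return not annotations.get(AUDIT_KEY)
```

An HR that fails any one of the four checks is left alone, which is what keeps healthy Ready=True releases and already-corrected releases out of this branch.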
Safety / idempotency:
- Annotation acts as both audit trail AND idempotency guard. Re-runs
on an already-corrected HR skip immediately.
- If the status patch fails, the annotation is rolled back so the
next CronJob run re-attempts.
- Guardrail unchanged: >10 acted-on HRs in a single run → exit 1 +
operator alert.
- The 10-HR guardrail spans BOTH branches combined.
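A minimal Python sketch of the annotate-then-patch ordering with rollback, plus the combined guardrail, may help reviewers reason about the failure paths. All names here (`correct_one`, `enforce_guardrail`, and the three injected callables) are hypothetical stand-ins for the kubectl calls, and checking the guardrail as a single up-front count is an assumption about where the check sits.

```python
MAX_ACTED_ON = 10  # more than this many acted-on HRs in one run aborts


class GuardrailExceeded(RuntimeError):
    """Raised when one run would act on too many HRs (maps to exit 1)."""


def enforce_guardrail(branch_a_count, branch_b_count):
    # The limit applies to both detection branches combined.
    total = branch_a_count + branch_b_count
    if total > MAX_ACTED_ON:
        raise GuardrailExceeded(f"{total} HRs acted on, limit is {MAX_ACTED_ON}")


def correct_one(name, annotate, patch_status, remove_annotation):
    """Annotate first, then patch status; undo the annotation on failure.

    Returns True when the status patch succeeded, False when it failed and
    the annotation was rolled back so that a later run can retry.
    """
    annotate(name)
    try:
        patch_status(name)
    except Exception:
        remove_annotation(name)
        return False
    return True
```

Writing the annotation first means a crash between the two steps leaves an audit trail; rolling it back on a failed patch keeps the idempotency guard from blocking the retry.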
RBAC additions:
- helmreleases/status with verbs [patch, update]. The status
subresource is a separate RBAC target in Kubernetes; without this rule,
`kubectl patch --subresource=status` returns 403.
Validation:
- tests/leader-election-and-recovery.sh: 6 → 7 cases. The existing 6
issue #925 cases still PASS; new Case 7 covers TBD-A66 by checking that
the script contains the history[0].status check, the status-subresource
patch verb, the audit annotation key, the helmreleases/status
ClusterRole verb, and the operator-greppable audit string
"auto-corrected from deployed-but-unknown-Ready".
- Mock JSONPath replay against 4 synthetic HRs: branch B routes
deployed-but-unknown to the status patch, branch A still handles
pending-install via the secret check, the idempotency annotation
correctly skips the re-run, and a healthy Ready=True HR is a no-op.
Chart bump:
- platform/flux/chart/Chart.yaml: 1.2.2 → 1.2.3
- clusters/_template/bootstrap-kit/03-flux.yaml: bp-flux HR pin
1.2.2 → 1.2.3. The existing pin for the omantel/otech live clusters
stays at 1.1.3; those clusters are on the pre-#925 baseline.
Closure note:
- Refs #1989 (not Closes: closure happens once the t37 canonical
walk reaches handover successfully on a fresh prov).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux): bump blueprint.yaml spec.version 1.2.2 → 1.2.3 (lockstep with Chart.yaml)
Companion to the TBD-A66 / #1989 bump. The CI gate
`TestBootstrapKit_BlueprintVersionLockstepSweep` (TBD-A20, #1856)
asserts blueprint.yaml spec.version == chart/Chart.yaml version per
platform/*. This was missed in the parent commit because older bp-flux
bumps (1.2.1 → 1.2.2 etc.) were made before the lockstep gate existed
and did not need a companion bump.
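The lockstep assertion can be sketched in Python as below. This is not the real gate: the function name `lockstep_violations` is hypothetical, and the input is assumed to be already-parsed YAML mappings keyed by platform name.

```python
def lockstep_violations(platforms):
    """Return platform names whose blueprint and chart versions differ.

    `platforms` maps a platform name to a (blueprint, chart) pair of parsed
    YAML mappings (an assumed input shape for this sketch).
    """
    mismatched = []
    for name, (blueprint, chart) in sorted(platforms.items()):
        blueprint_version = (blueprint.get("spec") or {}).get("version")
        if blueprint_version != chart.get("version"):
            mismatched.append(name)
    return mismatched
```

With the parent commit alone, the flux entry would carry spec.version 1.2.2 against chart version 1.2.3 and be reported; after this companion bump both read 1.2.3 and the list is empty.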
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: claude-bot <claude-bot@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>