fix(bp-sme): wait for gitea user-bootstrap before provisioning starts (Closes #2002) (#2008)

TBD-V11 / Issue #2002. On t38 fresh prov, sme/provisioning Pod logged
`HTTP 401 user does not exist [uid: 0, name: ""]` on the first tenant
Org CR creation. Root cause: provisioning Pod started with the chart's
first-install placeholder GITHUB_TOKEN (the Gitea admin password mirrored
verbatim by provisioning-github-token.yaml; enough to clear
CreateContainerConfigError but NOT a valid Gitea API token). Step 09 of
bp-self-sovereign-cutover later mints a real API token, patches the
Secret, and rollout-restarts the Pod, but the FIRST tenant journey always
401'd because the Pod was already serving with the bad placeholder.

Approach (B): add an init container `wait-for-cutover-token` to the
SME provisioning Deployment that polls the Secret for the cutover
annotation `catalyst.openova.io/token-source: self-sovereign-cutover-
step-09` (stamped by Step 09 alongside the minted token bytes). The
Pod stays in Init:0/1 until Step 09 has actually completed, then the
main container starts with a guaranteed-valid token. Default poll
budget = 10s × 180 = 1800s (30 min; covers the ~18 min Hetzner
cold-start plus 12 min of slack).
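
The gating logic described above can be sketched as a small POSIX shell helper that runs without a cluster. This is an illustration only, not the chart's script: `wait_for_value`, the stub probes, and `no_sleep` are hypothetical names, and the sketch counts attempts (timeout / interval) where the chart's init container compares against a wall-clock deadline. On a real cluster the probe would be the `kubectl ... -o jsonpath` query shown in the provisioning.yaml hunk.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): poll a probe until it prints
# the expected value, or fail closed once the attempt budget is spent.
# The probe and the sleep command are injectable so the loop can be tested
# without kubectl or real waiting.
wait_for_value() {
  expected=$1            # value the probe must print
  interval=$2            # seconds between attempts
  timeout=$3             # total budget in seconds
  probe=$4               # command that prints the current value
  sleeper=${5:-sleep}    # command used to wait between attempts

  attempts=$(( timeout / interval ))   # 1800 / 10 = 180 with the defaults
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A failing or empty probe counts as "not ready yet".
    current=$($probe 2>/dev/null || true)
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    i=$(( i + 1 ))
    $sleeper "$interval"
  done
  return 1   # fail closed: never report success on the placeholder state
}

# Stub probes standing in for the Secret annotation lookup (assumptions).
probe_ready()       { printf '%s' 'self-sovereign-cutover-step-09'; }
probe_placeholder() { printf ''; }
no_sleep()          { :; }

wait_for_value self-sovereign-cutover-step-09 10 1800 probe_ready no_sleep \
  && echo "gate opened"
wait_for_value self-sovereign-cutover-step-09 10 1800 probe_placeholder no_sleep \
  || echo "timed out, failing closed"
```

Returning non-zero on timeout mirrors the design choice stated in the commit: a Pod held in Init is preferable to one that serves 401s.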

Why NOT HelmRelease.dependsOn:
- Per Principle #14, HR.dependsOn → Kustomization is silently ignored.
- bp-self-sovereign-cutover HR is dormant + disableWait:true: it goes
  Ready=True at install BEFORE Step 09's Job actually runs. Adding it
  to bp-catalyst-platform.dependsOn would buy nothing.
- Pod-level init gating waits on the actual condition (Secret
  annotation set by Step 09), not on a proxy.

Why NOT change bp-self-sovereign-cutover trigger order:
- Step 09 must run AFTER bp-catalyst-platform creates the Secret
  (otherwise the patch has no target). Reordering would break the
  inverse dependency.

Why NOT a Job that bootstraps the user upfront:
- Step 09 already mints the token; we don't need a second bootstrap.
- The bug is timing, not absence of bootstrap.

Files changed:
- products/catalyst/chart/templates/sme-services/provisioning.yaml:
  add initContainers block gated on
  smeServices.provisioning.waitForCutoverToken.enabled (default true).
  Re-uses existing `provisioning` SA (already has secrets get/list/watch
  in `sme` ns via sme-provisioning ClusterRole — no new RBAC).
- products/catalyst/chart/values.yaml: add
  smeServices.provisioning.waitForCutoverToken.{enabled,image,
  intervalSeconds,timeoutSeconds} block.
- products/catalyst/chart/Chart.yaml: bump 1.4.213 → 1.4.214 with
  full TBD-V11 changelog entry.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  HelmRelease pin 1.4.213 → 1.4.214 (chart bump only delivers the fix
  when the pin moves — TBD-A68 / 1.4.213 precedent).
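
Because the fix only ships when the HelmRelease pin moves with the chart version, a pre-merge check along the following lines may help. This is a hedged sketch: `first_version` and `pins_match` are hypothetical helpers, and they assume the first uncommented `version:` key in each file is the one that matters (the chart version in Chart.yaml, the chart pin in the HelmRelease).

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first uncommented `version:`
# key in a YAML file. Comment lines start with `#`, so their first field is
# never `version:`; `appVersion:` is skipped for the same reason.
first_version() {
  awk '$1 == "version:" { print $2; exit }' "$1" | tr -d "\"'"
}

# Hypothetical helper: succeed only when both files carry the same version.
pins_match() {
  [ "$(first_version "$1")" = "$(first_version "$2")" ]
}

# Example invocation against the two paths named in this commit (commented
# out so the sketch runs outside the repository):
# pins_match products/catalyst/chart/Chart.yaml \
#   clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml \
#   || echo "chart version and HelmRelease pin differ" >&2
```

A text match like this is deliberately naive; a YAML-aware tool would be more robust if either file ever gains another `version:` key ahead of the intended one.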

Validation:
- `helm template` Sovereign-mode render shows the init container in
  the provisioning Deployment with kubectl-poll loop.
- Default-values smoke render unaffected (the Deployment is gated on
  ingress.marketplace.enabled=true; smoke uses the defaults, where it
  is false).
- `helm lint products/catalyst/chart/` passes.
- Contabo-Zero render path safe by construction (chart only renders
  the Deployment when ingress.marketplace.enabled=true; contabo
  doesn't enable marketplace via this chart).
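
The render checks above could be scripted roughly as follows. This is a sketch under assumptions: `has_cutover_gate` is a hypothetical helper, the release name `catalyst` is made up, and a real Sovereign-mode render may need more values than the single `--set` shown in the comment.

```shell
#!/bin/sh
# Hypothetical helper: read a rendered manifest on stdin and succeed only if
# it declares the wait-for-cutover-token init container.
has_cutover_gate() {
  # -e is needed because the pattern itself begins with a dash.
  grep -q -e '- name: wait-for-cutover-token'
}

# Intended use on a checkout with helm installed (not executed here):
# helm template catalyst products/catalyst/chart/ \
#   --set ingress.marketplace.enabled=true | has_cutover_gate

# Self-contained demonstration with an inline manifest fragment.
printf 'initContainers:\n  - name: wait-for-cutover-token\n' \
  | has_cutover_gate && echo "gate rendered"
```

Running the same helper against the default-values render should fail, which matches the claim that the smoke render is unaffected.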

Closes #2002. Refs #1829 (D29 tenant materialisation gate).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-20 02:43:01 +04:00 committed by GitHub
parent a090477aa1
commit cdd7eac20a
GPG Key ID: B5690EEEBB952194
4 changed files with 185 additions and 3 deletions

@@ -746,7 +746,24 @@ spec:
# PR; the code fix is upstream in #1910. The CI auto-bump-
# images job skipped controller images (TBD-A69 follow-up
# tracks closing that gap).
version: 1.4.213
# 1.4.214 (TBD-V11 / #2002): add init container
# `wait-for-cutover-token` to the SME provisioning Deployment.
# The Pod now blocks on Secret sme/provisioning-github-token
# carrying `catalyst.openova.io/token-source:
# self-sovereign-cutover-step-09` (set by Step 09 of bp-self-
# sovereign-cutover when the real Gitea API token is minted
# + patched). Pre-fix on t38 the Pod started with the
# first-install placeholder (gitea admin password) and the
# FIRST tenant Org CR creation hit 401 `user does not exist
# [uid: 0, name: ""]` from Gitea. Pod-level init gating is
# the correct waitpoint — Principle #14: HelmRelease.dependsOn
# → Kustomization is silently ignored, and the cutover HR is
# dormant + disableWait:true so HR-level dependsOn would
# resolve Ready=True before Step 09 ever runs. Configurable
# via .Values.smeServices.provisioning.waitForCutoverToken.*
# (default enabled on Sovereigns; contabo overlay flips
# enabled=false because Step 09 never runs on Catalyst-Zero).
version: 1.4.214
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform

@@ -1,5 +1,51 @@
apiVersion: v2
name: bp-catalyst-platform
# 1.4.214 — TBD-V11 / #2002 (2026-05-20): add init container
# `wait-for-cutover-token` to the SME provisioning Deployment so the
# Pod does NOT accept tenant requests until bp-self-sovereign-cutover
# Step 09 (gitea-token-mint) has minted a real Gitea API token and
# patched Secret `sme/provisioning-github-token` with the annotation
# `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`.
# Pre-fix on t38 (2026-05-19): provisioning Pod started with the
# first-install placeholder (gitea-admin-secret.password) under
# GITHUB_TOKEN, tenant Org CR creation fired BEFORE Step 09 had
# patched, gitea returned `HTTP 401 user does not exist [uid: 0,
# name: ""]`, Org CR stuck at Ready=False/GiteaOrgFailed. Step 09's
# `rollout restart deploy/provisioning` eventually recovers — but
# the FIRST tenant journey after install always 401'd.
#
# Fix design rationale: HelmRelease.dependsOn between bp-catalyst-
# platform and bp-self-sovereign-cutover would NOT solve this — the
# cutover HR is dormant (`disableWait: true`) and goes Ready=True at
# install time, BEFORE Step 09's Job actually runs. Per Inviolable
# Principle #14 we also can't depend on Kustomizations from HRs.
# Pod-level init gating is the only correct waitpoint: the init
# container kubectl-polls the Secret in `sme` ns for the cutover
# annotation, exits 0 once present, and the main provisioning
# container starts with a guaranteed-valid token.
#
# Configurable knobs in values.yaml under
# `smeServices.provisioning.waitForCutoverToken`:
# enabled (true default; set false on Catalyst-Zero/contabo where
# the SealedSecret-provisioned token carries no cutover annotation
# and Step 09 never runs)
# image (alpine/k8s:1.31.4 — tracks platform/self-sovereign-cutover
# values.yaml so platform-wide bumps land in one place)
# intervalSeconds (10s default)
# timeoutSeconds (1800s = 30m default; covers the slowest cutover
# observed on Hetzner cold-start ~18 min with 12 min safety margin)
#
# RBAC: re-uses the existing `provisioning` ServiceAccount in `sme`,
# which already carries `secrets: [get, list, watch]` via the
# `sme-provisioning` ClusterRole. No new RBAC.
#
# Validation: `helm template` shows the initContainers block on
# Sovereign installs; smoke render unaffected on contabo (where
# `waitForCutoverToken.enabled: false` is the canonical overlay).
# On t38 (or fresh prov) the sme/provisioning Pod stays in
# `Init:0/1` until Step 09 stamps the annotation, then starts the
# main container with the minted token; first tenant Org CR
# materialises Ready=True/Materialized without retry.
# 1.4.213 — TBD-A68 / #1997 (2026-05-20): bump organization-controller
# image pin 72e3f08 → c9b58ea so the chart actually ships PR #1910's
# gitea-client fix (POST /api/v1/orgs instead of the admin-only
@@ -1851,8 +1897,8 @@ name: bp-catalyst-platform
# was already shipped on 1.4.197 (PR #1820 lineage); this completes
# the data-layer side so the dropdown finally appears on multi-region
# Sovereigns. Refs #1821, DoD D20.
version: 1.4.213
appVersion: 1.4.213
version: 1.4.214
appVersion: 1.4.214
# 1.4.183 — fix(httproute): omit default sectionName so multi-zone
# Sovereigns attach via Cilium Gateway hostname matcher (Closes #1884,
# TBD-A30). Pre-1.4.183 every catalyst-system HTTPRoute pinned

@@ -147,6 +147,97 @@ spec:
serviceAccountName: provisioning
imagePullSecrets:
- name: ghcr-pull
# ── Init container: wait-for-cutover-token (TBD-V11, issue #2002) ──
# The provisioning-github-token Secret has TWO lifecycle states:
# 1. First-install bootstrap (this chart's
# provisioning-github-token.yaml writes the Gitea admin PASSWORD
# under GITHUB_TOKEN — enough to boot the Pod past
# CreateContainerConfigError but NOT a valid Gitea API token).
# 2. Cutover-owned (bp-self-sovereign-cutover Step 09 mints a real
# Gitea API token via POST /api/v1/users/{user}/tokens and
# patches this Secret + stamps the annotation
# `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`).
#
# Pre-fix on t38 (2026-05-19): provisioning Pod started at state (1),
# tenant Org CR creation fired BEFORE Step 09 had patched the Secret,
# gitea API call sent `Authorization: token <admin_password>`, Gitea
# 401'd with `user does not exist [uid: 0, name: ""]`, Org CR stuck
# at Ready=False/GiteaOrgFailed. Step 09's `rollout restart deploy/
# provisioning` recovers eventually, but the Pod accepting traffic
# in state (1) means the FIRST tenant request after install always
# fails until the rollout completes.
#
# Fix: gate the provisioning Pod on the Secret carrying the cutover
# annotation. Until Step 09 has minted + patched, the init container
# polls every {{ .Values.smeServices.provisioning.waitForCutoverToken.intervalSeconds | default 10 }}s
# (up to {{ .Values.smeServices.provisioning.waitForCutoverToken.timeoutSeconds | default 1800 }}s)
# and only exits 0 once `catalyst.openova.io/token-source` equals
# `self-sovereign-cutover-step-09` on Secret
# `{{ .Values.smeServices.provisioning.githubToken.secretName | default "provisioning-github-token" }}`
# in this namespace. On Catalyst-Zero (contabo) the SealedSecret-
# provisioned token carries no cutover annotation — operators set
# `.Values.smeServices.provisioning.waitForCutoverToken.enabled: false`
# via clusters/contabo-mkt/ overlay so the init container is a no-op
# there (defaults to true; Sovereigns inherit the gate).
#
# Per Inviolable Principle #14: this is Pod-level gating via init
# container, NOT HelmRelease.dependsOn → Kustomization (which Flux
# silently ignores). The cutover HR is dormant + disableWait:true so
# an HR-level dependsOn would resolve Ready=True at install time,
# BEFORE Step 09 actually runs — that approach buys nothing.
# Pod-level init gating waits on the actual condition (Secret
# annotation), not on a proxy.
{{- if .Values.smeServices.provisioning.waitForCutoverToken.enabled }}
initContainers:
- name: wait-for-cutover-token
image: {{ .Values.smeServices.provisioning.waitForCutoverToken.image | default "harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4" | quote }}
imagePullPolicy: IfNotPresent
env:
- name: TARGET_NAMESPACE
value: {{ .Values.smeServices.provisioning.gitToken.destNamespace | default "sme" | quote }}
- name: SECRET_NAME
value: {{ .Values.smeServices.provisioning.githubToken.secretName | default "provisioning-github-token" | quote }}
- name: TOKEN_SOURCE_ANNOTATION_VALUE
value: "self-sovereign-cutover-step-09"
- name: INTERVAL_SECONDS
value: {{ .Values.smeServices.provisioning.waitForCutoverToken.intervalSeconds | default 10 | quote }}
- name: TIMEOUT_SECONDS
value: {{ .Values.smeServices.provisioning.waitForCutoverToken.timeoutSeconds | default 1800 | quote }}
command: ["/bin/sh", "-c"]
args:
- |
set -eu
echo "[wait-for-cutover-token] gating on Secret ${TARGET_NAMESPACE}/${SECRET_NAME} carrying annotation catalyst.openova.io/token-source=${TOKEN_SOURCE_ANNOTATION_VALUE}"
deadline=$(( $(date +%s) + TIMEOUT_SECONDS ))
while [ "$(date +%s)" -lt "${deadline}" ]; do
# `kubectl get ... -o jsonpath` returns the empty string
# (not an error) if the annotation is missing — exactly
# the placeholder state we want to wait through.
token_source=$(kubectl -n "${TARGET_NAMESPACE}" get secret "${SECRET_NAME}" \
-o "jsonpath={.metadata.annotations.catalyst\.openova\.io/token-source}" 2>/dev/null || true)
if [ "${token_source}" = "${TOKEN_SOURCE_ANNOTATION_VALUE}" ]; then
echo "[wait-for-cutover-token] cutover annotation present — Step 09 has minted the real Gitea API token; proceeding"
exit 0
fi
echo "[wait-for-cutover-token] annotation not yet set (current: '${token_source}'); retrying in ${INTERVAL_SECONDS}s"
sleep "${INTERVAL_SECONDS}"
done
# Timeout is a HARD FAIL — provisioning with the placeholder
# admin password as GITHUB_TOKEN cannot satisfy ANY tenant
# request. Better to keep the Pod in Init: than to serve 401s.
echo "[wait-for-cutover-token] FATAL: timed out after ${TIMEOUT_SECONDS}s waiting for cutover Step 09 to patch ${TARGET_NAMESPACE}/${SECRET_NAME}" >&2
exit 1
resources:
requests: { cpu: 10m, memory: 32Mi }
limits: { memory: 128Mi }
securityContext:
allowPrivilegeEscalation: false
runAsNonRoot: true
runAsUser: 65534
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
{{- end }}
containers:
- name: provisioning
image: "{{ if .Values.global.imageRegistry }}{{ .Values.global.imageRegistry }}{{ else }}{{ .Values.images.registry }}{{ end }}/{{ .Values.images.organization }}/services-provisioning:{{ .Values.images.smeTag }}"

@@ -1267,6 +1267,34 @@ smeServices:
githubToken:
secretName: provisioning-github-token
secretKey: GITHUB_TOKEN
# waitForCutoverToken: gate the provisioning Pod on bp-self-sovereign-
# cutover Step 09 having patched the Secret with a REAL Gitea API
# token (TBD-V11, issue #2002). Without this gate, the Pod starts with
# the chart's first-install bootstrap value (Gitea admin password,
# which Gitea rejects as an API token → 401 "user does not exist")
# and the FIRST tenant Org CR creation after install fails. The
# init container polls the Secret in the same namespace for the
# `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`
# annotation that Step 09 stamps.
#
# On Sovereigns: leave enabled (the cutover Job will mint and
# patch within minutes of bp-catalyst-platform reconciling).
# On Catalyst-Zero (contabo): set enabled=false in
# clusters/contabo-mkt/ overlay — the SealedSecret-provisioned token
# carries no cutover annotation and Step 09 never runs there.
waitForCutoverToken:
enabled: true
# image: kubectl client used by the init container. Default tracks
# platform/self-sovereign-cutover/chart/values.yaml's kubectl image
# so a platform-wide bump only needs to land in one place at a time.
image: harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4
# intervalSeconds + timeoutSeconds: poll cadence + wall-clock cap.
# Default 10s × 180 iters = 1800s (30m) — covers the slowest cutover
# observed on Hetzner cold-start (~18 min) with 12 min safety margin.
# The Pod stays in Init: until Step 09 runs; failing closed is the
# correct behaviour because serving 401s blocks every tenant journey.
intervalSeconds: 10
timeoutSeconds: 1800
# git.{apiURL,owner,repo,branch}: Git host coordinates. The
# provisioning binary uses GITHUB_API_URL when non-empty (Sovereign
# path → in-cluster Gitea REST API) and otherwise falls back to the