fix(bp-sme): wait for gitea user-bootstrap before provisioning starts (Closes #2002) (#2008)
Some checks are pending
Build & Deploy Catalyst / build-ui (push) Waiting to run
Build & Deploy Catalyst / build-api (push) Waiting to run
Build & Deploy Catalyst / deploy (push) Blocked by required conditions
Vendor-coupling guardrail / Vendor-coupling guardrail (push) Waiting to run
Cluster bootstrap-kit drift guardrail / Detect bootstrap-kit drift (push) Waiting to run
Phase-8a preflight C — Cilium Gateway HTTPRoute admission / Preflight Cilium HTTPRoute admission (push) Waiting to run
Test — Bootstrap Kit (kind cluster + Flux) / pin-sync-audit (push) Waiting to run
Test — Bootstrap Kit (kind cluster + Flux) / manifest-validation (push) Blocked by required conditions
Test — Bootstrap Kit (kind cluster + Flux) / kind-reconciliation (push) Blocked by required conditions
Test — Bootstrap Kit (kind cluster + Flux) / dependency-graph-audit (push) Waiting to run
TBD-V11 / Issue #2002. On t38 fresh prov, the sme/provisioning Pod logged
`HTTP 401 user does not exist [uid: 0, name: ""]` on the first tenant
Org CR creation.

Root cause: the provisioning Pod started with the chart's first-install
placeholder GITHUB_TOKEN (the Gitea admin password mirrored verbatim by
provisioning-github-token.yaml; enough to clear CreateContainerConfigError
but NOT a valid Gitea API token). Step 09 of bp-self-sovereign-cutover
later mints a real API token + patches the Secret + rollout-restarts the
Pod, but the FIRST tenant journey always 401'd because the Pod was already
serving with the bad placeholder.

Approach (B): add an init container `wait-for-cutover-token` to the SME
provisioning Deployment that polls the Secret for the cutover annotation
`catalyst.openova.io/token-source: self-sovereign-cutover-step-09`
(stamped by Step 09 alongside the minted token bytes). The Pod stays in
Init:0/1 until Step 09 has actually completed, then the main container
starts with a guaranteed-valid token. Default poll budget = 10s × 180 =
1800s (covers Hetzner cold-start ~18m + slack).

Why NOT HelmRelease.dependsOn:
- Per Principle #14, HR.dependsOn → Kustomization is silently ignored.
- bp-self-sovereign-cutover HR is dormant + disableWait:true: it goes
  Ready=True at install BEFORE Step 09's Job actually runs. Adding it to
  bp-catalyst-platform.dependsOn would buy nothing.
- Pod-level init gating waits on the actual condition (Secret annotation
  set by Step 09), not on a proxy.

Why NOT change bp-self-sovereign-cutover trigger order:
- Step 09 must run AFTER bp-catalyst-platform creates the Secret
  (otherwise the patch has no target). Reordering would break the
  inverse dependency.

Why NOT a Job that bootstraps the user upfront:
- Step 09 already mints the token; we don't need a second bootstrap.
- The bug is timing, not absence of bootstrap.

Files changed:
- products/catalyst/chart/templates/sme-services/provisioning.yaml:
  add initContainers block gated on
  smeServices.provisioning.waitForCutoverToken.enabled (default true).
  Re-uses existing `provisioning` SA (already has secrets get/list/watch
  in `sme` ns via sme-provisioning ClusterRole; no new RBAC).
- products/catalyst/chart/values.yaml: add
  smeServices.provisioning.waitForCutoverToken.{enabled,image,intervalSeconds,timeoutSeconds}
  block.
- products/catalyst/chart/Chart.yaml: bump 1.4.213 → 1.4.214 with full
  TBD-V11 changelog entry.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  HelmRelease pin 1.4.213 → 1.4.214 (chart bump only delivers the fix
  when the pin moves; TBD-A68 / 1.4.213 precedent).

Validation:
- `helm template` Sovereign-mode render shows the init container in the
  provisioning Deployment with kubectl-poll loop.
- Default-values smoke render unaffected (gate is
  ingress.marketplace.enabled=true; smoke uses defaults where false).
- `helm lint products/catalyst/chart/` passes.
- Contabo-Zero render path safe by construction (chart only renders the
  Deployment when ingress.marketplace.enabled=true; contabo doesn't
  enable marketplace via this chart).

Closes #2002. Refs #1829 (D29 tenant materialisation gate).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent a090477aa1
commit cdd7eac20a
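To check by hand whether Step 09 has already stamped the Secret, the same jsonpath query that the init container uses can be run from a workstation. The following is a minimal sketch, assuming `kubectl` points at the Sovereign cluster and the default names from the commit message (`sme`, `provisioning-github-token`) are in effect. The function name `cutover_token_ready` is hypothetical and not part of the chart.

```shell
# Hypothetical helper (not part of the chart): succeeds only when the Secret
# carries the annotation that Step 09 stamps next to the minted token.
cutover_token_ready() {
  ns="${1:-sme}"
  secret="${2:-provisioning-github-token}"
  # A missing annotation (or a failed lookup) is treated as an empty string,
  # which is the placeholder state the gate waits through.
  src=$(kubectl -n "$ns" get secret "$secret" \
    -o 'jsonpath={.metadata.annotations.catalyst\.openova\.io/token-source}' 2>/dev/null || true)
  [ "$src" = "self-sovereign-cutover-step-09" ]
}
```

A possible use is `cutover_token_ready && echo "minted token in place" || echo "still on the placeholder"`. Other namespace or Secret names can be passed as the first and second argument.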
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
@@ -746,7 +746,24 @@ spec:
       # PR; the code fix is upstream in #1910. The CI auto-bump-
       # images job skipped controller images (TBD-A69 follow-up
       # tracks closing that gap).
-      version: 1.4.213
+      # 1.4.214 (TBD-V11 / #2002): add init container
+      # `wait-for-cutover-token` to the SME provisioning Deployment.
+      # The Pod now blocks on Secret sme/provisioning-github-token
+      # carrying `catalyst.openova.io/token-source:
+      # self-sovereign-cutover-step-09` (set by Step 09 of bp-self-
+      # sovereign-cutover when the real Gitea API token is minted
+      # + patched). Pre-fix on t38 the Pod started with the
+      # first-install placeholder (gitea admin password) and the
+      # FIRST tenant Org CR creation hit 401 `user does not exist
+      # [uid: 0, name: ""]` from Gitea. Pod-level init gating is
+      # the correct waitpoint — Principle #14: HelmRelease.dependsOn
+      # → Kustomization is silently ignored, and the cutover HR is
+      # dormant + disableWait:true so HR-level dependsOn would
+      # resolve Ready=True before Step 09 ever runs. Configurable
+      # via .Values.smeServices.provisioning.waitForCutoverToken.*
+      # (default enabled on Sovereigns; contabo overlay flips
+      # enabled=false because Step 09 never runs on Catalyst-Zero).
+      version: 1.4.214
       sourceRef:
         kind: HelmRepository
         name: bp-catalyst-platform
products/catalyst/chart/Chart.yaml
@@ -1,5 +1,51 @@
 apiVersion: v2
 name: bp-catalyst-platform
+# 1.4.214 — TBD-V11 / #2002 (2026-05-20): add init container
+# `wait-for-cutover-token` to the SME provisioning Deployment so the
+# Pod does NOT accept tenant requests until bp-self-sovereign-cutover
+# Step 09 (gitea-token-mint) has minted a real Gitea API token and
+# patched Secret `sme/provisioning-github-token` with the annotation
+# `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`.
+# Pre-fix on t38 (2026-05-19): provisioning Pod started with the
+# first-install placeholder (gitea-admin-secret.password) under
+# GITHUB_TOKEN, tenant Org CR creation fired BEFORE Step 09 had
+# patched, gitea returned `HTTP 401 user does not exist [uid: 0,
+# name: ""]`, Org CR stuck at Ready=False/GiteaOrgFailed. Step 09's
+# `rollout restart deploy/provisioning` eventually recovers — but
+# the FIRST tenant journey after install always 401'd.
+#
+# Fix design rationale: HelmRelease.dependsOn between bp-catalyst-
+# platform and bp-self-sovereign-cutover would NOT solve this — the
+# cutover HR is dormant (`disableWait: true`) and goes Ready=True at
+# install time, BEFORE Step 09's Job actually runs. Per Inviolable
+# Principle #14 we also can't depend on Kustomizations from HRs.
+# Pod-level init gating is the only correct waitpoint: the init
+# container kubectl-polls the Secret in `sme` ns for the cutover
+# annotation, exits 0 once present, and the main provisioning
+# container starts with a guaranteed-valid token.
+#
+# Configurable knobs in values.yaml under
+# `smeServices.provisioning.waitForCutoverToken`:
+# enabled (true default; set false on Catalyst-Zero/contabo where
+# the SealedSecret-provisioned token carries no cutover annotation
+# and Step 09 never runs)
+# image (alpine/k8s:1.31.4 — tracks platform/self-sovereign-cutover
+# values.yaml so platform-wide bumps land in one place)
+# intervalSeconds (10s default)
+# timeoutSeconds (1800s = 30m default; covers the slowest cutover
+# observed on Hetzner cold-start ~18 min with 12 min safety margin)
+#
+# RBAC: re-uses the existing `provisioning` ServiceAccount in `sme`,
+# which already carries `secrets: [get, list, watch]` via the
+# `sme-provisioning` ClusterRole. No new RBAC.
+#
+# Validation: `helm template` shows the initContainers block on
+# Sovereign installs; smoke render unaffected on contabo (where
+# `waitForCutoverToken.enabled: false` is the canonical overlay).
+# On t38 (or fresh prov) the sme/provisioning Pod stays in
+# `Init:0/1` until Step 09 stamps the annotation, then starts the
+# main container with the minted token; first tenant Org CR
+# materialises Ready=True/Materialized without retry.
 # 1.4.213 — TBD-A68 / #1997 (2026-05-20): bump organization-controller
 # image pin 72e3f08 → c9b58ea so the chart actually ships PR #1910's
 # gitea-client fix (POST /api/v1/orgs instead of the admin-only
@@ -1851,8 +1897,8 @@ name: bp-catalyst-platform
 # was already shipped on 1.4.197 (PR #1820 lineage); this completes
 # the data-layer side so the dropdown finally appears on multi-region
 # Sovereigns. Refs #1821, DoD D20.
-version: 1.4.213
-appVersion: 1.4.213
+version: 1.4.214
+appVersion: 1.4.214
 # 1.4.183 — fix(httproute): omit default sectionName so multi-zone
 # Sovereigns attach via Cilium Gateway hostname matcher (Closes #1884,
 # TBD-A30). Pre-1.4.183 every catalyst-system HTTPRoute pinned
products/catalyst/chart/templates/sme-services/provisioning.yaml
@@ -147,6 +147,97 @@ spec:
       serviceAccountName: provisioning
       imagePullSecrets:
         - name: ghcr-pull
+      # ── Init container: wait-for-cutover-token (TBD-V11, issue #2002) ──
+      # The provisioning-github-token Secret has TWO lifecycle states:
+      # 1. First-install bootstrap (this chart's
+      # provisioning-github-token.yaml writes the Gitea admin PASSWORD
+      # under GITHUB_TOKEN — enough to boot the Pod past
+      # CreateContainerConfigError but NOT a valid Gitea API token).
+      # 2. Cutover-owned (bp-self-sovereign-cutover Step 09 mints a real
+      # Gitea API token via POST /api/v1/users/{user}/tokens and
+      # patches this Secret + stamps the annotation
+      # `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`).
+      #
+      # Pre-fix on t38 (2026-05-19): provisioning Pod started at state (1),
+      # tenant Org CR creation fired BEFORE Step 09 had patched the Secret,
+      # gitea API call sent `Authorization: token <admin_password>`, Gitea
+      # 401'd with `user does not exist [uid: 0, name: ""]`, Org CR stuck
+      # at Ready=False/GiteaOrgFailed. Step 09's `rollout restart deploy/
+      # provisioning` recovers eventually, but the Pod accepting traffic
+      # in state (1) means the FIRST tenant request after install always
+      # fails until the rollout completes.
+      #
+      # Fix: gate the provisioning Pod on the Secret carrying the cutover
+      # annotation. Until Step 09 has minted + patched, the init container
+      # polls every {{ .Values.smeServices.provisioning.waitForCutoverToken.intervalSeconds | default 10 }}s
+      # (up to {{ .Values.smeServices.provisioning.waitForCutoverToken.timeoutSeconds | default 1800 }}s)
+      # and only exits 0 once `catalyst.openova.io/token-source` equals
+      # `self-sovereign-cutover-step-09` on Secret
+      # `{{ .Values.smeServices.provisioning.githubToken.secretName | default "provisioning-github-token" }}`
+      # in this namespace. On Catalyst-Zero (contabo) the SealedSecret-
+      # provisioned token carries no cutover annotation — operators set
+      # `.Values.smeServices.provisioning.waitForCutoverToken.enabled: false`
+      # via clusters/contabo-mkt/ overlay so the init container is a no-op
+      # there (defaults to true; Sovereigns inherit the gate).
+      #
+      # Per Inviolable Principle #14: this is Pod-level gating via init
+      # container, NOT HelmRelease.dependsOn → Kustomization (which Flux
+      # silently ignores). The cutover HR is dormant + disableWait:true so
+      # an HR-level dependsOn would resolve Ready=True at install time,
+      # BEFORE Step 09 actually runs — that approach buys nothing.
+      # Pod-level init gating waits on the actual condition (Secret
+      # annotation), not on a proxy.
+      {{- if .Values.smeServices.provisioning.waitForCutoverToken.enabled }}
+      initContainers:
+        - name: wait-for-cutover-token
+          image: {{ .Values.smeServices.provisioning.waitForCutoverToken.image | default "harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4" | quote }}
+          imagePullPolicy: IfNotPresent
+          env:
+            - name: TARGET_NAMESPACE
+              value: {{ .Values.smeServices.provisioning.gitToken.destNamespace | default "sme" | quote }}
+            - name: SECRET_NAME
+              value: {{ .Values.smeServices.provisioning.githubToken.secretName | default "provisioning-github-token" | quote }}
+            - name: TOKEN_SOURCE_ANNOTATION_VALUE
+              value: "self-sovereign-cutover-step-09"
+            - name: INTERVAL_SECONDS
+              value: {{ .Values.smeServices.provisioning.waitForCutoverToken.intervalSeconds | default 10 | quote }}
+            - name: TIMEOUT_SECONDS
+              value: {{ .Values.smeServices.provisioning.waitForCutoverToken.timeoutSeconds | default 1800 | quote }}
+          command: ["/bin/sh", "-c"]
+          args:
+            - |
+              set -eu
+              echo "[wait-for-cutover-token] gating on Secret ${TARGET_NAMESPACE}/${SECRET_NAME} carrying annotation catalyst.openova.io/token-source=${TOKEN_SOURCE_ANNOTATION_VALUE}"
+              deadline=$(( $(date +%s) + TIMEOUT_SECONDS ))
+              while [ "$(date +%s)" -lt "${deadline}" ]; do
+                # `kubectl get ... -o jsonpath` returns the empty string
+                # (not an error) if the annotation is missing — exactly
+                # the placeholder state we want to wait through.
+                token_source=$(kubectl -n "${TARGET_NAMESPACE}" get secret "${SECRET_NAME}" \
+                  -o "jsonpath={.metadata.annotations.catalyst\.openova\.io/token-source}" 2>/dev/null || true)
+                if [ "${token_source}" = "${TOKEN_SOURCE_ANNOTATION_VALUE}" ]; then
+                  echo "[wait-for-cutover-token] cutover annotation present — Step 09 has minted the real Gitea API token; proceeding"
+                  exit 0
+                fi
+                echo "[wait-for-cutover-token] annotation not yet set (current: '${token_source}'); retrying in ${INTERVAL_SECONDS}s"
+                sleep "${INTERVAL_SECONDS}"
+              done
+              # Timeout is a HARD FAIL — provisioning with the placeholder
+              # admin password as GITHUB_TOKEN cannot satisfy ANY tenant
+              # request. Better to keep the Pod in Init: than to serve 401s.
+              echo "[wait-for-cutover-token] FATAL: timed out after ${TIMEOUT_SECONDS}s waiting for cutover Step 09 to patch ${TARGET_NAMESPACE}/${SECRET_NAME}" >&2
+              exit 1
+          resources:
+            requests: { cpu: 10m, memory: 32Mi }
+            limits: { memory: 128Mi }
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 65534
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
+      {{- end }}
       containers:
         - name: provisioning
           image: "{{ if .Values.global.imageRegistry }}{{ .Values.global.imageRegistry }}{{ else }}{{ .Values.images.registry }}{{ end }}/{{ .Values.images.organization }}/services-provisioning:{{ .Values.images.smeTag }}"
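The fail-closed polling pattern used by the init container (poll at a fixed interval, give up once a wall-clock deadline passes) can be exercised without a cluster. The sketch below assumes a POSIX shell; `poll_until` and its argument order are hypothetical illustrations and do not appear in the chart.

```shell
# poll_until INTERVAL BUDGET CMD [ARGS...]
# Hypothetical helper: re-runs CMD every INTERVAL seconds. Returns 0 on the
# first success and 1 once BUDGET seconds of wall-clock time have elapsed.
poll_until() {
  interval="$1"
  budget="$2"
  shift 2
  deadline=$(( $(date +%s) + budget ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$@"; then
      return 0          # condition met: let the caller proceed
    fi
    sleep "$interval"
  done
  return 1              # fail closed: the condition never became true
}
```

With the chart defaults this would correspond to `poll_until 10 1800 <check>`, which allows at most roughly 180 attempts because each check also consumes part of the budget.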
products/catalyst/chart/values.yaml
@@ -1267,6 +1267,34 @@ smeServices:
     githubToken:
       secretName: provisioning-github-token
       secretKey: GITHUB_TOKEN
+    # waitForCutoverToken: gate the provisioning Pod on bp-self-sovereign-
+    # cutover Step 09 having patched the Secret with a REAL Gitea API
+    # token (TBD-V11, issue #2002). Without this gate, the Pod starts with
+    # the chart's first-install bootstrap value (Gitea admin password,
+    # which Gitea rejects as an API token → 401 "user does not exist")
+    # and the FIRST tenant Org CR creation after install fails. The
+    # init container polls the Secret in the same namespace for the
+    # `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`
+    # annotation that Step 09 stamps.
+    #
+    # On Sovereigns: leave enabled (the cutover Job will mint and
+    # patch within minutes of bp-catalyst-platform reconciling).
+    # On Catalyst-Zero (contabo): set enabled=false in
+    # clusters/contabo-mkt/ overlay — the SealedSecret-provisioned token
+    # carries no cutover annotation and Step 09 never runs there.
+    waitForCutoverToken:
+      enabled: true
+      # image: kubectl client used by the init container. Default tracks
+      # platform/self-sovereign-cutover/chart/values.yaml's kubectl image
+      # so a platform-wide bump only needs to land in one place at a time.
+      image: harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4
+      # intervalSeconds + timeoutSeconds: poll cadence + wall-clock cap.
+      # Default 10s × 180 iters = 1800s (30m) — covers the slowest cutover
+      # observed on Hetzner cold-start (~18 min) with 12 min safety margin.
+      # The Pod stays in Init: until Step 09 runs; failing closed is the
+      # correct behaviour because serving 401s blocks every tenant journey.
+      intervalSeconds: 10
+      timeoutSeconds: 1800
     # git.{apiURL,owner,repo,branch}: Git host coordinates. The
     # provisioning binary uses GITHUB_API_URL when non-empty (Sovereign
     # path → in-cluster Gitea REST API) and otherwise falls back to the
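For clusters where Step 09 never runs, the gate is meant to be switched off through a values override. The sketch below writes such an override to a temporary file. The file location is hypothetical, and how the contabo overlay actually carries this value is not shown in this diff, so treat it as an illustration of the key path only.

```shell
# Write a hypothetical values override that disables the init-container gate.
overlay="$(mktemp)"
cat > "$overlay" <<'EOF'
smeServices:
  provisioning:
    waitForCutoverToken:
      enabled: false
EOF

# The override could then be passed to a render, for example:
#   helm template products/catalyst/chart -f "$overlay"
# (further values may be required for the chart to render; not verified here).
```

Because the template wraps the whole `initContainers` block in the `waitForCutoverToken.enabled` condition, a render with this override should contain no `name: wait-for-cutover-token` entry.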