openova/clusters/_template/bootstrap-kit/19a-bp-sandbox.yaml
hatiyildiz 88453dc4c2 fix(sandbox-controller): emit canonical SANDBOX_* env vars for MCP plugin (Refs #1986)
TBD-P4 B4 — env-var name drift between the sandbox-controller and the
MCP plugin silently degraded every MCP tool family to "not configured"
at runtime. The controller emitted bare `ORG_ID` and `SOVEREIGN_FQDN`
on every rendered MCP Deployment while the MCP binary
(products/sandbox/mcp-server/internal/tools/env.go) reads the
namespaced canonical `SANDBOX_ORG_ID` / `SANDBOX_SOVEREIGN_FQDN`. Per
agent a99ea3aa's investigation, six additional env-var families the
MCP requires were never wired at all.

Surgical alignment across renderer + chart + controller wiring:

1. core/controllers/sandbox/internal/gitops/manifests.go — MCP
   Deployment template renamed the bare names AND grew env entries
   for the canonical set the MCP plugin reads:

   Rename (MCP Deployment only; pty-server StatefulSet keeps the bare
   names since they are inherited into the user's agent shell — that
   is a distinct contract):
     ORG_ID         -> SANDBOX_ORG_ID            (tool family: all)
     SOVEREIGN_FQDN -> SANDBOX_SOVEREIGN_FQDN    (tool family: all)

   Added (the MCP plugin was reading them; controller wasn't emitting):
     SANDBOX_ID                    -> identifies the Sandbox CR
     SANDBOX_NAMESPACE             -> rendered ns sandbox-<owner-uid>
     SANDBOX_TENANT_ID             -> scopes marketplace/byod handler
     SANDBOX_GITEA_BASE_URL        -> sandbox.deploy / gitea tool family
     SANDBOX_GITEA_TOKEN (secret)  -> ditto, via secretKeyRef optional
     SANDBOX_DOMAIN_API_URL        -> marketplace tool family
     SANDBOX_MARKETPLACE_API_URL   -> marketplace tool family
     SANDBOX_STORAGE_S3_ENDPOINT   -> sandbox.storage tool family
     SANDBOX_STORAGE_S3_REGION     -> ditto
     SANDBOX_STORAGE_S3_USE_TLS    -> ditto
     SANDBOX_STORAGE_S3_ACCESS_KEY -> ditto, via secretKeyRef optional
     SANDBOX_STORAGE_S3_SECRET_KEY -> ditto, via secretKeyRef optional
     KEYCLOAK_ADMIN_URL            -> sandbox.auth tool family
     KEYCLOAK_PARENT_REALM         -> ditto
     KEYCLOAK_ADMIN_TOKEN (secret) -> ditto, via secretKeyRef optional
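
For illustration, the env block on a rendered MCP Deployment might look
roughly like the sketch below. The values, the Secret name and the Secret
key are hypothetical placeholders; only the variable names and the
`optional: true` secretKeyRef pattern are taken from the list above.

```yaml
# Sketch only: values, Secret name and key are hypothetical.
env:
  - name: SANDBOX_ORG_ID                  # previously emitted as ORG_ID
    value: "example-org"
  - name: SANDBOX_SOVEREIGN_FQDN          # previously emitted as SOVEREIGN_FQDN
    value: "sov.example.com"
  - name: SANDBOX_GITEA_TOKEN
    valueFrom:
      secretKeyRef:
        name: example-gitea-credentials   # hypothetical Secret name
        key: token                        # hypothetical key
        optional: true                    # container still starts if the Secret is absent
```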

2. platform/sandbox/chart — bp-sandbox HR surfaces the new wiring as
   chart-level values (mcp.giteaBaseURL, mcp.domainAPIURL,
   mcp.storage.*, mcp.keycloak.*) defaulting to the in-cluster Service
   DNS of a stock Sovereign install. Per-Sovereign overlays may
   override any value. Secrets are NEVER written from this chart —
   name+key references only with `optional: true` so a fresh-prov
   Sovereign with a credential source in flight does NOT crash the
   per-Sandbox MCP Pod; the affected tool family surfaces a clean
   "not configured" error at call time (matches the MCP plugin's
   existing per-tool guard pattern).
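
As a rough sketch of the override path, a per-Sovereign overlay could
patch the HelmRelease values as shown below. It assumes the chart reads
these keys directly under spec.values; both URLs are hypothetical, and the
mcp.storage.* and mcp.keycloak.* keys are left out because their exact
names are not spelled out here.

```yaml
# Sketch only: a merge patch a per-Sovereign overlay might carry.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-sandbox
  namespace: flux-system
spec:
  values:
    mcp:
      giteaBaseURL: https://gitea.sov.example.com        # hypothetical URL
      domainAPIURL: https://domain-api.sov.example.com   # hypothetical URL
```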

3. Chart.yaml + bootstrap-kit pin (19a-bp-sandbox.yaml) bumped to
   0.2.0 so the per-Sovereign overlay picks up the new env surface
   on the next reconcile.

4. sandbox_controller_test.go — extended deployment-mcp.yaml assertion
   block to assert the canonical SANDBOX_* env-var set + value
   plumbing AND added a negative assertion that the bare `ORG_ID` /
   `SOVEREIGN_FQDN` names MUST NOT appear on the MCP Deployment
   (they remain on the pty-server StatefulSet, distinct contract).
   Regression test against future re-introduction of the drift.

Validation:
 - go test ./sandbox/... — all green (controller / gitops / idlescaler
   / newapi / sandboxapi).
 - helm template platform/sandbox/chart --set enabled=true ... — clean
   render, 16 SANDBOX_MCP_* env vars emitted on the controller
   Deployment.

Hard rules honoured:
 - READ-ONLY against existing cluster (no kubectl writes).
 - No Secret writes — name+key references only, all `optional: true`.
 - emrah.baysal mailbox + Stalwart admin untouched.
 - Principle #12 fresh clone validation.

Refs #1986

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:00:57 +02:00


# bp-sandbox — Catalyst bootstrap-kit Blueprint slot 19a (post-harbor).
#
# Deploys the sandbox-controller (Wave 1 + Wave 8) on a Sovereign so
# that `sandbox.openova.io/v1.Sandbox` CRs are actually reconciled.
# Wave 8 extends the controller to ALSO render per-Sandbox pty-server
# StatefulSet + MCP Deployment + Service + HTTPRoute (architecture.md
# §7) — without this slot enabled, every Sandbox CR sits unreconciled.
#
# ─── Slot history: 61 → 19a (Wave 11 convergence fix, 2026-05-18) ────
# Originally slot 61. Caught live on t16.omantel.biz: bp-sandbox HR
# stuck Reconciling because its chart pull went through
# harbor.<sov-fqdn> (bp-self-sovereign-cutover Step-06 phase-1 rewrites
# every HelmRepository URL `oci://ghcr.io/openova-io` →
# `oci://harbor.<sov-fqdn>/openova-io` after handover), but
# harbor.<sov-fqdn> wasn't reachable yet because bp-harbor itself hadn't
# reached Ready: a chicken-and-egg cycle. Same failure shape as Wave 7 #1610 with
# bp-hcloud-csi (REMOVED — see kustomization.yaml comment block).
#
# Fix here is the cleaner long-term cousin of the Wave 7 hotfix:
# instead of removing the slot, sequence it AFTER bp-harbor (slot 19)
# by renumbering to 19a + adding `bp-harbor` to dependsOn. Once
# bp-harbor is Ready (its chart pull goes through harbor.openova.io,
# the mothership-warmed proxy-cache wired into k3s registries.yaml at
# cloud-init time — NOT through harbor.<sov-fqdn>, so no cycle there),
# this slot's chart pull can resolve against either ghcr.io
# (pre-cutover) or harbor.<sov-fqdn> (post-cutover) and find the
# artifact. The cutover Step-06 phase-1 URL rewrite is safe by then.
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-sandbox
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-sandbox
  namespace: flux-system
  labels:
    catalyst.openova.io/slot: "19a"
    catalyst.openova.io/component: sandbox-controller
spec:
  interval: 15m
  releaseName: sandbox
  targetNamespace: catalyst-system
  dependsOn:
    - name: bp-vcluster-helmrepo
    - name: bp-catalyst-platform
    # bp-harbor (slot 19, Wave 11 convergence fix 2026-05-18): sandbox's
    # chart pull goes through harbor.<sov-fqdn> after the post-handover
    # cutover Step-06 phase-1 HelmRepository URL rewrite. Without this
    # edge, source-controller hits harbor.<sov-fqdn> before bp-harbor
    # is Ready, the OCI fetch 503s, and bp-sandbox sits Reconciling for
    # the entire bootstrap-kit timeout window, preventing the umbrella
    # Kustomization from ever reaching Ready. Same chicken-and-egg as
    # Wave 7 #1610 (bp-hcloud-csi, REMOVED) but resolved by sequencing
    # rather than removal so the slot remains available for Wave 11
    # Sandbox MVP without manual Day-2 add-app re-introduction.
    - name: bp-harbor
  chart:
    spec:
      chart: sandbox
      version: 0.2.0
      sourceRef:
        kind: HelmRepository
        name: bp-sandbox
        namespace: flux-system
  install:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  # Per-Sovereign overlay surface.
  #
  # enabled: default-ON via ${SANDBOX_ENABLED:-true} on the
  # bootstrap-kit Kustomization substitute. Wave 11 convergence fix
  # (TBD-D11, t22.omantel.biz 2026-05-18): every Sandbox CR sat
  # unreconciled because the bootstrap-kit Kustomization's substitute
  # map never wires SANDBOX_ENABLED, so the envsubst resolved to the
  # `:-false` fallback and the chart skip-rendered the entire
  # controller Deployment. With Wave 8 pty-server + MCP images now
  # SHA-stamped in chart values.yaml (auto-bumped by
  # .github/workflows/build-sandbox-{pty-server,mcp-server}.yaml),
  # the gate's original purpose is satisfied, so flip default-ON and
  # the controller materialises on every fresh prov. Operators may
  # still opt-OUT by setting `SANDBOX_ENABLED=false` on the
  # per-Sovereign overlay's substitute map (mirrors how
  # MARKETPLACE_ENABLED works in slot 13).
  #
  # runtime.*: Wave 8 pty-server / MCP / NEWAPI wiring. The
  # controller surfaces these to its per-Sandbox renderer (manifests
  # rendered into the per-Org `catalyst-tenant` Gitea repo at
  # sandbox/<owner-uid>/).
  #
  # Image overrides are OMITTED from this slot's HR values: the
  # chart's values.yaml already SHA-pins both images (auto-bumped by
  # CI) and exposing them as substitute vars without the corresponding
  # entries in the bootstrap-kit Kustomization postBuild.substitute
  # map causes Flux to substitute empty strings → null → the chart's
  # `required` guard would fail render once enabled=true. Day-2 SHA
  # overrides remain available via Sovereign-overlay HelmRelease
  # patches under spec.values.runtime.{ptyServerImage,mcpImage}, but
  # the canonical path is bumping chart values.yaml + bootstrap-kit
  # pin (single source of truth, INVIOLABLE-PRINCIPLES.md #4a).
  values:
    enabled: ${SANDBOX_ENABLED:-true}
    env:
      hostCluster: ${SOVEREIGN_REGION_CANONICAL_LABEL}
      sovereignFQDN: ${SOVEREIGN_FQDN}
      # TBD-D35c (Wave 32 verifier fix): comma-separated list of
      # NewAPI channel names the controller stamps as `allowed_channels`
      # on every per-Sandbox token mint. Default `qwen` matches the
      # only channel bp-newapi's channel-seed-job.yaml writes on a
      # fresh Sovereign install (alias for `qwen3.6-bankdhofar`,
      # products/sandbox/docs/newapi-proxy-contract.md §2).
      # Per-Sovereign overlays MUST extend this list to mirror their
      # channel rollout (e.g. `qwen,anthropic,openai`); the chart's
      # NoAllowedChannels guard fails every mint if this resolves to
      # empty.
      newapiDefaultChannels: ${SANDBOX_DEFAULT_CHANNELS:-qwen}
    runtime:
      newapiURL: https://newapi.${SOVEREIGN_FQDN}/v1
      # D31 active-hot-standby: when SOVEREIGN_ENABLE_HOT_STANDBY=true on
      # the per-Sovereign overlay (and both regions are non-empty AND
      # distinct), sandbox.db.provision materialises a primary + replica
      # Cluster.postgresql.cnpg.io pair instead of a single Cluster
      # (mirrors the bp-cnpg-pair pattern + bp-wordpress-tenant chart
      # 0.2.0+). Same trio of envsubst placeholders bp-catalyst-platform
      # slot 13 consumes for the marketplace tenant path: flipping one
      # knob on the per-Sovereign overlay covers BOTH paths so HA stays
      # consistent across the marketplace tenant install and the
      # sandbox.db plane. Default empty = single-Cluster CNPG (zero
      # regression). Region keys MUST match the canonical
      # openova.io/region node label value (e.g. `hz-fsn-rtz-prod`).
      cnpg:
        activeHotStandby:
          enabled: ${SOVEREIGN_ENABLE_HOT_STANDBY:-}
          primaryRegion: ${SOVEREIGN_PRIMARY_REGION:-}
          replicaRegion: ${SOVEREIGN_REPLICA_REGION:-}
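#
# Illustrative sketch only (an assumption, not applied by this file):
# a per-Sovereign overlay could enable hot standby, or opt out of this
# slot, through the substitute map of the bootstrap-kit Kustomization.
# The Kustomization name and the replica region label are hypothetical;
# the variable names are the ones referenced in the comments above.
#
#   apiVersion: kustomize.toolkit.fluxcd.io/v1
#   kind: Kustomization
#   metadata:
#     name: bootstrap-kit                # hypothetical name
#     namespace: flux-system
#   spec:
#     postBuild:
#       substitute:
#         SANDBOX_ENABLED: "true"        # set to "false" to opt out
#         SOVEREIGN_ENABLE_HOT_STANDBY: "true"
#         SOVEREIGN_PRIMARY_REGION: hz-fsn-rtz-prod
#         SOVEREIGN_REPLICA_REGION: example-replica-region   # hypothetical label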