TBD-P4 B4 — env-var name drift between the sandbox-controller and the
MCP plugin silently degraded every MCP tool family to "not configured"
at runtime. The controller emitted bare `ORG_ID` and `SOVEREIGN_FQDN`
on every rendered MCP Deployment while the MCP binary
(products/sandbox/mcp-server/internal/tools/env.go) reads the
namespaced canonical `SANDBOX_ORG_ID` / `SANDBOX_SOVEREIGN_FQDN`. Per
agent a99ea3aa's investigation, six additional env-var families the
MCP requires were never wired at all.

Surgical alignment across renderer + chart + controller wiring:

1. core/controllers/sandbox/internal/gitops/manifests.go — MCP
   Deployment template renamed the bare names AND grew env entries
   for the canonical set the MCP plugin reads:

   Rename (MCP Deployment only; pty-server StatefulSet keeps the bare
   names since they are inherited into the user's agent shell — that
   is a distinct contract):

     ORG_ID          -> SANDBOX_ORG_ID          (tool family: all)
     SOVEREIGN_FQDN  -> SANDBOX_SOVEREIGN_FQDN  (tool family: all)

   Added (the MCP plugin was reading them; controller wasn't emitting):

     SANDBOX_ID                     -> identifies the Sandbox CR
     SANDBOX_NAMESPACE              -> rendered ns sandbox-<owner-uid>
     SANDBOX_TENANT_ID              -> scopes marketplace/byod handler
     SANDBOX_GITEA_BASE_URL         -> sandbox.deploy / gitea tool family
     SANDBOX_GITEA_TOKEN (secret)   -> ditto, via secretKeyRef optional
     SANDBOX_DOMAIN_API_URL         -> marketplace tool family
     SANDBOX_MARKETPLACE_API_URL    -> marketplace tool family
     SANDBOX_STORAGE_S3_ENDPOINT    -> sandbox.storage tool family
     SANDBOX_STORAGE_S3_REGION      -> ditto
     SANDBOX_STORAGE_S3_USE_TLS     -> ditto
     SANDBOX_STORAGE_S3_ACCESS_KEY  -> ditto, via secretKeyRef optional
     SANDBOX_STORAGE_S3_SECRET_KEY  -> ditto, via secretKeyRef optional
     KEYCLOAK_ADMIN_URL             -> sandbox.auth tool family
     KEYCLOAK_PARENT_REALM          -> ditto
     KEYCLOAK_ADMIN_TOKEN (secret)  -> ditto, via secretKeyRef optional

2. platform/sandbox/chart — bp-sandbox HR surfaces the new wiring as
   chart-level values (mcp.giteaBaseURL, mcp.domainAPIURL,
   mcp.storage.*, mcp.keycloak.*) defaulting to the in-cluster Service
   DNS of a stock Sovereign install. Per-Sovereign overlays may
   override any value. Secrets are NEVER written from this chart —
   name+key references only with `optional: true` so a fresh-prov
   Sovereign with a credential source in flight does NOT crash the
   per-Sandbox MCP Pod; the affected tool family surfaces a clean
   "not configured" error at call time (matches the MCP plugin's
   existing per-tool guard pattern).

3. Chart.yaml + bootstrap-kit pin (19a-bp-sandbox.yaml) bumped to
   0.2.0 so the per-Sovereign overlay picks up the new env surface
   on the next reconcile.

4. sandbox_controller_test.go — extended deployment-mcp.yaml assertion
   block to assert the canonical SANDBOX_* env-var set + value
   plumbing AND added a negative assertion that the bare `ORG_ID` /
   `SOVEREIGN_FQDN` names MUST NOT appear on the MCP Deployment
   (they remain on the pty-server StatefulSet, distinct contract).
   Regression test against future re-introduction of the drift.
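
The negative assertion in item 4 can be sketched roughly as below. This
is an illustration only, not the code in sandbox_controller_test.go:
`listItemNames`, `checkMCPEnv`, and both sample manifests are
hypothetical. The point it shows is that names have to be matched
exactly, since a plain substring search for `ORG_ID` would also hit
`SANDBOX_ORG_ID`.

```go
package main

import (
	"fmt"
	"strings"
)

// listItemNames collects the value of every "- name: X" list item in a
// rendered manifest. It is a line-based sketch, not a YAML parser, so it
// would also pick up container names; exact matching keeps that harmless.
func listItemNames(manifest string) map[string]bool {
	names := map[string]bool{}
	for _, line := range strings.Split(manifest, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "- name: ") {
			names[strings.TrimSpace(strings.TrimPrefix(trimmed, "- name: "))] = true
		}
	}
	return names
}

// checkMCPEnv reports canonical names that are missing and bare names that
// are present. An empty result means the manifest passes.
func checkMCPEnv(manifest string) []string {
	names := listItemNames(manifest)
	var problems []string
	for _, want := range []string{"SANDBOX_ORG_ID", "SANDBOX_SOVEREIGN_FQDN"} {
		if !names[want] {
			problems = append(problems, "missing "+want)
		}
	}
	for _, banned := range []string{"ORG_ID", "SOVEREIGN_FQDN"} {
		if names[banned] {
			problems = append(problems, "bare name present: "+banned)
		}
	}
	return problems
}

// Sample inputs; the values are placeholders.
const goodManifest = `env:
  - name: SANDBOX_ORG_ID
    value: org-123
  - name: SANDBOX_SOVEREIGN_FQDN
    value: sov.example.test
`

const driftedManifest = `env:
  - name: ORG_ID
    value: org-123
  - name: SOVEREIGN_FQDN
    value: sov.example.test
`

func main() {
	fmt.Println("good:", checkMCPEnv(goodManifest))
	fmt.Println("drifted:", checkMCPEnv(driftedManifest))
}
```

A real test would more likely decode the rendered Deployment and walk
its container env list; the line scan here just keeps the sketch free
of external dependencies.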

Validation:

- go test ./sandbox/... — all green (controller / gitops / idlescaler
  / newapi / sandboxapi).
- helm template platform/sandbox/chart --set enabled=true ... — clean
  render, 16 SANDBOX_MCP_* env vars emitted on the controller
  Deployment.

Hard rules honoured:

- READ-ONLY against existing cluster (no kubectl writes).
- No Secret writes — name+key references only, all `optional: true`.
- emrah.baysal mailbox + Stalwart admin untouched.
- Principle #12 fresh clone validation.

Refs #1986
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
152 lines
6.9 KiB
YAML
# bp-sandbox — Catalyst bootstrap-kit Blueprint slot 19a (post-harbor).
#
# Deploys the sandbox-controller (Wave 1 + Wave 8) on a Sovereign so
# that `sandbox.openova.io/v1.Sandbox` CRs are actually reconciled.
# Wave 8 extends the controller to ALSO render per-Sandbox pty-server
# StatefulSet + MCP Deployment + Service + HTTPRoute (architecture.md
# §7) — without this slot enabled, every Sandbox CR sits unreconciled.
#
# ─── Slot history: 61 → 19a (Wave 11 convergence fix, 2026-05-18) ────
# Originally slot 61. Caught live on t16.omantel.biz: bp-sandbox HR
# stuck Reconciling because its chart pull went through
# harbor.<sov-fqdn> (bp-self-sovereign-cutover Step-06 phase-1 rewrites
# every HelmRepository URL `oci://ghcr.io/openova-io` →
# `oci://harbor.<sov-fqdn>/openova-io` after handover), but
# harbor.<sov-fqdn> wasn't reachable yet because bp-harbor itself
# hadn't reached Ready — chicken-and-egg. Same failure shape as
# Wave 7 #1610 with bp-hcloud-csi (REMOVED — see kustomization.yaml
# comment block).
#
# Fix here is the cleaner long-term cousin of the Wave 7 hotfix:
# instead of removing the slot, sequence it AFTER bp-harbor (slot 19)
# by renumbering to 19a + adding `bp-harbor` to dependsOn. Once
# bp-harbor is Ready (its chart pull goes through harbor.openova.io,
# the mothership-warmed proxy-cache wired into k3s registries.yaml at
# cloud-init time — NOT through harbor.<sov-fqdn>, so no cycle there),
# this slot's chart pull can resolve against either ghcr.io
# (pre-cutover) or harbor.<sov-fqdn> (post-cutover) and find the
# artifact. The cutover Step-06 phase-1 URL rewrite is safe by then.

---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-sandbox
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-sandbox
  namespace: flux-system
  labels:
    catalyst.openova.io/slot: "19a"
    catalyst.openova.io/component: sandbox-controller
spec:
  interval: 15m
  releaseName: sandbox
  targetNamespace: catalyst-system
  dependsOn:
    - name: bp-vcluster-helmrepo
    - name: bp-catalyst-platform
    # bp-harbor (slot 19, Wave 11 convergence fix 2026-05-18) — sandbox's
    # chart pull goes through harbor.<sov-fqdn> after the post-handover
    # cutover Step-06 phase-1 HelmRepository URL rewrite. Without this
    # edge, source-controller hits harbor.<sov-fqdn> before bp-harbor
    # is Ready, the OCI fetch 503s, and bp-sandbox sits Reconciling for
    # the entire bootstrap-kit timeout window — preventing the umbrella
    # Kustomization from ever reaching Ready. Same chicken-and-egg as
    # Wave 7 #1610 (bp-hcloud-csi, REMOVED) but resolved by sequencing
    # rather than removal so the slot remains available for Wave 11
    # Sandbox MVP without manual Day-2 add-app re-introduction.
    - name: bp-harbor
  chart:
    spec:
      chart: sandbox
      version: 0.2.0
      sourceRef:
        kind: HelmRepository
        name: bp-sandbox
        namespace: flux-system
  install:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  # Per-Sovereign overlay surface.
  #
  # enabled — default-ON via ${SANDBOX_ENABLED:-true} on the
  # bootstrap-kit Kustomization substitute. Wave 11 convergence fix
  # (TBD-D11, t22.omantel.biz 2026-05-18): every Sandbox CR sat
  # unreconciled because the bootstrap-kit Kustomization's substitute
  # map never wires SANDBOX_ENABLED, so the envsubst resolved to the
  # `:-false` fallback and the chart skip-rendered the entire
  # controller Deployment. With Wave 8 pty-server + MCP images now
  # SHA-stamped in chart values.yaml (auto-bumped by .github/workflows/
  # build-sandbox-{pty-server,mcp-server}.yaml), the gate's original
  # purpose is satisfied — flip default-ON so the controller materialises
  # on every fresh prov. Operators may still opt-OUT by setting
  # `SANDBOX_ENABLED=false` on the per-Sovereign overlay's substitute
  # map (mirrors how MARKETPLACE_ENABLED works in slot 13).
  #
  # runtime.* — Wave 8 pty-server / MCP / NEWAPI wiring. The
  # controller surfaces these to its per-Sandbox renderer (manifests
  # rendered into the per-Org `catalyst-tenant` Gitea repo at
  # sandbox/<owner-uid>/).
  #
  # Image overrides are OMITTED from this slot's HR values — the
  # chart's values.yaml already SHA-pins both images (auto-bumped by
  # CI) and exposing them as substitute vars without the corresponding
  # entries in the bootstrap-kit Kustomization postBuild.substitute
  # map causes Flux to substitute empty strings → null → the chart's
  # `required` guard would fail render once enabled=true. Day-2 SHA
  # overrides remain available via Sovereign-overlay HelmRelease
  # patches under spec.values.runtime.{ptyServerImage,mcpImage} — but
  # the canonical path is bumping chart values.yaml + bootstrap-kit
  # pin (single source of truth, INVIOLABLE-PRINCIPLES.md #4a).
  values:
    enabled: ${SANDBOX_ENABLED:-true}
    env:
      hostCluster: ${SOVEREIGN_REGION_CANONICAL_LABEL}
      sovereignFQDN: ${SOVEREIGN_FQDN}
      # TBD-D35c (Wave 32 verifier fix) — comma-separated list of
      # NewAPI channel names the controller stamps as `allowed_channels`
      # on every per-Sandbox token mint. Default `qwen` matches the
      # only channel bp-newapi's channel-seed-job.yaml writes on a
      # fresh Sovereign install (alias for `qwen3.6-bankdhofar`,
      # products/sandbox/docs/newapi-proxy-contract.md §2).
      # Per-Sovereign overlays MUST extend this list to mirror their
      # channel rollout (e.g. `qwen,anthropic,openai`) — the chart's
      # NoAllowedChannels guard fails every mint if this resolves to
      # empty.
      newapiDefaultChannels: ${SANDBOX_DEFAULT_CHANNELS:-qwen}
    runtime:
      newapiURL: https://newapi.${SOVEREIGN_FQDN}/v1
    # D31 active-hot-standby — when SOVEREIGN_ENABLE_HOT_STANDBY=true on
    # the per-Sovereign overlay (and both regions are non-empty AND
    # distinct), sandbox.db.provision materialises a primary + replica
    # Cluster.postgresql.cnpg.io pair instead of a single Cluster
    # (mirrors the bp-cnpg-pair pattern + bp-wordpress-tenant chart
    # 0.2.0+). Same trio of envsubst placeholders bp-catalyst-platform
    # slot 13 consumes for the marketplace tenant path — flipping one
    # knob on the per-Sovereign overlay covers BOTH paths so HA stays
    # consistent across the marketplace tenant install and the
    # sandbox.db plane. Default empty = single-Cluster CNPG (zero
    # regression). Region keys MUST match the canonical openova.io/
    # region node label value (e.g. `hz-fsn-rtz-prod`).
    cnpg:
      activeHotStandby:
        enabled: ${SOVEREIGN_ENABLE_HOT_STANDBY:-}
        primaryRegion: ${SOVEREIGN_PRIMARY_REGION:-}
        replicaRegion: ${SOVEREIGN_REPLICA_REGION:-}