openova/clusters/_template/bootstrap-kit/80-newapi.yaml
hatiyildiz 49bc87c200 fix(catalyst-bootstrap-api): wire CATALYST_NEWAPI_ADMIN_TOKEN + correct CATALYST_NEWAPI_ADDR (Refs #2021)
Bundles the two halves of the broken ADR-0003 §3.2 NewAPI admin-API
hook so the path goes from dormant-and-misconfigured to actually live:

1. catalyst-api Deployment (bp-catalyst-platform) now sets:
   - CATALYST_NEWAPI_ADDR = "http://newapi-bp-newapi.newapi.svc.cluster.local:3000"
     (literal — dual-mode Helm+Kustomize contract)
   - CATALYST_NEWAPI_ADMIN_TOKEN via secretKeyRef on
     `catalyst-newapi-admin-token` key ADMIN_API_TOKEN (optional:true)

2. bp-newapi ExternalSecret target now carries emberstack/reflector
   mirror annotations (default reflector-allowed-namespaces =
   "catalyst-system") so the Secret rendered in the `newapi`
   namespace is materialised in the catalyst-api Pod's namespace
   (same cross-namespace seam as sme-secrets / catalyst-gitea-token).

3. main.go default URL fallback corrected from the NXDOMAIN
   `http://newapi.newapi.svc` to the canonical Service URL
   `http://newapi-bp-newapi.newapi.svc.cluster.local:3000` (same
   root cause as TBD-V14 / PR #2017: bp-newapi.fullname renders
   `<Release.Name>-<Chart.Name>` and bootstrap-kit slot 80 sets
   `releaseName: newapi` against chart `bp-newapi`).

4. newapi/client.go godoc + main.go comments updated to the
   correct Service URL.
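
A minimal sketch of the container env that item 1 describes. The
container name is illustrative and the chart's real template may be
laid out differently; the variable names, Secret name, and key are the
ones named in this commit:

```yaml
# Sketch only: fragment of the catalyst-api Deployment pod spec.
containers:
  - name: catalyst-api            # assumed container name
    env:
      - name: CATALYST_NEWAPI_ADDR
        value: "http://newapi-bp-newapi.newapi.svc.cluster.local:3000"
      - name: CATALYST_NEWAPI_ADMIN_TOKEN
        valueFrom:
          secretKeyRef:
            name: catalyst-newapi-admin-token
            key: ADMIN_API_TOKEN
            optional: true        # Pod still starts if the mirrored Secret is absent
```

With `optional: true`, a missing Secret leaves the variable unset
instead of blocking Pod start, which is why the reflector mirror in
item 2 has to land for the hook to be live.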

Chart lockstep (Inviolable Principle #14):
  - bp-newapi             1.4.32  -> 1.4.33
  - bp-catalyst-platform  1.4.224 -> 1.4.225
  - bootstrap-kit pins both in lockstep.

Validation:
  - go test ./internal/newapi/... ./internal/handler/... PASS
  - go build ./cmd/api/                                   PASS
  - helm template products/catalyst/chart/ renders
    CATALYST_NEWAPI_ADDR=http://newapi-bp-newapi.newapi.svc.cluster.local:3000
    + CATALYST_NEWAPI_ADMIN_TOKEN secretKeyRef on
    catalyst-newapi-admin-token/ADMIN_API_TOKEN.
  - kubectl kustomize products/catalyst/chart/templates/ renders
    the same env vars (dual-mode contract preserved).
  - helm template platform/newapi/chart/ -s templates/external-secret.yaml
    --api-versions=external-secrets.io/v1beta1 renders the
    reflector annotations on target.template.metadata.annotations.
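
For orientation, a sketch of what the rendered ExternalSecret might
look like. The resource name and the two `reflection-auto-*`
annotations are assumptions (the commit only states that mirror
annotations defaulting to `catalyst-system` are present); the store,
remote key, and property are taken from the bootstrap-kit values:

```yaml
# Sketch only: not the chart's actual rendered output.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: catalyst-newapi-admin-token   # assumed resource name
  namespace: newapi
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-region1
  target:
    name: catalyst-newapi-admin-token
    template:
      metadata:
        annotations:
          # Allow mirroring, restricted to the catalyst-api namespace.
          reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
          reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: "catalyst-system"
          # Assumed: auto-create the mirror rather than require a stub Secret.
          reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
          reflector.v1.k8s.emberstack.com/reflection-auto-namespaces: "catalyst-system"
  data:
    - secretKey: ADMIN_API_TOKEN
      remoteRef:
        key: sovereign/<fqdn>/newapi/admin-token
        property: ADMIN_API_TOKEN
```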

Per CLAUDE.md §0 anti-theater discipline, this PR uses Refs #2021
(NOT Closes). The issue closes only after an operator on a fresh
provision walks /console/sme/users -> Add User, observes
`sme-users: NewAPI admin client wired` at catalyst-api startup, and
sees the row transition to state=newapi_created (no
`newapi client not wired` sentinel, no NXDOMAIN for
`newapi.newapi.svc`).

Refs #2021

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:59:48 +02:00

# bp-newapi — Catalyst Application Blueprint, bootstrap-kit slot 80.
# Multi-tenant LLM marketplace gateway. Ships in backend-only mode: the
# OpenAI-compatible API at api.<sovereign-fqdn>/v1/* is customer-facing;
# the upstream's portal UI is disabled at ingress; Catalyst replaces it
# as the customer surface; NewAPI's admin UI at admin.<sovereign-fqdn>
# is exposed only to ops staff (Keycloak-gated).
#
# This slot enables the SME-tenant turnkey experience (epic #795). The
# Catalyst signup hook (delivered by unified-rbac in #802 against the
# contract recorded in ADR-0003) reads the `catalyst-newapi-admin-token`
# Secret rendered by this chart's ExternalSecret to issue per-user API
# keys against NewAPI's admin API at
# `http://newapi-bp-newapi.newapi.svc.cluster.local:3000` (canonical
# in-cluster Service URL — the bp-newapi `<Release.Name>-<Chart.Name>`
# helper renders `newapi-bp-newapi` for `releaseName: newapi` against
# chart `bp-newapi`; pre-TBD-V15 / #2021 this comment cited the
# wrong bare-`newapi` Service name).
#
# Wrapper chart: platform/newapi/chart/
# Catalyst-curated values: platform/newapi/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
---
apiVersion: v1
kind: Namespace
metadata:
  name: newapi
  labels:
    catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-newapi
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-newapi
  namespace: flux-system
  labels:
    catalyst.openova.io/slot: "80"
spec:
  interval: 15m
  releaseName: newapi
  targetNamespace: newapi
  # bp-newapi depends on:
  #   - bp-openbao(08): the secret backend the chart's ExternalSecret
  #     pulls `ADMIN_API_TOKEN` from. Without OpenBao Ready, the
  #     ExternalSecret never resolves and the Catalyst signup hook can't
  #     reach the NewAPI admin API.
  #   - bp-keycloak(09): the OIDC issuer for the ops-staff admin UI at
  #     admin.<sovereign-fqdn>. Without Keycloak Ready, the OIDC
  #     middleware can't redirect ops-staff requests.
  #   - bp-cnpg(16): operator provisions the Postgres cluster for users,
  #     credits, channels, and audit log via a Crossplane
  #     PostgresqlInstance claim once cnpg is Ready. The DSN is mounted
  #     into NewAPI via `database.existingSecret` (operator-set).
  dependsOn:
    - name: bp-openbao
    - name: bp-keycloak
    - name: bp-cnpg
  chart:
    spec:
      chart: bp-newapi
# 1.4.0 (issue #943, 2026-05-05): auto-provision CNPG-backed
# Postgres + chart-emitted SESSION_SECRET/CRYPTO_SECRET so a
# Sovereign install lands a real Pod without operator intervention.
# Pre-#943 the Deployment silently skipped render whenever
# database.existingSecret OR credentials.existingSecret was
# empty (the bootstrap-kit overlay supplies neither), so NewAPI
# never came up and alice signup gate 5 (LLM) timed out. Both
# auto-provisions are capability-gated on bp-cnpg's CRD and
# operator-overridable per Inviolable Principle #4.
# 1.3.0: defaultChannels.qwenBankDhofar (channel #1 = Qwen3.6 @
# https://llm-api.omtd.bankdhofar.com) + post-install/post-upgrade
# `channel-seed` Helm hook Job that idempotently POSTs default
# channels into NewAPI's admin API. Issue #915 (epic SME tenant
# integration DoD: alice → OpenClaw → NewAPI → Qwen3.6@BankDhofar
# end-to-end).
# 1.2.0: Traefik Middleware gated behind ingress.middleware.enabled.
# 1.4.1 (issue #952, 2026-05-05): Pod imagePullSecrets templated +
# default to `[{name: ghcr-pull}]` so kubelet authenticates pulls
# of the PRIVATE newapi-mirror + metering-sidecar images. Paired
# with cloud-init adding `newapi` to flux-system/ghcr-pull's
# reflector auto-namespaces list.
# 1.4.2 (qa-loop bounded-cycle audit prov #7 Gap F, 2026-05-10):
# `.Values.newapi.image.tag` repointed from `v0.4.5` (fictitious —
# never built by any CI workflow) to `v0.13.2` (actual upstream
# Calcium-Ion/new-api Docker Hub release, mirrored into
# ghcr.io/openova-io/openova/newapi-mirror by the new
# `.github/workflows/build-bp-newapi.yaml` workflow). Pre-1.4.2
# the NewAPI Pod ImagePullBackOff'd 403 on every fresh Sovereign,
# blocking alice signup gate 5 (LLM).
# 1.4.4 (qa-loop bounded-cycle audit prov #20 Fix #138, 2026-05-11):
# add pre-install/pre-upgrade hook that polls the external-secrets
# validating-admission webhook until it returns a structured HTTP
# response — closes the race between bp-external-secrets reaching
# HR Ready=True and the apiserver-side EndpointSlice for the
# webhook Service being observable. Pre-1.4.4 the chart's
# ExternalSecret apply was rejected with `no endpoints available
# for service "external-secrets-webhook"` on every fresh provision,
# blocking the chart from reaching Ready and the Catalyst signup
# hook (ADR-0003 §3.2) from finding the admin-token Secret.
# 1.4.10 (fix-convergence-wave11, 2026-05-18): gate the
# defaultChannels.qwenBankDhofar entry on attestation-complete
# rather than hard-failing the helm template. Pre-1.4.10 the
# chart raised `commercial-contract attestation requires accountId`
# on every Sovereign that opted in to marketplace
# (MARKETPLACE_ENABLED=true) without ALSO supplying a signed
# commercial contract's `LLM_BANK_DHOFAR_ACCOUNT_ID` /
# `LLM_BANK_DHOFAR_CONTRACT_REF` envsubst variables. Post-1.4.10
# the chart silently skips the qwenBankDhofar channel when
# attestation is incomplete; once the operator overlay supplies
# the attestation values the channel composes on the next
# reconcile.
# 1.4.12 (PR #1677, 2026-05-18): default
# `.Values.sandboxTokenSigningKey.reflectorNamespaces` flipped
# from `"sandbox"` → `"catalyst-system,sandbox"`. Pre-1.4.12 the
# chart-emitted `newapi-bp-newapi-token-signing-key` Secret was
# mirrored only into a `sandbox` namespace (which does NOT exist
# on a stock Sovereign — bp-sandbox installs into
# `catalyst-system` per slot 19a `targetNamespace`); the sandbox-
# controller's `NEWAPI_ADMIN_SECRET` env var (secretKeyRef
# `optional: true`) landed EMPTY, the controller silently dropped
# into gitops-only mode, and zero per-Sandbox LLM-gateway tokens
# were ever minted (operator-visible only via the controller's
# `newapi_admin_secret_set=false` startup log). Caught on t22
# 2026-05-18 (TBD-D14). Bumping the pin pulls the post-#1677
# default so reflector mirrors into `catalyst-system` too.
# 1.4.14 (current main, 2026-05-18): latest upstream-tracking
# chart cut — includes 1.4.12's reflector fix.
# 1.4.19 (TBD-A12 #1798, 2026-05-18): add startupProbe so kubelet
# does NOT SIGKILL the binary at the 50s mark while GORM
# AutoMigrate is still in-flight on the freshly-provisioned empty
# `newapi` CNPG database. Pre-1.4.19 the empty DB on t22 sat with
# ZERO tables after 29 CrashLoopBackOff restarts — every kill
# raced AutoMigrate's first CREATE TABLE call mid-TLS-handshake;
# pg_stat_activity on the CNPG primary showed no `newapi` user
# connections because the kill happened before the GORM
# connection pool's first wire write completed. Probe budget:
# 30 × 10s = 5 min, comfortably above the observed 60-120s
# ceiling on cpx21/cpx31 nodes with sslmode=require.
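#
# A minimal sketch of a probe matching that budget, assuming a TCP
# check on the container port (the probe type and port the chart
# actually uses are not shown in this file and are assumptions):
#
#   startupProbe:
#     tcpSocket:
#       port: 3000            # assumed: same port as the Service URL
#     periodSeconds: 10       # one check every 10s
#     failureThreshold: 30    # 30 failures x 10s = 300s (5 min) before restart
#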
# TBD-A39 #1834 (2026-05-19): bp-newapi 1.4.27 replaces the
# Helm-`lookup`-based DSN Secret render (which raced CNPG on
# first install and committed an empty password — t32 newapi
# Pod was 21x CrashLoopBackOff with `password authentication
# failed for user "newapi"`) with a post-install Job that polls
# `<cluster>-app` and PATCHes the SQL_DSN bytes. Canonical
# database-secret-sync-job pattern lifted from
# platform/gitea/chart/templates/database-secret-sync-job.yaml
# (issue #830 Bug 2) + platform/wordpress-tenant/chart/templates/
# database-secret-sync-job.yaml (issue #1786).
# 1.4.29 (TBD-A52 #1944): default Valkey URL was
# `valkey.valkey.svc.cluster.local` which is NXDOMAIN — the
# bp-valkey bitnami chart with architecture=replication exposes
# `valkey-primary` / `valkey-replicas` / `valkey-headless`, not a
# plain `valkey` Service. Caused 31× CrashLoopBackOff on t34.
# bp-newapi 1.4.29 ships the corrected
# `valkey-primary.valkey.svc.cluster.local` default.
# 1.4.31 (TBD-V21 #2032, 2026-05-20): extend default
# `sandboxTokenSigningKey.reflectorNamespaces` to include the
# `sandbox-.*` regex pattern so emberstack/reflector mirrors the
# SIGNING_KEY Secret into every per-Sandbox namespace. Paired with
# bp-sandbox 0.3.2 which mounts SIGNING_KEY as the MCP's
# `SANDBOX_JWT_SECRET` env (closes auth-gate-stays-in-test-mode
# silent-breakage).
# 1.4.33 (TBD-V15 #2021, 2026-05-20): catalyst-newapi-admin-token
# ExternalSecret target now carries reflector mirror annotations
# (default to `catalyst-system`) so the rendered Secret is
# available in the catalyst-api Pod's namespace via secretKeyRef.
# Companion to bp-catalyst-platform 1.4.225 which adds the
# secretKeyRef itself + the corrected CATALYST_NEWAPI_ADDR
# literal (`http://newapi-bp-newapi.newapi.svc.cluster.local:3000`).
      version: 1.4.33
      sourceRef:
        kind: HelmRepository
        name: bp-newapi
        namespace: flux-system
  # Event-driven install per docs/INVIOLABLE-PRINCIPLES.md #3 (Flux
  # dependsOn is the gate, not Helm timeout). NewAPI itself starts in
  # ~10 s once the Postgres DSN Secret is present; the long pole is
  # waiting for the operator's Crossplane claim to materialise the DB.
  install:
    timeout: 15m
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    timeout: 15m
    disableWait: true
    remediation:
      retries: 3
# Per-Sovereign overrides — the operator MUST supply at install time:
# - ingress.host = api.${SOVEREIGN_FQDN}
# - ingress.adminHost = admin.${SOVEREIGN_FQDN}
# - auth.adminUI.keycloak.issuer = https://auth.${SOVEREIGN_FQDN}/realms/ops
# - database.existingSecret = Postgres DSN Secret (from the
# Crossplane PostgresqlInstance claim)
# - credentials.existingSecret = SESSION_SECRET + CRYPTO_SECRET
# (rotated via OpenBao)
# - catalystIntegration.externalSecret.remoteRef.key
# = sovereign/${SOVEREIGN_FQDN}/newapi/admin-token
# - defaultChannels.vllm.enabled = true (first-otech)
# - defaultChannels.vllm.endpoint
# + defaultChannels.vllm.attestation.owner
#
# Defaults below wire the first-otech provider channel to the same
# upstream the OpenOva marketing site uses (Qwen via Axon →
# `llm-api.omtd.bankdhofar.com`, model `qwen3-coder`); the operator
# overlay overrides any of these by setting them in this HelmRelease's
# spec.values.
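#
# A hedged sketch of the two overrides from the list above that the
# values below do not set, written as an operator overlay. Both Secret
# names are hypothetical placeholders, not names the chart defines:
#
#   spec:
#     values:
#       database:
#         existingSecret: newapi-postgres-dsn       # hypothetical name
#       credentials:
#         existingSecret: newapi-session-secrets    # hypothetical name
#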
  values:
    sovereignFQDN: ${SOVEREIGN_FQDN}
    ingress:
      host: api.${SOVEREIGN_FQDN}
      adminHost: admin.${SOVEREIGN_FQDN}
      tls:
        enabled: true
        issuer: letsencrypt-prod
# Cilium Gateway HTTPRoute for `newapi.<fqdn>` (TBD-D35d, issue
# #1778). Sandbox runtimes hit the LLM gateway at the URL the
# sandbox controller mints into their environment
# (`NEWAPI_BASE_URL=https://newapi.${SOVEREIGN_FQDN}/v1`). Without
# this HTTPRoute the marketplace `tenant-wildcard` (hostnames=
# `*.${SOVEREIGN_FQDN}`) absorbs every newapi.${SOVEREIGN_FQDN}
# request and forwards to the storefront `console` Service —
# blocking the entire BYOS Claude Code journey at the LLM gate.
# An exact-hostname HTTPRoute outranks the wildcard per Gateway
# API spec, so enabling this on every Sovereign restores LLM
# reachability without touching the marketplace wildcard.
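#
# For orientation, a sketch of the kind of HTTPRoute this toggle is
# expected to produce. The route name, the Gateway in parentRefs, and
# the backend port are assumptions, not values read from the chart:
#
#   apiVersion: gateway.networking.k8s.io/v1
#   kind: HTTPRoute
#   metadata:
#     name: newapi                    # hypothetical
#     namespace: newapi
#   spec:
#     parentRefs:
#       - name: sovereign-gateway     # hypothetical Gateway name
#         namespace: kube-system      # hypothetical
#     hostnames:
#       - newapi.<sovereign-fqdn>     # exact host, more specific than the wildcard
#     rules:
#       - backendRefs:
#           - name: newapi-bp-newapi
#             port: 3000              # assumed from the in-cluster Service URL
#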
    httpRoute:
      enabled: true
      host: newapi.${SOVEREIGN_FQDN}
    auth:
      adminUI:
        mode: keycloak
        keycloak:
          issuer: https://auth.${SOVEREIGN_FQDN}/realms/ops
          clientId: newapi-admin
          existingSecret: newapi-oidc
      customerAPI:
        keyIssuer: catalyst
    catalystIntegration:
      enabled: true
      existingSecret: catalyst-newapi-admin-token
      externalSecret:
        enabled: true
        refreshInterval: "1h"
        secretStoreRef:
          kind: ClusterSecretStore
          name: vault-region1
        remoteRef:
          # Canonical OpenBao path per docs/INVIOLABLE-PRINCIPLES.md #4.
          # Under the `vault-region1` store's `secret/` mount the full
          # path is `secret/sovereign/<fqdn>/newapi/admin-token`.
          key: sovereign/${SOVEREIGN_FQDN}/newapi/admin-token
          property: ADMIN_API_TOKEN
# Default channels — chart-side composition (channel #1 first).
#
# `qwenBankDhofar` (issue #915) is the canonical first channel:
# Qwen3.6 hosted at BankDhofar (https://llm-api.omtd.bankdhofar.com,
# model `qwen3-coder` / alias `qwen3.6`) — the SAME relay the
# OpenOva marketing site's Axon helmrelease consumes
# (openova-private/clusters/contabo-mkt/apps/axon/helmrelease.yaml).
# Disabled in the template so a fresh Sovereign does not silently
# wire customers to a third-party endpoint; per-Sovereign overlays
# (clusters/<sovereign>/bootstrap-kit/80-newapi.yaml) enable this
# block and supply:
# - defaultChannels.qwenBankDhofar.enabled = true
# - defaultChannels.qwenBankDhofar.endpoint = https://llm-api.omtd.bankdhofar.com
# - defaultChannels.qwenBankDhofar.attestation.accountId (legal-team-owned)
# - defaultChannels.qwenBankDhofar.attestation.contractRef (legal-team-owned)
# - the Secret `newapi-channel-qwen-bankdhofar` containing the
# upstream API key under key `API_KEY` (or an ExternalSecret
# pulling from OpenBao at
# `sovereign/<sovereign-fqdn>/newapi/channel-qwen-bankdhofar`)
# - auth.adminUI.masterKeySecret = name of a Secret carrying
# `MASTER_KEY` (NewAPI bootstrap admin auth) — required for
# the channel-seed Helm hook Job to POST against the admin API
# ONCE at install time. Operator may rotate the master key out
# post-bootstrap; channels persist in Postgres.
#
# When the operator flips `qwenBankDhofar.enabled: true`, the
# chart's post-install/post-upgrade `channel-seed` Job probes
# NewAPI's admin API (`/api/channel/?keyword=<name>`) and POSTs
# the channel definition idempotently. Re-runs after upgrades
# are no-ops once the channel exists.
#
# The legacy `vllm` slot (in-cluster vLLM fallback) remains for
# operators that run their own bp-vllm + open-weight model in-
# cluster; it composes after `qwenBankDhofar` and any operator
# `.Values.channels`.
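#
# A minimal sketch of the channel-key Secret described above. The
# `newapi` namespace is an assumption based on this release's
# targetNamespace, and the value is a placeholder:
#
#   apiVersion: v1
#   kind: Secret
#   metadata:
#     name: newapi-channel-qwen-bankdhofar
#     namespace: newapi               # assumed
#   type: Opaque
#   stringData:
#     API_KEY: "<upstream-api-key>"   # placeholder, never commit a real key
#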
# Sandbox Wave 4 (2026-05-18, retry of sandbox-wave4-newapi-sovereign-install):
# qwenBankDhofar is now gated on `${MARKETPLACE_ENABLED:-false}` — the
# same envsubst variable bp-catalyst-platform (slot 13) reads to flip
# marketplace.enabled on the Catalyst control plane. This lets a
# franchised Sovereign with `MARKETPLACE_ENABLED=true` auto-seed the
# default Bank Dhofar Qwen3.6 channel without the operator having to
# supply per-Sovereign overlay values. The endpoint defaults to the
# canonical first-otech relay; `LLM_BANK_DHOFAR_BASE_URL` overrides
# it (e.g. for staging at https://omtd.bankdhofar.com). The upstream
# API key MUST be present in the Secret `newapi-channel-qwen-bankdhofar`
# under key `API_KEY` — either pre-seeded by cloud-init or pulled from
# OpenBao via the operator's ExternalSecret at path
# `sovereign/<fqdn>/newapi/channel-qwen-bankdhofar`. Sandbox agents
# (sandbox-wave4) depend on this channel being live on every Sovereign
# that opted in to marketplace; without it the agents fall back to
# mothership newapi, defeating the per-Sovereign sandboxing.
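#
# A sketch of how these variables might be supplied, assuming the
# substitution is done by a Flux Kustomization's postBuild step. The
# Kustomization name and every value are hypothetical; interval, path,
# prune, and sourceRef are omitted for brevity:
#
#   apiVersion: kustomize.toolkit.fluxcd.io/v1
#   kind: Kustomization
#   metadata:
#     name: bootstrap-kit             # hypothetical
#     namespace: flux-system
#   spec:
#     postBuild:
#       substitute:
#         MARKETPLACE_ENABLED: "true"
#         LLM_BANK_DHOFAR_ACCOUNT_ID: "<account-id>"       # placeholder
#         LLM_BANK_DHOFAR_CONTRACT_REF: "<contract-ref>"   # placeholder
#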
    defaultChannels:
      qwenBankDhofar:
        enabled: ${MARKETPLACE_ENABLED:-false}
        name: qwen3.6-bankdhofar
        endpoint: ${LLM_BANK_DHOFAR_BASE_URL:-https://llm-api.omtd.bankdhofar.com}
        models:
          - qwen3.6
          - qwen3-coder
        existingSecret: newapi-channel-qwen-bankdhofar
        existingSecretKey: API_KEY
        attestation:
          kind: commercial-contract
          accountId: ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
          contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}
      vllm:
        enabled: false
        name: qwen
        endpoint: ""
        models:
          - qwen3-coder
        attestation:
          kind: in-cluster
          owner: ${SOVEREIGN_FQDN}