TBD-A42 (issue #1905): the `tenant-wildcard` HTTPRoute in
products/catalyst/chart/templates/sme-services/marketplace-routes.yaml
claimed `*.<global.sovereignFQDN>` and routed every match to
sme/console:8080. On Cilium Gateway, the wildcard route shadowed
exact-match platform HTTPRoutes (auth.<sov> -> keycloak, console.<sov> ->
catalyst-ui, api.<sov> -> catalyst-api, pdns.<sov> -> powerdns,
grafana.<sov> -> grafana, etc.) even though Gateway API spec section
5.2.1 says exact wins over wildcard. Admission-order-dependent
precedence on t31 meant `auth.t31.omani.works` returned 4836B Astro
HTML (SME console SPA) instead of Keycloak's login page, blocking D4
SSO PIN-bounce (#1807). Same precedence-collision family as
A30/A40/A32.
Fix: replace the single `tenant-wildcard` HTTPRoute with N explicit
per-slug HTTPRoutes named `tenant-<slug>` with hostname
`<slug>.<global.sovereignFQDN>` EXACT - no wildcard, no shadowing
possible by construction. Slug list comes from a new operator-supplied
`ingress.marketplace.tenantSlugs[]` value, default empty list. With
the default, ZERO catch-all routes are emitted, so platform subdomains
(auth/console/api/...) can NEVER be hijacked.
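A minimal Python sketch of the per-slug rendering rule, for illustration only. The function names and the dict shape are hypothetical stand-ins for the Helm template; only the `tenant-<slug>` naming, the exact hostname, the catalyst-system namespace and the sme/console:8080 backend come from this message.

```python
def render_tenant_routes(tenant_slugs, sovereign_fqdn):
    """Return one exact-hostname HTTPRoute stub per slug (hypothetical shape)."""
    routes = []
    for slug in tenant_slugs:
        routes.append({
            "kind": "HTTPRoute",
            "name": f"tenant-{slug}",
            "namespace": "catalyst-system",
            # Exact hostname only: no "*." prefix is ever emitted.
            "hostnames": [f"{slug}.{sovereign_fqdn}"],
            "backend": {"namespace": "sme", "name": "console", "port": 8080},
        })
    return routes


def has_wildcard(routes):
    """True if any rendered hostname is a wildcard."""
    return any(h.startswith("*.") for r in routes for h in r["hostnames"])
```

With an empty slug list the function returns no routes at all, which mirrors the "default empty list emits zero catch-all routes" property.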
Per-tenant routes for Orgs created post-provision continue to be
written live by the organization-controller (templates/sme-services/
tenant-public-routes.yaml emits the byte-identical chart-side
analogue), so the SaaS-tenant traffic path is unchanged for any Org
the controller knows about.
marketplace-reference-grant.yaml already covers catalyst-system ->
sme/console - every new `tenant-<slug>` HTTPRoute is in
catalyst-system pointing at sme/console, so no grant change is needed.
Comment updated to note the wildcard->per-slug refactor.
Verified on t32 2026-05-19:
helm template ... --set ingress.marketplace.tenantSlugs={demo} \
| kubectl apply --dry-run=server
-> marketplace HTTPRoute configured + tenant-demo HTTPRoute created
Before fix the same template emitted `tenant-wildcard` with
`hostnames: ["*.t32.omani.works"]`; after fix, no catch-all is
rendered and `auth.t32.omani.works` is reachable by Keycloak's
exact-match HTTPRoute only.
Files changed:
- products/catalyst/chart/templates/sme-services/marketplace-routes.yaml
- products/catalyst/chart/values.yaml
- products/catalyst/chart/templates/sme-services/marketplace-reference-grant.yaml
- products/catalyst/chart/Chart.yaml (1.4.189 -> 1.4.190)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml (pin bump)
Closes #1905
Closes #1807
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1912 was theater for the D29 customer-journey blocker. It was titled
"fix catalyst-system → sme/newapi egress" but only added world TCP/6443
and never extended `.Values.security.baselineCnp.allowedPlatformNamespaces`.
t32 fresh-prov walk (af1da1e7, 2026-05-19) confirmed the live CNP still
listed only [keycloak gitea powerdns cnpg-system openbao harbor nats-system
loki mimir tempo alloy opentelemetry external-secrets-system cert-manager].
Console → `gateway.sme.svc:8080` returned 503 `context deadline exceeded`.
Fix: append `sme` + `newapi` to the values default, extend
`tests/baseline-cnp-allowlist.sh` with Cases 5c + 5d so any future
narrowing fails Blueprint Release CI before the OCI artifact ships, bump
Chart.yaml 1.4.188 → 1.4.189, bump bootstrap-kit pin 1.4.188 → 1.4.189.
15/15 chart-tests green (was 13). kubectl --dry-run=server validation passes.
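The kind of regression guard that Cases 5c + 5d add can be sketched in Python as follows. This is not the content of `tests/baseline-cnp-allowlist.sh` (a shell script not shown here); `missing_namespaces` and `REQUIRED_NAMESPACES` are hypothetical names.

```python
# Namespaces this fix appends to the allow-list default.
REQUIRED_NAMESPACES = {"sme", "newapi"}


def missing_namespaces(allowed, required=REQUIRED_NAMESPACES):
    """Return the required namespaces absent from a rendered allow-list, sorted."""
    return sorted(set(required) - set(allowed))
```

A CI gate built on this idea would fail whenever the returned list is non-empty.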
Closes #1920
Refs #1912
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-newapi): consume CNPG-managed app secret via sync-job (TBD-A39, Closes #1834)
D34 close-audit on t32 (2026-05-19) found newapi-bp-newapi in
CrashLoopBackOff (21 restarts) with `SASL auth: FATAL: password authentication failed
for user "newapi"`. Public probe to `newapi.t32.omani.works` returned
envoy 503 "no healthy upstream".
Root cause: chart's templates/cnpg-cluster.yaml rendered the DSN Secret
via Helm `lookup "v1" "Secret" .Release.Namespace <cluster>-app` at
template time. On every freshly-franchised Sovereign, CNPG materialises
the `<cluster>-app` source Secret only AFTER bp-newapi's HelmRelease
applies, so the first render's lookup returns nil and the chart commits
the Secret with an empty password — literally
`postgres://newapi:@newapi-bp-newapi-newapi-pg-rw.../newapi?sslmode=require`.
The Secret carries `helm.sh/resource-policy: keep`, so Flux NEVER
overwrites the empty bytes on subsequent reconciles even after CNPG
populates the source. The chart's own header comment claims "the
1-minute Flux reconcile picks it up on the next tick" — verified false
in production; `resource-policy: keep` pins the empty bytes.
Fix:
- platform/newapi/chart/templates/cnpg-cluster.yaml: drop the Helm
`lookup` + DSN composition. The DSN Secret renders as a chart-managed
empty placeholder so kubelet can satisfy the Deployment's secretKeyRef
on first schedule (kubelet only checks the key EXISTS).
- platform/newapi/chart/templates/database-secret-sync-job.yaml (NEW):
Helm post-install/post-upgrade Job + ServiceAccount + Role + Binding.
The Job polls `<cluster>-app` (up to 10 min via curl + in-pod SA
token), reads the `password` bytes, composes the canonical
`postgres://<user>:<password>@<host>:5432/<db>?sslmode=<mode>` string,
and strategic-merge PATCHes it into the placeholder. Idempotent.
- platform/newapi/chart/Chart.yaml: version 1.4.26 → 1.4.27 with full
changelog block.
- clusters/_template/bootstrap-kit/80-newapi.yaml: bp-newapi pin
1.4.26 → 1.4.27.
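A minimal sketch of the DSN composition step, assuming the sync Job has already read the `password` bytes. The function name and the example host are hypothetical; percent-encoding the credentials is an assumption of this sketch, not a documented step of the Job.

```python
from urllib.parse import quote


def compose_dsn(user, password, host, db, sslmode="require", port=5432):
    """Compose postgres://<user>:<password>@<host>:<port>/<db>?sslmode=<mode>."""
    if not password:
        # An empty password is the failure mode described above: refuse to emit it.
        raise ValueError("source Secret has no password yet")
    # Assumption: credentials are percent-encoded so '@' or '/' cannot break the URL.
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{db}?sslmode={sslmode}"
    )
```

Raising on an empty password, rather than composing `postgres://newapi:@...`, keeps the placeholder untouched until the source Secret is populated.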
Pattern lifted from
platform/gitea/chart/templates/database-secret-sync-job.yaml (canonical
seam: issue #830 Bug 2, proven on otech30) and
platform/wordpress-tenant/chart/templates/database-secret-sync-job.yaml
(issue #1786, proven on t26).
Validation:
- `helm dep update && helm template newapi .` renders cleanly with
the placeholder Secret + Job + SA + Role + RoleBinding.
- `kubectl apply --dry-run=server` against t32 apiserver accepts all
11 rendered objects (server dry run).
Refs: TBD-A39
Closes: #1834
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-newapi): bump blueprint.yaml lockstep version to 1.4.27
Sync platform/newapi/blueprint.yaml spec.version with the Chart.yaml
bump in the preceding commit.
TestBootstrapKit_BlueprintVersionLockstepSweep enforces that these two
stay aligned (TBD-A20, #1856).
Refs: TBD-A39
Refs: #1834
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The baseline-default-deny CiliumNetworkPolicy in catalyst-system listed
14 platform namespaces in its egress allow-list (keycloak, gitea,
powerdns, cnpg-system, openbao, harbor, nats-system, loki, mimir, tempo,
alloy, opentelemetry, external-secrets-system, cert-manager) but did NOT
include `sme`. The bp-sme-platform chart deploys the SME control-plane
into namespace `sme`, and console in catalyst-system reaches
`gateway.sme.svc.cluster.local:8080` for every voucher list / issue /
redeem call (plus admin reaches the same gateway for tenant onboarding).
Every such call was therefore dropped at the egress hook and timed out
at 5s, surfaced at the operator as 503 `context deadline exceeded` on
the voucher list / voucher issue panels.
Reproduction on t32 (2026-05-19, fresh prov, READ-ONLY):
$ kubectl exec -n catalyst-system catalyst-api-59d5cf5644-wrg4x \
-- curl -m 5 http://gateway.sme.svc.cluster.local:8080/healthz
000 time=5.002937
curl: (28) Connection timed out after 5002 milliseconds
Live CNP egress excerpt (kubectl get cnp -n catalyst-system
baseline-default-deny -o yaml | yq '.spec.egress[3]'):
toEndpoints:
- matchExpressions:
- key: k8s:io.kubernetes.pod.namespace
operator: In
values:
- keycloak ... - cert-manager # (no 'sme')
Fix: add `sme` to BOTH the values.yaml default
(`.Values.security.baselineCnp.allowedPlatformNamespaces`) AND the
template's `default (list ...)` fallback, so a Helm install with no
values overrides still renders the allow.
Originally masqueraded under #1748 (voucher list 503) and #1749 (voucher
issue 503) — those were thought to be services-build 502 regressions,
but this is a distinct CNP-misconfig bug class.
Validation:
- `helm template` confirms rendered CNP now lists `sme` in egress.
- `kubectl apply --dry-run=server` against t32 apiserver passes
("ciliumnetworkpolicy.cilium.io/baseline-default-deny configured").
Chart bumped 1.4.188 → 1.4.189; bootstrap-kit pin bumped to match.
No live patching on t32 — fix verified via server-side dry-run only,
per Principle #15.
Closes #1917
Refs #1748
Refs #1749
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Adds a NATS-publish hook to HandleCreateSandboxSession so every
successful Sandbox CR Create emits a canonical
`catalyst.tenant.sandbox_requested` event. Sandbox-controller already
consumes this subject (core/controllers/sandbox/internal/controller/
nats_bridge.go) and tenant-service's SandboxOrchestrator publishes it
from the CRM side, but the catalyst-api FE-driven create path was
silently bypassing the audit stream — the symptom #1776 calls out.
Surface added:
- TenantEvent payload {tenant_id, sandbox_id, requested_by,
timestamp, spec_hash} matching the existing audit.Event field
naming convention. spec_hash is SHA-256 over the canonical
JSON-serialised .spec for drift detection.
- TenantEventPublisher interface on the Handler (nil-tolerant: when
unset the publish-side is a no-op so CI without CATALYST_NATS_URL
still passes; production wiring binds a real publisher).
- SetTenantEventPublisher setter mirroring SetAuditBus.
- Constant SandboxRequestedSubject = "catalyst.tenant.sandbox_requested"
so producer + consumer + tests share one symbol.
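A sketch of a deterministic spec hash, assuming "canonical JSON" means sorted keys and compact separators. The actual Go canonicalisation is not shown in this message and may differ in detail.

```python
import hashlib
import json


def spec_hash(spec):
    """SHA-256 hex digest over a canonical JSON encoding of the spec."""
    # Sorting keys makes the encoding independent of map iteration order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two specs with the same content but different key order hash identically, while any change to a value changes the digest, which is the property drift detection relies on.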
Wiring:
- main.go: newTenantEventPublisherFromEnv placeholder identical in
shape to newRBACAuditPublisherFromEnv. Returns nil today because
catalyst-api ships without nats.go in go.mod; the real publisher
lands in the same follow-up slice that swaps the RBAC stub.
CATALYST_NATS_URL gates the wiring;
CATALYST_TENANT_NATS_SUBJECT_PREFIX lets operators override the
canonical prefix per INVIOLABLE-PRINCIPLES.md #4.
Tests (7 new in sandbox_sessions_nats_test.go):
- PublishesSandboxRequested: happy-path — exactly one publish on the
canonical subject with all fields populated.
- NoPublisher_DoesNotFail: nil-tolerant — Sandbox Create still 201s
when no publisher is wired (CI, chroot).
- PublishError_DoesNotFailRequest: a NATS outage logs + continues;
the HTTP response stays 201 since the CR write already succeeded.
- PublishUsesNamespaceWhenOrgEmpty: single-tenant chroot fallback —
tenant_id falls back to the namespace (NOT the orgSlug, which
collapses to "default" and would conflate every chroot).
- PublishUsesSubWhenEmailEmpty: requested_by falls back to claims.Sub
so the field is never blank.
- SpecHash_DeterministicAcrossMapOrder: spec_hash stable across map
iteration; changes when spec changes.
- Subject_MatchesIssueContract: pins the exact subject string per
#1776 against accidental drift.
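The fallback and nil-tolerance behaviour these tests pin can be sketched in Python. `build_event` and `publish_best_effort` are hypothetical names; the real handler is Go and its signatures are not shown here.

```python
SANDBOX_REQUESTED_SUBJECT = "catalyst.tenant.sandbox_requested"


def build_event(org, namespace, email, sub, sandbox_id, timestamp, spec_hash):
    """Build the payload, applying the two fallbacks described above."""
    return {
        "tenant_id": org or namespace,    # namespace when the org is empty
        "sandbox_id": sandbox_id,
        "requested_by": email or sub,     # claims.Sub when the email is empty
        "timestamp": timestamp,
        "spec_hash": spec_hash,
    }


def publish_best_effort(publisher, event, log=print):
    """Publish if a publisher is wired; never raise to the caller."""
    if publisher is None:
        return False  # no-op: CI or chroot without a publisher
    try:
        publisher(SANDBOX_REQUESTED_SUBJECT, event)
        return True
    except Exception as exc:
        # The CR write already succeeded, so a publish failure is logged only.
        log(f"sandbox_requested publish failed: {exc}")
        return False
```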
Sandbox-controller's consumer list (nats_bridge.go) already includes
this subject — no controller-side change required.
Closes #1776
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
* fix(bp-self-sovereign-cutover): post-cutover mirror re-sync CronJob (TBD-A37, Closes #1899)
Step-01 (gitea-mirror) only runs ONCE at cutover and produces a STANDALONE
local Gitea repo (PR #1029 — pull-mirror semantics block Step-06's
HelmRepository URL rewrite push). Without an ongoing re-sync, upstream
chart bumps merged AFTER cutover never reach the Sovereign.
Live regression on t31 2026-05-19 (A145 verifier): sandbox-controller
stuck at image :8017700 from 2026-05-16 even though PR #1862 had merged
2 days earlier with the NATS consume-leg — the upstream values.yaml
bump never crossed the seam.
This chart bump adds a gitea-mirror-resync CronJob (default schedule
"*/5 * * * *") that fires the same idempotent bare-clone + push
--mirror --force as Step-01 step (3) every 5 minutes. Pre-cutover
fires are no-ops (the script detects the local repo is missing /
empty and exits 0); post-cutover fires close the upstream → local
Gitea loop.
Why CronJob, not Gitea pull-mirror revival?
PR #1029 documented why Gitea pull-mirror was abandoned: pull-mirror
repos are read-only, blocking Step-06's HelmRepository URL rewrite
push. We need a writable local repo that ALSO refreshes from upstream
— the natural shape is a periodic force-push from a separate Job.
Why CronJob, not push-from-upstream webhook?
Slower to implement (requires GitHub App + webhook receiver on each
Sovereign + DNS for the webhook URL). Tracked as a future evolution
once stable; the CronJob is the minimal correct fix today.
Default 5m cadence covers the chart-bump → upstream-merge →
Sovereign-reconcile loop in ~10 min end-to-end while staying well
under GitHub anonymous-clone rate limits (300 req/hr per IP; one
Sovereign = 12 clones/hr). Per-Sovereign overlay knobs:
.Values.mirrorResync.schedule (cron string)
.Values.mirrorResync.suspended (bool, default false)
.Values.mirrorResync.jobTimeoutSeconds (default 900)
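The cadence arithmetic can be checked with a small Python helper. `runs_per_hour` is a hypothetical name, and the 300 req/hr figure is the one quoted in this message, not independently verified.

```python
def runs_per_hour(minute_step):
    """Number of fires per hour for a '*/N * * * *' cron schedule."""
    # '*/N' in the minute field matches minutes 0, N, 2N, ... below 60.
    return len(range(0, 60, minute_step))
```

For the default `*/5` schedule this gives 12 fires per hour, matching the "12 clones/hr" figure.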
No new RBAC — the CronJob re-uses the existing cutover runner SA
and the reflector-mirrored gitea-admin-secret that Step-01 already
mounts. concurrencyPolicy: Forbid + startingDeadlineSeconds: 60
keep parallel runs / replay storms harmless.
Verification:
- helm template test . renders cleanly (2509 lines, +52 from 0.1.32)
- tests/cutover-contract.sh all 20 gates GREEN (CronJob doesn't carry
the cutover-step labels so the "exactly 9 step ConfigMaps" assertion
still passes)
- scripts/check-bootstrap-kit-pin-sync.sh PASS (50 chart→pin pairs)
Chart 0.1.32 → 0.1.33; bootstrap-kit pin in
clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml
bumped to match.
Closes #1899
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-self-sovereign-cutover): bump blueprint.yaml lockstep to 0.1.33
TBD-A20 BlueprintVersionLockstepSweep CI gate caught the missing
blueprint.yaml bump on PR #1916 (the chart Chart.yaml was bumped to
0.1.33 but blueprint.yaml still pinned 0.1.32). Bringing the two in
lockstep so the test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-chart): propagate SMTP_USER/SMTP_PASS into notification Pod (TBD-X1, Refs #1793)
Wave 35 SMTP diagnostic root cause: notification.yaml only mounted
SMTP_HOST / SMTP_PORT / SMTP_FROM from sme-secrets, so the Go net/smtp
client dialed Stalwart without authentication. Stalwart's submission
listener rejected every message with 503 5.5.1 "You must authenticate
first" -> the (pre-companion-PR) fixed-60s retry storm slammed the
relay 3x per message x 5 tenants and tripped Stalwart's
[5 requests, 1000ms] rate-limiter for every tenant on the same relay.
The fix brings notification.yaml into line with auth.yaml, which has
consumed SMTP_USER and SMTP_PASS from sme-secrets since chart 1.4.20
(issue #934). Omitting them from this template was an oversight in the
same change-set.
The canonical SMTP-credentials propagation chain is already in place
and unchanged here:
mothership catalyst-openova-kc-credentials (key: smtp-user/smtp-pass)
-> sovereign_smtp_seed.go SeedSovereignSMTPCredentials
creates catalyst-system/sovereign-smtp-credentials on the new
Sovereign (Phase-1, idempotent)
-> sme-secrets.yaml lookup with source-wins precedence reads
smtp-user / smtp-pass and emits SMTP_USER / SMTP_PASS keys in
the per-tenant sme-secrets Secret
-> auth.yaml AND (now, this PR) notification.yaml mount those
two keys via secretKeyRef -> services-notification main.go reads
SMTP_USER + SMTP_PASS via getEnv() -> buildAuth wires
smtp.PlainAuth on every Send (companion PR services-notification
smtp.go).
Chart version bump 1.4.186 -> 1.4.187 per chart-release discipline.
helm template test-render products/catalyst/chart \
--set ingress.marketplace.enabled=true | grep SMTP_USER -A2
... shows both auth.yaml AND notification.yaml mount SMTP_USER from
sme-secrets keyed SMTP_USER (verified).
Companion PR: services-notification smtp.go upgrade to exponential
backoff + 3-in-90s circuit breaker so a future credential gap surfaces
loudly via ErrCircuitOpen and never restarts a rate-limiter storm.
Refs #1793
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bootstrap-kit): bump bp-catalyst-platform pin 1.4.186 -> 1.4.187 (TBD-X1, Refs #1793)
Chart bump in the previous commit changed Chart.yaml version:
1.4.186 -> 1.4.187 (TBD-X1 SMTP_USER/SMTP_PASS wiring). The
pin-sync-audit CI step caught the lockstep drift -- bootstrap-kit
HelmRelease.spec.chart.spec.version MUST match the chart's
Chart.yaml version exactly (see clusters/_template/bootstrap-kit/
13-bp-catalyst-platform.yaml header comment + feedback_21_principles).
Refs #1793
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 35 SMTP diagnostic root cause: sme-secrets lost SMTP_USERNAME /
SMTP_PASSWORD after sme stack redeploy. Notification pod's net/smtp
falls back to no-auth (Mailer.Auth was always nil, and main.go never
read SMTP_USER/SMTP_PASS from env) -> Stalwart returns 503 5.5.1 "You
must authenticate first" -> the prior fixed-60s retry loop slammed the
relay 3x per message x 5 tenants and tripped Stalwart's
[5 requests, 1000ms] rate-limiter for the whole submission listener.
This PR fixes the retry behaviour and surfaces auth state loudly:
1. Mailer.Auth is now wired via smtp.PlainAuth(SMTP_USER, SMTP_PASS, host),
read from env in NewMailer. If only one of the two variables is set,
or neither, NewMailer logs a slog.Warn and falls back to no-auth, so
the next 503 5.5.1 takes the LOUD error path instead of failing
silently on half-configured credentials.
2. Retry backoff is now exponential with a 30s floor (per issue spec
TBD-X1) and a 5-minute cap: 30s -> 60s -> 120s -> 240s -> 300s
(cap). Replaces the prior fixed 60s wait.
3. Circuit breaker (issue spec): 3 consecutive 503 5.5.1 responses
inside a 90s sliding window open the breaker. While open, Send()
short-circuits to ErrCircuitOpen for 120s cooldown -> the
notification consumer NACKs / dead-letters instead of slamming a
known-rate-limited relay. Window-aging means slow drips never
trip; a single 250 OK between storms resets the consecutive
counter via breakerResetOnSuccess.
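The backoff schedule in item 2 can be reproduced with a short Python sketch. `backoff_schedule` is a hypothetical name; the 30s floor and 300s cap are the values stated above.

```python
def backoff_schedule(attempts, base=30, cap=300):
    """Delays in seconds: start at the floor, double each time, clamp at the cap."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays
```

Five attempts yield 30, 60, 120, 240, 300; every later attempt stays at the 300s cap.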
All paths are test-seamed (sendMail / sleep / now). Tests cover:
- single-retry success keeps base backoff
- exponential doubling 30s -> 60s
- MaxBackoff cap on long storms
- breaker trips at exactly trip-th hit and aborts the in-flight retry
- short-circuit on subsequent Send while open
- cooldown elapses -> breaker re-closes via fakeNow advance
- slow-drip 503s age out of window and never trip
- non-rate-limit errors still pass through immediately (no retry)
- env-var parsing 30s floor preserved
- buildAuth half-config / both / neither matrix
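A simplified Python model of the breaker behaviour these tests cover, with an injectable clock in the spirit of the `now` test seam. The class and method names are hypothetical, and the model leaves out the interaction with in-flight retries.

```python
import time


class Breaker:
    """Opens after `trip` consecutive rate-limit hits inside `window` seconds."""

    def __init__(self, trip=3, window=90.0, cooldown=120.0, now=time.monotonic):
        self.trip, self.window, self.cooldown, self.now = trip, window, cooldown, now
        self.hits = []                     # timestamps of consecutive hits
        self.open_until = float("-inf")    # open while now() < open_until

    def is_open(self):
        return self.now() < self.open_until

    def record_rate_limited(self):
        t = self.now()
        # Age out hits older than the sliding window, then record this one.
        self.hits = [h for h in self.hits if t - h < self.window] + [t]
        if len(self.hits) >= self.trip:
            self.open_until = t + self.cooldown
            self.hits = []

    def record_success(self):
        # A single success resets the consecutive counter.
        self.hits = []
```

A caller would check `is_open()` before each send and short-circuit while it returns True.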
go test ./core/services/notification/...: ok
Deployment-side wiring (the notification.yaml chart template gaining
SMTP_USER + SMTP_PASS env from sme-secrets) ships in a separate PR.
Refs #1793
Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1861 widened LoadSMETenantParentDomainsFromEnv to seed all four
canonical .omani.X TLDs (homes, rest, trade, works), but on a real
Sovereign that env-stub fallback path is BYPASSED. The mothership
imports a full deployment record with only the operator-selected
sme-pool entry, and GET /api/v1/sovereign/parent-domains reads from
the imported record (dep.Request.ParentDomains), not the env stub.
Result on t31 (2026-05-19, c703247a0de12508): the on-disk record
holds 1 primary (omani.works) + 1 sme-pool (omani.homes) = 2 rows.
/parent-domains?role=sme-pool returns 1 entry instead of 4. A
customer picking .omani.rest or .omani.trade on the marketplace
/addons subdomain picker — both options the UI hard-codes — fails
SME tenant signup with 422 invalid-parent-domain.
Fix shape (same pattern as PR #1893 / D21 owner UserAccess
bake-time seed): on every chroot-mode catalyst-api startup AND on
every fresh handover import, top up Request.ParentDomains with any
missing canonical TLD as role=sme-pool. Idempotent (a re-run is a
no-op when the pool is already full); mothership mode (SOVEREIGN_FQDN
unset) is a hard no-op; persists to disk so a Pod restart sees the
topped-up shape.
Dedup is against existing role=sme-pool rows only — a role=primary
row on the same name does NOT count, because the customer-facing
/addons picker validates against role=sme-pool entries via
FindParentDomain. The t31 shape (primary=omani.works AND
sme-pool=omani.works needed) is the real-world case.
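A sketch of the top-up and dedup rule in Python. The `name`/`role` field names and the function name are hypothetical; the four TLDs and the "dedup against role=sme-pool only" rule come from this message.

```python
CANONICAL_TLDS = ("omani.homes", "omani.rest", "omani.trade", "omani.works")


def top_up_sme_pool(parent_domains, canonical=CANONICAL_TLDS):
    """Append any canonical TLD missing as role=sme-pool; return (rows, changed)."""
    # Only existing sme-pool rows count; a primary row on the same name does not.
    pooled = {d["name"] for d in parent_domains if d["role"] == "sme-pool"}
    missing = [name for name in canonical if name not in pooled]
    rows = list(parent_domains) + [{"name": n, "role": "sme-pool"} for n in missing]
    return rows, bool(missing)
```

Applied to the t31 shape (primary omani.works plus sme-pool omani.homes), the first call adds three sme-pool rows, including omani.works, and a second call changes nothing.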
Wired into two seams so a fresh prov AND a Pod restart both
converge: HandleDeploymentImport (post-import, fresh prov) and
restoreFromStore (per-record rehydration, Pod restart). Five guards
in chroot_parent_domains_seed_test.go: AllowedTLDs lockstep,
top-up shape (mirrors t31), idempotence, mothership no-op, nil-dep.
Drive-by: fixed a pre-existing build break in
sme_tenant_gitops.go's smeTenantBPKeycloak raw-string constant
(PR #1909 introduced literal backticks + a Go template action
inside a YAML comment; the action confused text/template at
render time → bp-keycloak.yaml render returned `unexpected EOF`).
Replaced with prose that describes the chart template behaviour
without inlining the template literal.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
The CNPG operator runs in the `cnpg-system` namespace, but the actual
Postgres workload Pods reconcile into the same namespace as the CNPG
`Cluster` CR — for the auto-provisioned-DB blueprints that's
`.Release.Namespace` (e.g. `newapi`, `harbor`). A NetworkPolicy egress
rule that namespace-selects on `cnpg-system` reaches the operator pods
only, NOT the Postgres workloads — every 5432 connection times out.
Verified live on t31: `newapi-bp-newapi-newapi-pg-1` runs in `newapi`
ns with label `cnpg.io/cluster=newapi-bp-newapi-newapi-pg`, while
`newapi-bp-newapi-…` is stuck 1/2 Ready with 20 restarts because its
egress NP allows 5432 only to `cnpg-system`.
Fix: every affected NP now selects the Postgres workload Pods by the
operator-emitted `cnpg.io/cluster=<clusterName>` Pod label — namespace-
agnostic, survives the operator namespace being different from the
data-plane namespace.
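The difference between the two selection strategies can be modelled in a few lines of Python. This is a toy model of selector matching, not NetworkPolicy evaluation; the function names and the pod dict shape are hypothetical.

```python
def namespace_rule_allows(pod, allowed_namespace):
    """Old rule: allow only pods living in the selected namespace."""
    return pod["namespace"] == allowed_namespace


def label_rule_allows(pod, key, value):
    """New rule: allow pods carrying the operator-emitted cluster label."""
    return pod["labels"].get(key) == value
```

For the Postgres pod observed on t31 (namespace `newapi`, label `cnpg.io/cluster=newapi-bp-newapi-newapi-pg`), the namespace rule on `cnpg-system` does not match while the label rule does.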
Charts fixed (4):
- bp-newapi (1.4.22 → 1.4.23) — auto-provisions CNPG Cluster in
`.Release.Namespace`. Removed the bogus `namespaceLabel: cnpg-system`
egress entry from values.yaml; added a podSelector-based rule
(cnpg.io/cluster=<release>-bp-newapi-newapi-pg) directly in the
template, gated by `.Values.cnpg.enabled`.
- bp-harbor (1.2.17 → 1.2.18) — Cluster CR in
`postgres.cluster.namespace | default .Release.Namespace` (default
`harbor`). Changed egress from namespaceSelector=cnpg to
podSelector cnpg.io/cluster=<postgres.cluster.name|default harbor-pg>.
- bp-matrix (1.0.0 → 1.0.1) — chart points at
matrix-postgres-rw.matrix.svc.cluster.local (Cluster CR in
`.Release.Namespace`). Replaced `cnpgNamespace` value with
`cnpgClusterName` (default `matrix-postgres`) and switched egress
rule to podSelector.
- bp-openmeter (1.0.0 → 1.0.1) — operator-supplied CNPG endpoint
pattern. Replaced `cnpgNamespace` with `cnpgClusterName` (default
`openmeter-pg`) and switched egress rule to podSelector. Same
pattern as matrix.
Audited and clean:
- bp-cnpg-pair: already uses podSelectors throughout.
- bp-wordpress-tenant: cnpgNamespaceLabel="" path resolves to
`.Release.Namespace` via the `cnpgNamespace` helper.
- bp-llm-gateway: already pod-selects on
`cnpg.io/cluster=bp-llm-gateway-audit`.
- bp-keycloak / bp-gitea / bp-grafana / bp-mimir: no own
networkpolicy.yaml template (grafana/mimir pass enabled=false
to upstream subcharts).
Validation:
- helm template render clean for all 4 charts.
- `kubectl apply --dry-run=server` on t31 — all 4 NetworkPolicies
accepted by the API server.
- Verbatim render confirms the auto-emitted cluster name matches the
label on the existing CNPG Pod (newapi-bp-newapi-newapi-pg).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
TBD-A45 — baseline-default-deny CNP world-egress block previously
allowed only 443/587/465/25, so catalyst-api fan-out to secondary
kube-apiservers on TCP/6443 (D5/D16/D20) silently timed out on the
informer reflector List() call and returned primary-only results.
A152 diagnostic on t31 (3-region fresh prov):
kubectl -n catalyst-system exec deploy/catalyst-api -- \
nc -zvw 3 49.12.210.78 6443
nc: connect to 49.12.210.78 port 6443 (tcp) timed out
vs. SAME endpoint from the bastion: open.
Fix:
- Add TCP/6443 to the world toEntities egress block in
templates/network-policies/baseline-catalyst-system.yaml. World scope
is correct per the OpenOva ClusterMesh model — inter-region link is
always DMZ over public IPs, secondary api-server LB FQDNs are
per-prov and unpredictable at chart-render time. Attack surface is
bounded by TLS client-cert auth (only secondary-region kubeconfigs
on the catalyst-api PVC hold valid certs).
- Extend tests/baseline-cnp-allowlist.sh (new Case 5b) so any future
narrowing of this block fails Blueprint Release publish CI before
the OCI artifact reaches a Sovereign.
- Bump chart 1.4.185 -> 1.4.186 with full Chart.yaml header changelog.
Real-cluster validation on t31 (primary, Cilium):
- kubectl apply -f rendered-cnp.yaml -> CNP patched
- nc from catalyst-api pod to 49.12.210.78:6443 -> open (was: timeout)
- nc from catalyst-api pod to 5.223.74.173:6443 -> open (was: timeout)
- catalyst-api rolled, new pod nc -> open (sticks across restarts)
chart/tests/baseline-cnp-allowlist.sh: 13/13 cases pass (was 12).
Closes #1908
Refs #1904 (this unblocks D5/D16/D20 fan-out RED)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Gitea 1.22+ no longer routes POST /api/v1/admin/orgs — that path is
GET-only (admin list) and returns 405 with `Allow: GET`. The supported
create endpoint is POST /api/v1/orgs (org-create-as-self): the
authenticated principal owns the new Org. Because the
organization-controller authenticates with the Gitea admin token
(catalyst-gitea-token, owner=gitea_admin), the admin user owns each
tenant Org — same semantic as the legacy admin path.
Symptom on t31: catalyst-organization-controller loops on
"gitea.EnsureOrg: create: gitea: POST .../api/v1/admin/orgs: HTTP 405",
blocking D29 Step 7 (tenant Gitea Org provisioning).
Real Gitea API proof (t31, Gitea 1.22.3):
- BEFORE: POST /api/v1/admin/orgs → 405 Method Not Allowed (Allow: GET)
- AFTER: POST /api/v1/orgs → 201 Created
- 422 on duplicate username → unchanged (still mapped to errAlreadyExists)
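The status handling above can be sketched in Python. The names are hypothetical, and treating an existing org as success in `ensure_org` is an assumption suggested by the `EnsureOrg` name rather than something this message confirms.

```python
class OrgAlreadyExists(Exception):
    """Raised when Gitea answers 422 for a duplicate org username."""


CREATE_ORG_REQUEST = ("POST", "/api/v1/orgs")  # replaces POST /api/v1/admin/orgs


def check_create_org_status(status):
    """Map the create-org HTTP status to an outcome."""
    if status == 201:
        return "created"
    if status == 422:
        raise OrgAlreadyExists("org already exists")
    raise RuntimeError(f"unexpected HTTP {status} from {CREATE_ORG_REQUEST[1]}")


def ensure_org(status):
    """Idempotent wrapper (assumption): an existing org counts as success."""
    try:
        return check_create_org_status(status)
    except OrgAlreadyExists:
        return "exists"
```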
Closes #1906
Refs TBD-A43
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1888 (TBD-A30) fixed catalyst-system HTTPRoutes for multi-zone
Sovereigns whose Cilium Gateway renames HTTPS listeners from `https` to
`https-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)
when more than one parent zone is enabled. Every public HTTPRoute pinned
to `sectionName: https` got `Accepted=False NoMatchingListener` and the
hosted service 404'd / connection-refused.
That fix only touched products/catalyst/chart. Per-blueprint HTTPRoutes
shipped the same `sectionName: https` default in values.yaml, so on a
multi-zone Sovereign every blueprint route — gitea, grafana, harbor,
keycloak, newapi, openbao, powerdns, stalwart-tenant — silently failed
to attach. TBD-A40 / issue #1902.
Sweep verbatim:
$ git grep -nE 'sectionName:[[:space:]]+(https|"https")[[:space:]]*$' \
platform/*/chart/ products/ clusters/ core/ 2>/dev/null \
| grep -v 'platform/gateway-api/chart/templates'
platform/gitea/chart/values.yaml:168: sectionName: https
platform/grafana/chart/values.yaml:124: sectionName: https
platform/harbor/chart/values.yaml:437: sectionName: https
platform/keycloak/chart/values.yaml:482: sectionName: https
platform/newapi/chart/values.yaml:721: sectionName: https
platform/openbao/chart/values.yaml:72: sectionName: https
platform/powerdns/chart/values.yaml:407: sectionName: https
platform/stalwart-tenant/chart/values.yaml:297: sectionName: https
products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go:802: sectionName: https
Fix (Option C — omit sectionName, same as PR #1888):
- 8 blueprint values.yaml defaults flipped from `sectionName: https` to
`sectionName: ""`. The chart templates already guard with `{{- with
.Values.gateway.parentRef.sectionName }}`, so a blank value drops the
field entirely and Cilium Gateway matches by hostname filter.
- platform/newapi/chart/templates/httproute.yaml was the outlier: it
used `default "https" $parent.sectionName` which fell back to `https`
even when values.yaml said empty. Rewritten to `{{- with
$parent.sectionName }}` so empty drops the field — same pattern as
the other 7 blueprints.
- products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
renders a per-tenant bp-keycloak HelmRelease and injected
`sectionName: https` into spec.values. Flipped to `sectionName: ""`
so the bp-keycloak chart's `{{- with }}` guard drops the field.
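Why the newapi template was the outlier can be shown by mimicking the two Helm idioms in Python. These functions only model the rendering outcome; they are hypothetical and not part of any chart.

```python
def render_with_guard(section_name):
    """Mimic `{{- with X }}`: emit the field only for a non-empty value."""
    return [f'sectionName: "{section_name}"'] if section_name else []


def render_with_default(section_name, fallback="https"):
    """Mimic `default "https" X`: an empty value silently becomes the fallback."""
    return [f'sectionName: "{section_name or fallback}"']
```

With an empty value the guard renders nothing, while the `default` form still renders `sectionName: "https"`, which is the behaviour that kept newapi pinned to a listener that no longer exists on multi-zone Sovereigns.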
Validation (real `helm template`, default values, gateway enabled, no
sectionName override) — Principle #15:
gitea : sectionName lines in rendered output = 0
grafana : sectionName lines in rendered output = 0
harbor : sectionName lines in rendered output = 0
keycloak : sectionName lines in rendered output = 0
openbao : sectionName lines in rendered output = 0
powerdns : sectionName lines in rendered output = 0
newapi : sectionName lines in rendered output = 0
stalwart-tenant : sectionName lines in rendered output = 0
Override path preserved — `--set ...parentRef.sectionName=https-omani-works`
on each chart renders `sectionName: "https-omani-works"` correctly,
so operators on single-zone clusters or non-Cilium gateways can still
pin explicitly via bootstrap-kit overlay.
helm lint clean on all 8 blueprint charts (newapi cnpg-cluster.yaml lint
error is pre-existing on origin/main, unrelated to this fix).
Chart bumps (each blueprint also bumps blueprint.yaml spec.version per
#817 lockstep):
bp-gitea 1.2.7 -> 1.2.8
bp-grafana 1.0.1 -> 1.0.2
bp-harbor 1.2.17 -> 1.2.18
bp-keycloak 1.4.5 -> 1.4.6
bp-newapi 1.4.22 -> 1.4.23
bp-openbao 1.2.16 -> 1.2.17
bp-powerdns 1.2.3 -> 1.2.4
bp-stalwart-tenant 0.1.2 -> 0.1.3
Refs TBD-A40.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A143 D29 walk on t31 caught the tenant.created Kafka consumer 403ing in
a 5s NAK-retry loop forever:
403 Forbidden: system:serviceaccount:sme:provisioning cannot create
resource "organizations" in API group "orgs.openova.io"
A29 PR #1860 shipped the Go consumer code that creates one Organization
CR per voucher checkout (D29 step 5) but did NOT bump the chart RBAC.
Step 5 fails -> steps 6/7/8 of the customer journey blocked.
Add to ClusterRole sme-provisioning:
- apiGroups: ["orgs.openova.io"]
resources: ["organizations"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
Bump chart 1.4.184 -> 1.4.185.
Validation per Principle #15 (real kubectl auth can-i against t31, not jq grep):
$ kubectl --kubeconfig=/tmp/t31-primary.kubeconfig auth can-i create \
organizations.orgs.openova.io --as=system:serviceaccount:sme:provisioning
Warning: resource 'organizations' is not namespace scoped in group 'orgs.openova.io'
yes
Same `yes` for get / list / watch / update / patch / delete. Pre-fix
baseline was `no`. The ClusterRole was applied via `helm template . |
yq 'select(.kind == "ClusterRole")' | kubectl apply -f -`, then can-i
was re-run to confirm.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1889 added 10 Hetzner-LB annotations to `Gateway/cilium-gateway`
`spec.infrastructure.annotations`. The Gateway-API CRD declares
`maxProperties: 8` on that field, so Flux SSA rejected the manifest:
spec.infrastructure.annotations: Too many: 10: must have at most 8 items
→ Gateway never reconciled → cilium-gateway-cilium-gateway Service stayed
ClusterIP → no Hetzner LB at the Service layer → public TLS at
console.<fqdn>:443 reset at the handshake. Blocked t28/t29/t30 since
2026-05-19 00:50:35Z.
Fix (Option A per A130): drop the two health-check timing annotations
(health-check-interval, health-check-timeout). hcloud-CCM defaults match
the values we were declaring (15s / 10s) so runtime health-check
behaviour is unchanged. The remaining 8 annotations are the minimum set
required to materialise a public-IP TCP-health-checked Hetzner LB on the
correct location/type with the correct backend port.
Validated with `kubectl apply --dry-run=server` against the mothership
cluster (Principle #15 — IaC evaluator over text grep) before merge.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1892 (TBD-A32 listener wildcard depth) was admin-merged with
"verified via Python jsonencode() simulation" — but tofu HCL's
type-unification rule rejected the ternary at plan-time. Every new
prov failed at 23s. A128 hotfix (#1894) shipped with REAL tofu
validate evidence.
Codify the rule: for .tf/.tftpl use tofu validate / tofu plan; for
Helm use helm template piped to kubectl apply --dry-run=server; for
manifests use --dry-run=server (not client). Python json.dumps and
jq greps are theater — they accept structurally-different shapes
the IaC evaluator rejects.
Refs PR #1892, PR #1894 (A128 hotfix).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
PR #1892 (TBD-A32 fix for shared-zone collision) introduced an HCL
"Inconsistent conditional result types" error at infra/hetzner/main.tf
line 468. Every fresh prov failed at tofu plan in 23s, e.g. A127 t29
attempt (deployment 4afd9ebceea92547) at 2026-05-19 01:08:41Z.
Root cause: `local.per_prov_listeners` was defined as
local.parent_domains_includes_sovereign_fqdn ? [] : [HTTPS_obj, HTTP_obj]
HCL/tofu cannot unify the conditional arms: the true arm is `tuple([])`
(length 0) and the false arm is `tuple([obj_with_tls, obj_without_tls])`
(length 2). Even moving the conditional to the consumer line in
`concat()` did not fix it — the same length-0 vs length-2 tuple
unification still fails.
Fix: emit `per_prov_listeners` unconditionally as the 2-element tuple,
then suppress it at the `concat()` consumer with a for-iteration filter
[for l in local.per_prov_listeners : l if !<collides>]
which always produces a list (length 0 or 2 — same element type), so HCL
never needs to unify two tuple types.
Validated locally with OpenTofu v1.8.5 against a minimal tfvars fixture:
- `tofu validate` → "Success! The configuration is valid."
- `tofu console` with sovereign_fqdn="t29.omani.works", parent="omani.works":
emits 4 listeners (parent https/http for *.omani.works + per-prov
https-t29-omani-works/http-t29-omani-works for *.t29.omani.works) —
matches PR #1892's intent.
- `tofu console` with sovereign_fqdn="omani.works" (collision):
emits 2 listeners (only parent https/http) — collision guard preserved.
No chart bump; this is a tofu-only change. Re-closes #1886 after #1892
re-opened it via the type-mismatch regression.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
D21 (owner UserAccess CR) was previously only seeded by
auth_handover.go::seedOwnerUserAccess after a live PIN-login. The
zero-touch convergence verifier cannot drive a PIN-login from CI, so
D21 stayed RED on every fresh prov until an operator manually
authenticated — even though SOVEREIGN_FQDN + OPERATOR_EMAIL + the
UserAccess CRD are all stable on the chroot from bake-time onward.
This slice adds a bake-time goroutine in main() that calls the
existing handler.EnsureOwnerUserAccess against the in-cluster
dynamic client when:
- the dynamic client is non-nil (in-cluster mode),
- SOVEREIGN_FQDN env is set (chroot mode), and
- OPERATOR_EMAIL env is set (orgEmail stamped via sovereign-fqdn
ConfigMap).
Capped backoff (0/5/10/20/40s) tolerates the UserAccess CRD rolling
out after this Pod starts. Idempotent: EnsureOwnerUserAccess folds AlreadyExists to
nil, so the existing handover-fired path still works without
regression. Each skip / converged / error path logs at Info or Warn
so an operator can confirm bake-time seeding from stdout without
scraping the CR.
Tests in cmd/api/main_test.go cover the happy path, all three skip
branches (nil client, empty SOVEREIGN_FQDN, empty OPERATOR_EMAIL),
and an idempotent re-run simulating Pod restart.
Refs A116 diagnostic; supersedes the handover-only seed path for
zero-touch verification.
Closes #1891
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Cilium Gateway template emits `hostname: *.<parent-zone>` listeners
(e.g. `*.omani.works`). Per Gateway-API spec wildcard semantics that
matches EXACTLY one label depth, so `foo.omani.works` matches but
`console.t28.omani.works` does NOT. On every shared-parent-zone topology
(every per-prov Sovereign under omani.works) the operator-facing FQDN
is 2-label-deep — `curl -skI https://console.t28.omani.works/` reset at
TLS handshake even though `sovereign-wildcard-tls-t28-omani-works`
already contained all 13 per-prov SANs.
Fix: locals.per_prov_listeners in infra/hetzner/main.tf appends an extra
listener pair hostnamed `*.<sovereign_fqdn>` bound to the per-prov cert
`sovereign-wildcard-tls-<fqdn-dashed>` rendered by
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml. Skipped when
sovereign_fqdn equals one of the declared parent-zone names (legacy
single-zone-on-apex case) so no duplicate listener-name Conflict.
Verified by simulated jsonencode against three scenarios:
1. t28 multi-zone (sovereign_fqdn=t28.omani.works, parent_domains=
[omani.works, omani.homes]) — emits 6 listeners:
https-omani-works hostname=*.omani.works cert=sovereign-wildcard-tls-omani-works
http-omani-works hostname=*.omani.works
https-omani-homes hostname=*.omani.homes cert=sovereign-wildcard-tls-omani-homes
http-omani-homes hostname=*.omani.homes
https-t28-omani-works hostname=*.t28.omani.works cert=sovereign-wildcard-tls-t28-omani-works
http-t28-omani-works hostname=*.t28.omani.works
2. t28 single parent zone (sovereign_fqdn=t28.omani.works,
parent_domains=[omani.works]) — emits 4 listeners (bare `https`/`http`
for backward-compat with legacy sectionName HTTPRoutes + per-prov
`https-t28-omani-works`/`http-t28-omani-works`).
3. Legacy apex (sovereign_fqdn=omani.works, parent_domains=
[omani.works]) — collision guard active, emits only bare `https`/`http`.
All scenarios produce unique listener names.
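The three scenarios can be restated as a small Go sketch. It is a re-expression of the naming rules listed above, not the tofu code in infra/hetzner/main.tf: buildListeners, Listener and dashed are hypothetical names, and certificate references are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Listener holds only the two fields the scenarios above compare.
type Listener struct {
	Name     string
	Hostname string
}

// dashed turns an FQDN into the suffix used in listener names.
func dashed(fqdn string) string { return strings.ReplaceAll(fqdn, ".", "-") }

// buildListeners emits one https/http pair per parent domain, then a
// per-prov pair unless the sovereign FQDN is itself a parent domain.
func buildListeners(sovereignFQDN string, parentDomains []string) []Listener {
	var out []Listener
	collides := false
	for _, p := range parentDomains {
		suffix := "" // single parent zone keeps bare https/http names
		if len(parentDomains) > 1 {
			suffix = "-" + dashed(p)
		}
		out = append(out,
			Listener{Name: "https" + suffix, Hostname: "*." + p},
			Listener{Name: "http" + suffix, Hostname: "*." + p},
		)
		if p == sovereignFQDN {
			collides = true // legacy apex case: skip the per-prov pair
		}
	}
	if !collides {
		d := dashed(sovereignFQDN)
		out = append(out,
			Listener{Name: "https-" + d, Hostname: "*." + sovereignFQDN},
			Listener{Name: "http-" + d, Hostname: "*." + sovereignFQDN},
		)
	}
	return out
}

func main() {
	fmt.Println(len(buildListeners("t28.omani.works", []string{"omani.works", "omani.homes"}))) // 6
	fmt.Println(len(buildListeners("t28.omani.works", []string{"omani.works"})))                // 4
	fmt.Println(len(buildListeners("omani.works", []string{"omani.works"})))                    // 2
}
```

A sketch like this checks the intended naming only. It says nothing about whether the HCL type-checks, which needs `tofu validate`.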
Safe because every catalyst-system HTTPRoute now omits sectionName
(PR #1888 closing #1884) — Cilium attaches via hostname match, so the
per-prov 2-label listener catches `console.<fqdn>` / `api.<fqdn>` /
`marketplace.<fqdn>` / etc.
Refs A110 t28 scorecard, A107 D29 walk.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Sovereign's Cilium Gateway listener `https-<parent-zone>` referenced
the parent-zone wildcard Secret `sovereign-wildcard-tls-<sanitised(parent)>`
(e.g. `sovereign-wildcard-tls-omani-works` for `*.omani.works`). That cert
is minted by `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml`
and SHARES Let's Encrypt's "5 New Certificates per Exact Set of Identifiers
per 168h" bucket with every other Sovereign on the same parent zone. After
~5 wipe+reprov cycles on `omani.works` the listener pinned to a
`Ready=False` Certificate (cert-manager spun the order forever, LE returned
`urn:ietf:params:acme:error:rateLimited`). A107 t28 evidence: per-prov cert
`sovereign-wildcard-tls-t28-omani-works` IS `Ready=True` but unused.
Fix (two parts):
1. `infra/hetzner/main.tf` — `parent_domains_listeners_yaml` now points
each listener's `tls.certificateRefs[0].name` at the PER-PROV cert
`sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}` (rendered by
`clusters/_template/sovereign-tls/cilium-gateway-cert.yaml` with the
explicit SAN list `[console.<sovereign-fqdn>, auth.<sovereign-fqdn>,
..., sandbox.<sovereign-fqdn>]`). Per-prov identifier sets get their
own 5/168h bucket per Sovereign so reprovs never share LE budget.
New `local.sovereign_fqdn_dashed = replace(var.sovereign_fqdn, ".",
"-")` is the SAME suffix `cilium-gateway-cert.yaml` /
`cilium-envoy-tls-restart-job.yaml` already use, so the listener +
cert + restart-job RBAC stay in lockstep.
2. `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` --
skip-render unconditionally (`{{- if false }}` wrap around the
`wildcardCert.enabled` guard). The parent-zone wildcards it minted
are no longer referenced by anything and burn LE budget on every
install. Template body kept for `git blame` / future revival under
issue #831 (multi-listener per-zone tenant TLS with non-wildcard SAN
lists). Removes 2 Certificate resources per multi-zone Sovereign.
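Why per-prov SAN lists get separate rate-limit buckets can be illustrated with a short Go sketch. It assumes the "exact set of identifiers" is compared without regard to order or case. identifierSetKey and perProvSANs are hypothetical helpers, and only the three labels named in the text (console, auth, sandbox) are used; the real SAN list is longer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// identifierSetKey normalises DNS names into an order- and
// case-insensitive key (assumed bucket identity, illustration only).
func identifierSetKey(names []string) string {
	norm := make([]string, 0, len(names))
	for _, n := range names {
		norm = append(norm, strings.ToLower(n))
	}
	sort.Strings(norm)
	return strings.Join(norm, ",")
}

// perProvSANs builds an explicit SAN list under one sovereign FQDN.
func perProvSANs(fqdn string) []string {
	labels := []string{"console", "auth", "sandbox"} // subset named in the text
	sans := make([]string, 0, len(labels))
	for _, l := range labels {
		sans = append(sans, l+"."+fqdn)
	}
	return sans
}

func main() {
	// Every Sovereign requesting *.omani.works lands in one shared bucket.
	shared := identifierSetKey([]string{"*.omani.works"})
	fmt.Println(shared == identifierSetKey([]string{"*.OMANI.works"})) // true

	// Per-prov SAN lists differ per Sovereign, so the buckets differ too.
	t28 := identifierSetKey(perProvSANs("t28.omani.works"))
	t29 := identifierSetKey(perProvSANs("t29.omani.works"))
	fmt.Println(t28 == t29) // false
}
```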
Verification (helm template):
helm template products/catalyst/chart \
--set parentZones[0].name=omani.works --set parentZones[0].role=primary \
--set parentZones[1].name=omani.homes --set parentZones[1].role=sme-pool \
--set global.sovereignFQDN=t28.omani.works \
--set wildcardCert.enabled=true \
| grep -c 'sovereign-wildcard-cert'
# before: 2 (two parent-zone Certificates rendered)
# after: 0 (zero -- template skip-renders)
Chart bumped 1.4.182 -> 1.4.183 so the next Blueprint Release republishes
the OCI artifact with the skip-render change.
Hostname semantics unchanged: listener `hostname: *.<parent-zone>` still
matches any FQDN under the parent; cilium-envoy SNI dispatch serves the
per-prov cert whose SAN list covers the requested hostname (operator's
console/auth/gitea/etc. subdomains under `<sovereign-fqdn>`). Tenant
URLs under non-primary parent zones (`wp-foo.omani.homes`) remain out
of scope for A29; those need explicit per-tenant cert wiring via #831.
Closes #1883
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>