Commit Graph

2648 Commits

Author SHA1 Message Date
e3mrah
177b4d74de
fix(bp-catalyst-platform): scope console HTTPRoute to console.<fqdn>, free auth.<fqdn> for Keycloak (#1925)
TBD-A42 (issue #1905): the `tenant-wildcard` HTTPRoute in
products/catalyst/chart/templates/sme-services/marketplace-routes.yaml
claimed `*.<global.sovereignFQDN>` and routed every match to
sme/console:8080. On Cilium Gateway, the wildcard route shadowed
exact-match platform HTTPRoutes (auth.<sov> -> keycloak, console.<sov> ->
catalyst-ui, api.<sov> -> catalyst-api, pdns.<sov> -> powerdns,
grafana.<sov> -> grafana, etc.) even though Gateway API spec section
5.2.1 says exact wins over wildcard. Admission-order-dependent
precedence on t31 meant `auth.t31.omani.works` returned 4836B Astro
HTML (SME console SPA) instead of Keycloak's login page, blocking D4
SSO PIN-bounce (#1807). Same precedence-collision family as
A30/A40/A32.
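
The intended precedence (exact hostname beats wildcard) can be sketched as below. This is an illustration only, not Cilium's implementation; the `route` type and `pickBackend` function are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a hypothetical, simplified HTTPRoute: one hostname, one backend.
type route struct {
	Hostname string
	Backend  string
}

// pickBackend returns the backend of the most specific route matching host:
// an exact hostname always wins; otherwise the longest wildcard suffix wins.
// It returns "" when nothing matches.
func pickBackend(routes []route, host string) string {
	best, bestLen := "", -1
	for _, r := range routes {
		if r.Hostname == host {
			return r.Backend // exact match wins regardless of route order
		}
		if strings.HasPrefix(r.Hostname, "*.") {
			suffix := r.Hostname[1:] // e.g. ".t31.omani.works"
			if strings.HasSuffix(host, suffix) && len(host) > len(suffix) && len(suffix) > bestLen {
				best, bestLen = r.Backend, len(suffix)
			}
		}
	}
	return best
}

func main() {
	routes := []route{
		{Hostname: "*.t31.omani.works", Backend: "sme/console"},
		{Hostname: "auth.t31.omani.works", Backend: "keycloak"},
	}
	fmt.Println(pickBackend(routes, "auth.t31.omani.works"))
	fmt.Println(pickBackend(routes, "demo.t31.omani.works"))
}
```

With the wildcard listed first, the exact `auth` route is still selected, which is the behaviour the live t31 gateway did not deliver.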

Fix: replace the single `tenant-wildcard` HTTPRoute with N explicit
per-slug HTTPRoutes named `tenant-<slug>` with hostname
`<slug>.<global.sovereignFQDN>` EXACT - no wildcard, no shadowing
possible by construction. Slug list comes from a new operator-supplied
`ingress.marketplace.tenantSlugs[]` value, default empty list. With
the default, ZERO catch-all routes are emitted, so platform subdomains
(auth/console/api/...) can NEVER be hijacked.
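
A minimal sketch of the per-slug expansion, written in Go rather than the chart's Helm template. `tenantRoute` and `tenantRoutes` are hypothetical names; only the `tenant-<slug>` naming and the `<slug>.<global.sovereignFQDN>` hostname come from the change itself.

```go
package main

import "fmt"

// tenantRoute is a hypothetical, simplified stand-in for one rendered HTTPRoute.
type tenantRoute struct {
	Name     string
	Hostname string
}

// tenantRoutes returns one exact-hostname route per slug. An empty slug list
// yields zero routes, so nothing is left that could catch platform subdomains.
func tenantRoutes(sovereignFQDN string, slugs []string) []tenantRoute {
	routes := make([]tenantRoute, 0, len(slugs))
	for _, slug := range slugs {
		routes = append(routes, tenantRoute{
			Name:     "tenant-" + slug,
			Hostname: slug + "." + sovereignFQDN,
		})
	}
	return routes
}

func main() {
	for _, r := range tenantRoutes("t32.omani.works", []string{"demo"}) {
		fmt.Println(r.Name, r.Hostname)
	}
	// Default (empty) slug list: no routes at all.
	fmt.Println(len(tenantRoutes("t32.omani.works", nil)))
}
```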

Per-tenant routes for Orgs created post-provision continue to be
written live by the organization-controller (templates/sme-services/
tenant-public-routes.yaml emits the byte-identical chart-side
analogue), so the SaaS-tenant traffic path is unchanged for any Org
the controller knows about.

marketplace-reference-grant.yaml already covers catalyst-system ->
sme/console - every new `tenant-<slug>` HTTPRoute is in
catalyst-system pointing at sme/console, so no grant change is needed.
Comment updated to note the wildcard->per-slug refactor.

Verified on t32 2026-05-19:
  helm template ... --set ingress.marketplace.tenantSlugs={demo} \
    | kubectl apply --dry-run=server
  -> marketplace HTTPRoute configured + tenant-demo HTTPRoute created
  Before fix the same template emitted `tenant-wildcard` with
  `hostnames: ["*.t32.omani.works"]`; after fix, no catch-all is
  rendered and `auth.t32.omani.works` is matched only by Keycloak's
  exact-match HTTPRoute.

Files changed:
- products/catalyst/chart/templates/sme-services/marketplace-routes.yaml
- products/catalyst/chart/values.yaml
- products/catalyst/chart/templates/sme-services/marketplace-reference-grant.yaml
- products/catalyst/chart/Chart.yaml (1.4.189 -> 1.4.190)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml (pin bump)

Closes #1905
Closes #1807

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:12:53 +04:00
github-actions[bot]
9539c03b59 deploy: update catalyst images to 174fc70 2026-05-19 07:09:18 +00:00
e3mrah
174fc703b1
fix(catalyst-cnp): add sme + newapi NS to baseline-default-deny egress (TBD-A43) (#1923)
PR #1912 did not actually fix the D29 customer-journey blocker. It was titled
"fix catalyst-system → sme/newapi egress" but only added world TCP/6443
and never extended `.Values.security.baselineCnp.allowedPlatformNamespaces`.
t32 fresh-prov walk (af1da1e7, 2026-05-19) confirmed the live CNP still
listed only [keycloak gitea powerdns cnpg-system openbao harbor nats-system
loki mimir tempo alloy opentelemetry external-secrets-system cert-manager].

Console → `gateway.sme.svc:8080` returned 503 `context deadline exceeded`.

Fix: append `sme` + `newapi` to the values default, extend
`tests/baseline-cnp-allowlist.sh` with Cases 5c + 5d so any future
narrowing fails Blueprint Release CI before the OCI artifact ships, bump
Chart.yaml 1.4.188 → 1.4.189, bump bootstrap-kit pin 1.4.188 → 1.4.189.
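
The kind of guard the new test cases add can be sketched in Go as follows. The real check lives in the shell script `tests/baseline-cnp-allowlist.sh`; `missingNamespaces` and the variable names here are hypothetical, and the namespace list is the one quoted above.

```go
package main

import "fmt"

// baselineBefore is the allow-list observed on the live CNP before this fix.
var baselineBefore = []string{
	"keycloak", "gitea", "powerdns", "cnpg-system", "openbao", "harbor",
	"nats-system", "loki", "mimir", "tempo", "alloy", "opentelemetry",
	"external-secrets-system", "cert-manager",
}

// missingNamespaces reports which required namespaces are absent from allowed.
func missingNamespaces(allowed, required []string) []string {
	have := make(map[string]bool, len(allowed))
	for _, ns := range allowed {
		have[ns] = true
	}
	var missing []string
	for _, ns := range required {
		if !have[ns] {
			missing = append(missing, ns)
		}
	}
	return missing
}

func main() {
	required := []string{"sme", "newapi"}
	fmt.Println(missingNamespaces(baselineBefore, required))
	after := append(append([]string{}, baselineBefore...), "sme", "newapi")
	fmt.Println(missingNamespaces(after, required))
}
```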

15/15 chart-tests green (was 13). kubectl --dry-run=server validation passes.

Closes #1920
Refs #1912

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:07:05 +04:00
hatiyildiz
18f7801759 deploy(bp-newapi): bump bootstrap-kit pin 1.4.27 -> 1.4.28 (auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version 1.4.27 -> 1.4.28 (Refs TBD-A20, #1856).
2026-05-19 06:58:22 +00:00
github-actions[bot]
ecb0974704 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.28 2026-05-19 06:57:45 +00:00
e3mrah
472e8c69f9
fix(bp-newapi): consume CNPG-managed app secret instead of stale DSN (TBD-A39, Closes #1834) (#1921)
* fix(bp-newapi): consume CNPG-managed app secret via sync-job (TBD-A39, Closes #1834)

D34 close-audit on t32 (2026-05-19) found newapi-bp-newapi in 21x
CrashLoopBackOff with `SASL auth: FATAL: password authentication failed
for user "newapi"`. Public probe to `newapi.t32.omani.works` returned
envoy 503 "no healthy upstream".

Root cause: chart's templates/cnpg-cluster.yaml rendered the DSN Secret
via Helm `lookup "v1" "Secret" .Release.Namespace <cluster>-app` at
template time. On every freshly-franchised Sovereign CNPG materialises
the `<cluster>-app` source Secret only AFTER bp-newapi's HelmRelease
applies, so the first render's lookup returns nil and the chart commits
the Secret with an empty password — literally
`postgres://newapi:@newapi-bp-newapi-newapi-pg-rw.../newapi?sslmode=require`.
The Secret carries `helm.sh/resource-policy: keep`, so Flux NEVER
overwrites the empty bytes on subsequent reconciles even after CNPG
populates the source. The chart's own header comment claims "the
1-minute Flux reconcile picks it up on the next tick" — verified false
in production; `resource-policy: keep` pins the empty bytes.

Fix:
- platform/newapi/chart/templates/cnpg-cluster.yaml: drop the Helm
  `lookup` + DSN composition. The DSN Secret renders as a chart-managed
  empty placeholder so kubelet can satisfy the Deployment's secretKeyRef
  on first schedule (kubelet only checks the key EXISTS).
- platform/newapi/chart/templates/database-secret-sync-job.yaml (NEW):
  Helm post-install/post-upgrade Job + ServiceAccount + Role + Binding.
  The Job polls `<cluster>-app` (up to 10 min via curl + in-pod SA
  token), reads the `password` bytes, composes the canonical
  `postgres://<user>:<password>@<host>:5432/<db>?sslmode=<mode>` string,
  and strategic-merge PATCHes it into the placeholder. Idempotent.
- platform/newapi/chart/Chart.yaml: version 1.4.26 → 1.4.27 with full
  changelog block.
- clusters/_template/bootstrap-kit/80-newapi.yaml: bp-newapi pin
  1.4.26 → 1.4.27.
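
For illustration, the DSN composition step can be sketched in Go. The Job itself does this with curl inside the pod; `composeDSN` is a hypothetical helper, and the host in `main` is a made-up example. The empty-password guard reflects the failure mode described above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// composeDSN builds postgres://<user>:<password>@<host>:5432/<db>?sslmode=<mode>.
// It refuses an empty password, which is the shape the stale Secret had.
func composeDSN(user, password, host, db, sslmode string) (string, error) {
	if password == "" {
		return "", errors.New("empty password: source Secret not populated yet")
	}
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password), // percent-encodes reserved characters
		Host:     net.JoinHostPort(host, "5432"),
		Path:     "/" + db,
		RawQuery: url.Values{"sslmode": {sslmode}}.Encode(),
	}
	return u.String(), nil
}

func main() {
	// The host below is a hypothetical example, not the real service name.
	dsn, err := composeDSN("newapi", "s3cret", "pg-rw.newapi.svc", "newapi", "require")
	fmt.Println(dsn, err)
	_, err = composeDSN("newapi", "", "pg-rw.newapi.svc", "newapi", "require")
	fmt.Println(err)
}
```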

Pattern lifted from platform/gitea/chart/templates/database-secret-
sync-job.yaml (canonical seam — issue #830 Bug 2, proven on otech30)
and platform/wordpress-tenant/chart/templates/database-secret-sync-
job.yaml (issue #1786, proven on t26).

Validation:
- `helm dep update && helm template newapi .` renders cleanly with
  the placeholder Secret + Job + SA + Role + RoleBinding.
- `kubectl apply --dry-run=server` against t32 apiserver accepts all
  11 rendered objects (server dry run).

Refs: TBD-A39
Closes: #1834

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-newapi): bump blueprint.yaml lockstep version to 1.4.27

Sync platform/newapi/blueprint.yaml spec.version with the Chart.yaml
bump in the preceding commit. TestBootstrapKit_BlueprintVersionLockstep
Sweep enforces these two stay aligned (TBD-A20, #1856).

Refs: TBD-A39
Refs: #1834

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:57:20 +04:00
github-actions[bot]
7aa02a21b1 deploy: update catalyst images to bf577e9 2026-05-19 06:52:05 +00:00
e3mrah
bf577e9d7b
fix(bp-sme): allow egress from catalyst-system to gateway:8080 (TBD-A38, Closes #1917) (#1919)
The baseline-default-deny CiliumNetworkPolicy in catalyst-system listed
14 platform namespaces in its egress allow-list (keycloak, gitea,
powerdns, cnpg-system, openbao, harbor, nats-system, loki, mimir, tempo,
alloy, opentelemetry, external-secrets-system, cert-manager) but did NOT
include `sme`. The bp-sme-platform chart deploys the SME control-plane
into namespace `sme`, and console in catalyst-system reaches
`gateway.sme.svc.cluster.local:8080` for every voucher list / issue /
redeem call (plus admin reaches the same gateway for tenant onboarding).
Every such call was therefore dropped at the egress hook and timed out
at 5s, surfaced at the operator as 503 `context deadline exceeded` on
the voucher list / voucher issue panels.

Reproduction on t32 (2026-05-19, fresh prov, READ-ONLY):

  $ kubectl exec -n catalyst-system catalyst-api-59d5cf5644-wrg4x \
      -- curl -m 5 http://gateway.sme.svc.cluster.local:8080/healthz
  000 time=5.002937
  curl: (28) Connection timed out after 5002 milliseconds

Live CNP egress excerpt (kubectl get cnp -n catalyst-system
baseline-default-deny -o yaml | yq '.spec.egress[3]'):

  toEndpoints:
    - matchExpressions:
        - key: k8s:io.kubernetes.pod.namespace
          operator: In
          values:
            - keycloak  ... - cert-manager   # (no 'sme')

Fix: add `sme` to BOTH the values.yaml default
(`.Values.security.baselineCnp.allowedPlatformNamespaces`) AND the
template's `default (list ...)` fallback, so a Helm install with no
values overrides still renders the allow.

Originally masqueraded under #1748 (voucher list 503) and #1749 (voucher
issue 503) — those were thought to be services-build 502 regressions,
but this is a distinct CNP-misconfig bug class.

Validation:
- `helm template` confirms rendered CNP now lists `sme` in egress.
- `kubectl apply --dry-run=server` against t32 apiserver passes
  ("ciliumnetworkpolicy.cilium.io/baseline-default-deny configured").

Chart bumped 1.4.188 → 1.4.189; bootstrap-kit pin bumped to match.
No live patching on t32 — fix verified via server-side dry-run only,
per Principle #15.

Closes #1917
Refs #1748
Refs #1749

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 10:49:47 +04:00
e3mrah
446da60ca4
feat(catalyst-api): publish catalyst.tenant.sandbox_requested on Sandbox create (#1918)
Adds a NATS-publish hook to HandleCreateSandboxSession so every
successful Sandbox CR Create emits a canonical
`catalyst.tenant.sandbox_requested` event. Sandbox-controller already
consumes this subject (core/controllers/sandbox/internal/controller/
nats_bridge.go) and tenant-service's SandboxOrchestrator publishes it
from the CRM side, but the catalyst-api FE-driven create path was
silently bypassing the audit stream — the symptom #1776 calls out.

Surface added:
  - TenantEvent payload {tenant_id, sandbox_id, requested_by,
    timestamp, spec_hash} matching the existing audit.Event field
    naming convention. spec_hash is SHA-256 over the canonical
    JSON-serialised .spec for drift detection.
  - TenantEventPublisher interface on the Handler (nil-tolerant: when
    unset the publish-side is a no-op so CI without CATALYST_NATS_URL
    still passes; production wiring binds a real publisher).
  - SetTenantEventPublisher setter mirroring SetAuditBus.
  - Constant SandboxRequestedSubject = "catalyst.tenant.sandbox_requested"
    so producer + consumer + tests share one symbol.
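
A minimal sketch of a deterministic spec hash, assuming the spec is available as a `map[string]any`. `specHash` and the spec field names are hypothetical; the handler's real canonicalisation may differ. The sketch relies on `encoding/json` emitting map keys in sorted order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// specHash returns the hex SHA-256 of the JSON encoding of spec.
// json.Marshal sorts map keys, so Go's random map iteration order
// does not affect the result.
func specHash(spec map[string]any) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Hypothetical spec fields, for illustration only.
	a := map[string]any{"image": "sandbox:1", "ttlMinutes": 30}
	b := map[string]any{"ttlMinutes": 30, "image": "sandbox:1"}
	ha, _ := specHash(a)
	hb, _ := specHash(b)
	fmt.Println(ha == hb)
}
```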

Wiring:
  - main.go: newTenantEventPublisherFromEnv placeholder identical in
    shape to newRBACAuditPublisherFromEnv. Returns nil today because
    catalyst-api ships without nats.go in go.mod; the real publisher
    lands in the same follow-up slice that swaps the RBAC stub.
    CATALYST_NATS_URL gates the wiring; CATALYST_TENANT_NATS_SUBJECT_
    PREFIX lets operators override the canonical prefix per
    INVIOLABLE-PRINCIPLES.md #4.

Tests (7 new in sandbox_sessions_nats_test.go):
  - PublishesSandboxRequested: happy-path — exactly one publish on the
    canonical subject with all fields populated.
  - NoPublisher_DoesNotFail: nil-tolerant — Sandbox Create still 201s
    when no publisher is wired (CI, chroot).
  - PublishError_DoesNotFailRequest: a NATS outage logs + continues;
    the HTTP response stays 201 since the CR write already succeeded.
  - PublishUsesNamespaceWhenOrgEmpty: single-tenant chroot fallback —
    tenant_id falls back to the namespace (NOT the orgSlug, which
    collapses to "default" and would conflate every chroot).
  - PublishUsesSubWhenEmailEmpty: requested_by falls back to claims.Sub
    so the field is never blank.
  - SpecHash_DeterministicAcrossMapOrder: spec_hash stable across map
    iteration; changes when spec changes.
  - Subject_MatchesIssueContract: pins the exact subject string per
    #1776 against accidental drift.

Sandbox-controller's consumer list (nats_bridge.go) already includes
this subject — no controller-side change required.

Closes #1776

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 10:48:18 +04:00
e3mrah
f6334cd023
fix(bp-gitea+bp-harbor): shorten mirror interval to 5m for post-cutover freshness (TBD-A37, Closes #1899) (#1916)
* fix(bp-self-sovereign-cutover): post-cutover mirror re-sync CronJob (TBD-A37, Closes #1899)

Step-01 (gitea-mirror) only runs ONCE at cutover and produces a STANDALONE
local Gitea repo (PR #1029 — pull-mirror semantics block Step-06's
HelmRepository URL rewrite push). Without an ongoing re-sync, upstream
chart bumps merged AFTER cutover never reach the Sovereign.

Live regression on t31 2026-05-19 (A145 verifier): sandbox-controller
stuck at image :8017700 from 2026-05-16 even though PR #1862 had merged
2 days earlier with the NATS consume-leg — the upstream values.yaml
bump never crossed the seam.

This chart bump adds a gitea-mirror-resync CronJob (default schedule
"*/5 * * * *") that fires the same idempotent bare-clone + push
--mirror --force as Step-01 step (3) every 5 minutes. Pre-cutover
fires are no-ops (the script detects the local repo is missing /
empty and exits 0); post-cutover fires close the upstream → local
Gitea loop.

Why CronJob, not Gitea pull-mirror revival?
PR #1029 documented why Gitea pull-mirror was abandoned: pull-mirror
repos are read-only, blocking Step-06's HelmRepository URL rewrite
push. We need a writable local repo that ALSO refreshes from upstream
— the natural shape is a periodic force-push from a separate Job.

Why CronJob, not push-from-upstream webhook?
Slower to implement (requires GitHub App + webhook receiver on each
Sovereign + DNS for the webhook URL). Tracked as a future evolution
once stable; the CronJob is the minimal correct fix today.

Default 5m cadence covers the chart-bump → upstream-merge →
Sovereign-reconcile loop in ~10 min end-to-end while staying well
under GitHub anonymous-clone rate limits (300 req/hr per IP; one
Sovereign = 12 clones/hr). Per-Sovereign overlay knobs:
  .Values.mirrorResync.schedule          (cron string)
  .Values.mirrorResync.suspended         (bool, default false)
  .Values.mirrorResync.jobTimeoutSeconds (default 900)

No new RBAC — the CronJob re-uses the existing cutover runner SA
and the reflector-mirrored gitea-admin-secret that Step-01 already
mounts. concurrencyPolicy: Forbid + startingDeadlineSeconds: 60
keep parallel runs / replay storms harmless.

Verification:
- helm template test . renders cleanly (2509 lines, +52 from 0.1.32)
- tests/cutover-contract.sh all 20 gates GREEN (CronJob doesn't carry
  the cutover-step labels so the "exactly 9 step ConfigMaps" assertion
  still passes)
- scripts/check-bootstrap-kit-pin-sync.sh PASS (50 chart→pin pairs)

Chart 0.1.32 → 0.1.33; bootstrap-kit pin in
clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml
bumped to match.

Closes #1899

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-self-sovereign-cutover): bump blueprint.yaml lockstep to 0.1.33

TBD-A20 BlueprintVersionLockstepSweep CI gate caught the missing
blueprint.yaml bump on PR #1916 (the chart Chart.yaml was bumped to
0.1.33 but blueprint.yaml still pinned 0.1.32). Bringing the two in
lockstep so the test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:42:11 +04:00
hatiyildiz
ba4c2687f5 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.187 -> 1.4.188 (auto, Refs TBD-A6) 2026-05-19 06:40:15 +00:00
github-actions[bot]
1bb2e4b481 deploy: update sme service images to cbfb3ad + bump chart to 1.4.188 2026-05-19 06:39:37 +00:00
e3mrah
84ebcbeacf
fix(catalyst-chart): propagate SMTP_USER/SMTP_PASS into notification Pod (TBD-X1) (#1915)
* fix(catalyst-chart): propagate SMTP_USER/SMTP_PASS into notification Pod (TBD-X1, Refs #1793)

Wave 35 SMTP diagnostic root cause: notification.yaml only mounted
SMTP_HOST / SMTP_PORT / SMTP_FROM from sme-secrets, so the Go net/smtp
client dialed Stalwart without authentication. Stalwart's submission
listener rejected every message with 503 5.5.1 "You must authenticate
first" -> the (pre-companion-PR) fixed-60s retry storm slammed the
relay 3x per message x 5 tenants and tripped Stalwart's
[5 requests, 1000ms] rate-limiter for every tenant on the same relay.

The fix brings notification.yaml into line with auth.yaml, which has
consumed SMTP_USER and SMTP_PASS from sme-secrets since chart 1.4.20
(issue #934). notification.yaml was overlooked in that change-set.

The canonical SMTP-credentials propagation chain is already in place
and unchanged here:

  mothership catalyst-openova-kc-credentials (key: smtp-user/smtp-pass)
    -> sovereign_smtp_seed.go SeedSovereignSMTPCredentials
       creates catalyst-system/sovereign-smtp-credentials on the new
       Sovereign (Phase-1, idempotent)
    -> sme-secrets.yaml lookup with source-wins precedence reads
       smtp-user / smtp-pass and emits SMTP_USER / SMTP_PASS keys in
       the per-tenant sme-secrets Secret
    -> auth.yaml AND (now, this PR) notification.yaml mount those
       two keys via secretKeyRef -> services-notification main.go reads
       SMTP_USER + SMTP_PASS via getEnv() -> buildAuth wires
       smtp.PlainAuth on every Send (companion PR services-notification
       smtp.go).

Chart version bump 1.4.186 -> 1.4.187 per chart-release discipline.

helm template test-render products/catalyst/chart \
  --set ingress.marketplace.enabled=true | grep SMTP_USER -A2
... shows both auth.yaml AND notification.yaml mount SMTP_USER from
sme-secrets keyed SMTP_USER (verified).

Companion PR: services-notification smtp.go upgrade to exponential
backoff + 3-in-90s circuit breaker so a future credential gap surfaces
loudly via ErrCircuitOpen and never restarts a rate-limiter storm.

Refs #1793

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-kit): bump bp-catalyst-platform pin 1.4.186 -> 1.4.187 (TBD-X1, Refs #1793)

Chart bump in the previous commit changed Chart.yaml version:
1.4.186 -> 1.4.187 (TBD-X1 SMTP_USER/SMTP_PASS wiring). The
pin-sync-audit CI step caught the lockstep drift -- bootstrap-kit
HelmRelease.spec.chart.spec.version MUST match the chart's
Chart.yaml version exactly (see clusters/_template/bootstrap-kit/
13-bp-catalyst-platform.yaml header comment + feedback_21_principles).

Refs #1793

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:38:29 +04:00
e3mrah
cbfb3adfbe
fix(notification): exponential backoff + circuit breaker on 503 5.5.1 SMTP rate-limit (TBD-X1, Refs #1793) (#1914)
Wave 35 SMTP diagnostic root cause: sme-secrets lost SMTP_USERNAME /
SMTP_PASSWORD after sme stack redeploy. Notification pod's net/smtp
falls back to no-auth (Mailer.Auth was always nil, and main.go never
read SMTP_USER/SMTP_PASS from env) -> Stalwart returns 503 5.5.1 "You
must authenticate first" -> the prior fixed-60s retry loop slammed the
relay 3x per message x 5 tenants and tripped Stalwart's
[5 requests, 1000ms] rate-limiter for the whole submission listener.

This PR fixes the retry behaviour and surfaces auth state loudly:

1. Mailer.Auth is now wired via smtp.PlainAuth(SMTP_USER, SMTP_PASS, host),
   read from env in NewMailer. When only one of the two is set, or
   neither, NewMailer logs a slog.Warn and falls back to no-auth, so the
   next 503 5.5.1 is a LOUD error rather than silently half-broken creds.

2. Retry backoff is now exponential with a 30s floor (per issue spec
   TBD-X1) and a 5-minute cap: 30s -> 60s -> 120s -> 240s -> 300s
   (cap). Replaces the prior fixed 60s wait.

3. Circuit breaker (issue spec): 3 consecutive 503 5.5.1 responses
   inside a 90s sliding window open the breaker. While open, Send()
   short-circuits to ErrCircuitOpen for 120s cooldown -> the
   notification consumer NACKs / dead-letters instead of slamming a
   known-rate-limited relay. Window-aging means slow drips never
   trip; a single 250 OK between storms resets the consecutive
   counter via breakerResetOnSuccess.
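
The backoff schedule and breaker rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in smtp.go: the type and function names (`breaker`, `backoff`, `at`) are hypothetical, and "consecutive" is approximated by hits inside the window with a reset on success.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseBackoff = 30 * time.Second
	maxBackoff  = 5 * time.Minute
)

// backoff doubles from the 30s floor per attempt and clamps at the 5-minute cap.
func backoff(attempt int) time.Duration {
	d := baseBackoff
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

// breaker opens after `trip` rate-limit hits inside `window` and stays open
// for `cooldown`.
type breaker struct {
	trip      int
	window    time.Duration
	cooldown  time.Duration
	hits      []time.Time
	openUntil time.Time
}

func newBreaker() *breaker {
	return &breaker{trip: 3, window: 90 * time.Second, cooldown: 120 * time.Second}
}

// allow reports whether a send may proceed at time now.
func (b *breaker) allow(now time.Time) bool { return !now.Before(b.openUntil) }

// recordRateLimit notes a 503 5.5.1 at now, ages out old hits, and trips if needed.
func (b *breaker) recordRateLimit(now time.Time) {
	kept := b.hits[:0]
	for _, h := range b.hits {
		if now.Sub(h) < b.window {
			kept = append(kept, h)
		}
	}
	b.hits = append(kept, now)
	if len(b.hits) >= b.trip {
		b.openUntil = now.Add(b.cooldown)
		b.hits = b.hits[:0]
	}
}

// recordSuccess resets the hit counter, as a 250 OK does.
func (b *breaker) recordSuccess() { b.hits = b.hits[:0] }

// at is a demo helper returning a fixed fake clock reading.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(backoff(i))
	}
	b := newBreaker()
	for _, s := range []int{0, 10, 20} {
		b.recordRateLimit(at(s))
	}
	fmt.Println(b.allow(at(21)), b.allow(at(140)))
}
```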

All paths are test-seamed (sendMail / sleep / now). Tests cover:
- single-retry success keeps base backoff
- exponential doubling 30s -> 60s
- MaxBackoff cap on long storms
- breaker trips at exactly trip-th hit and aborts the in-flight retry
- short-circuit on subsequent Send while open
- cooldown elapses -> breaker re-closes via fakeNow advance
- slow-drip 503s age out of window and never trip
- non-rate-limit errors still pass through immediately (no retry)
- env-var parsing 30s floor preserved
- buildAuth half-config / both / neither matrix

go test ./core/services/notification/...: ok

Deployment-side wiring (the notification.yaml chart template gaining
SMTP_USER + SMTP_PASS env from sme-secrets) ships in a separate PR.

Refs #1793

Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:38:22 +04:00
github-actions[bot]
298d404632 deploy: update catalyst images to 618273c 2026-05-19 04:40:37 +00:00
e3mrah
618273c484
fix(catalyst-api): bake-time top-up of canonical .omani.X sme-pool (TBD-A44, Closes #1907) (#1913)
PR #1861 widened LoadSMETenantParentDomainsFromEnv to seed all four
canonical .omani.X TLDs (homes, rest, trade, works), but on a real
Sovereign that env-stub fallback path is BYPASSED. The mothership
imports a full deployment record with only the operator-selected
sme-pool entry, and GET /api/v1/sovereign/parent-domains reads from
the imported record (dep.Request.ParentDomains), not the env stub.

Result on t31 (2026-05-19, c703247a0de12508): the on-disk record
holds 1 primary (omani.works) + 1 sme-pool (omani.homes) = 2 rows.
/parent-domains?role=sme-pool returns 1 entry instead of 4. A
customer picking .omani.rest or .omani.trade on the marketplace
/addons subdomain picker — both options the UI hard-codes — fails
SME tenant signup with 422 invalid-parent-domain.

Fix shape (same pattern as PR #1893 / D21 owner UserAccess
bake-time seed): on every chroot-mode catalyst-api startup AND on
every fresh handover import, top up Request.ParentDomains with any
missing canonical TLD as role=sme-pool. Idempotent (a re-run is a
no-op when the pool is already full); mothership mode (SOVEREIGN_FQDN
unset) is a hard no-op; persists to disk so a Pod restart sees the
topped-up shape.

Dedup is against existing role=sme-pool rows only — a role=primary
row on the same name does NOT count, because the customer-facing
/addons picker validates against role=sme-pool entries via
FindParentDomain. The t31 shape (primary=omani.works AND
sme-pool=omani.works needed) is the real-world case.
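
A sketch of the top-up and its dedup rule, assuming a simplified record shape. `ParentDomain`, `canonicalTLDs`, and `topUpSMEPool` are hypothetical names, not the identifiers in catalyst-api.

```go
package main

import "fmt"

// ParentDomain is a hypothetical, simplified row of Request.ParentDomains.
type ParentDomain struct {
	Name string
	Role string
}

// canonicalTLDs are the four canonical .omani.X parent domains.
var canonicalTLDs = []string{"omani.homes", "omani.rest", "omani.trade", "omani.works"}

// topUpSMEPool appends every canonical domain not already present with
// role=sme-pool. A role=primary row on the same name does not count.
// It returns the updated slice and the number of rows added.
func topUpSMEPool(domains []ParentDomain) ([]ParentDomain, int) {
	inPool := make(map[string]bool)
	for _, d := range domains {
		if d.Role == "sme-pool" {
			inPool[d.Name] = true
		}
	}
	added := 0
	for _, name := range canonicalTLDs {
		if !inPool[name] {
			domains = append(domains, ParentDomain{Name: name, Role: "sme-pool"})
			added++
		}
	}
	return domains, added
}

func main() {
	// The t31 shape: one primary plus one sme-pool row.
	t31 := []ParentDomain{
		{Name: "omani.works", Role: "primary"},
		{Name: "omani.homes", Role: "sme-pool"},
	}
	out, added := topUpSMEPool(t31)
	fmt.Println(len(out), added)
	_, again := topUpSMEPool(out)
	fmt.Println(again)
}
```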

Wired into two seams so a fresh prov AND a Pod restart both
converge: HandleDeploymentImport (post-import, fresh prov) and
restoreFromStore (per-record rehydration, Pod restart). Five guards
in chroot_parent_domains_seed_test.go: AllowedTLDs lockstep,
top-up shape (mirrors t31), idempotence, mothership no-op, nil-dep.

Drive-by: fixed a pre-existing build break in
sme_tenant_gitops.go's smeTenantBPKeycloak raw-string constant
(PR #1909 introduced literal backticks + a Go template action
inside a YAML comment; the action confused text/template at
render time → bp-keycloak.yaml render returned `unexpected EOF`).
Replaced with prose that describes the chart template behaviour
without inlining the template literal.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 08:38:24 +04:00
hatiyildiz
5d8a9c2a4f deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.26 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1) 2026-05-19 04:04:07 +00:00
hatiyildiz
a100f82d27 deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.25 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1) 2026-05-19 04:03:48 +00:00
hatiyildiz
d1bb5758da deploy(bp-openmeter): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1) 2026-05-19 04:03:39 +00:00
hatiyildiz
6d38089895 deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.19 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 2) 2026-05-19 04:03:38 +00:00
hatiyildiz
707563bc52 deploy(bp-matrix): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1) 2026-05-19 04:03:36 +00:00
github-actions[bot]
dee7703413 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.26 2026-05-19 04:03:27 +00:00
e3mrah
59980125ed
fix(networkpolicy): egress to CNPG data-plane Pods, not cnpg-system operator NS (TBD-A39, Closes #1901) (#1911)
The CNPG operator runs in the `cnpg-system` namespace, but the actual
Postgres workload Pods reconcile into the same namespace as the CNPG
`Cluster` CR — for the auto-provisioned-DB blueprints that's
`.Release.Namespace` (e.g. `newapi`, `harbor`). A NetworkPolicy egress
rule that namespace-selects on `cnpg-system` reaches the operator pods
only, NOT the Postgres workloads — every 5432 connection times out.

Verified live on t31: `newapi-bp-newapi-newapi-pg-1` runs in `newapi`
ns with label `cnpg.io/cluster=newapi-bp-newapi-newapi-pg`, while
`newapi-bp-newapi-…` is stuck 1/2 Ready with 20 restarts because its
egress NP allows 5432 only to `cnpg-system`.

Fix: every affected NP now selects the Postgres workload Pods by the
operator-emitted `cnpg.io/cluster=<clusterName>` Pod label — namespace-
agnostic, survives the operator namespace being different from the
data-plane namespace.

Charts fixed (4):

  - bp-newapi (1.4.22 → 1.4.23) — auto-provisions CNPG Cluster in
    `.Release.Namespace`. Removed the bogus `namespaceLabel: cnpg-system`
    egress entry from values.yaml; added a podSelector-based rule
    (cnpg.io/cluster=<release>-bp-newapi-newapi-pg) directly in the
    template, gated by `.Values.cnpg.enabled`.

  - bp-harbor (1.2.17 → 1.2.18) — Cluster CR in
    `postgres.cluster.namespace | default .Release.Namespace` (default
    `harbor`). Changed egress from namespaceSelector=cnpg to
    podSelector cnpg.io/cluster=<postgres.cluster.name|default harbor-pg>.

  - bp-matrix (1.0.0 → 1.0.1) — chart points at
    matrix-postgres-rw.matrix.svc.cluster.local (Cluster CR in
    `.Release.Namespace`). Replaced `cnpgNamespace` value with
    `cnpgClusterName` (default `matrix-postgres`) and switched egress
    rule to podSelector.

  - bp-openmeter (1.0.0 → 1.0.1) — operator-supplied CNPG endpoint
    pattern. Replaced `cnpgNamespace` with `cnpgClusterName` (default
    `openmeter-pg`) and switched egress rule to podSelector. Same
    pattern as matrix.

Audited and clean:

  - bp-cnpg-pair: already uses podSelectors throughout.
  - bp-wordpress-tenant: cnpgNamespaceLabel="" path resolves to
    `.Release.Namespace` via the `cnpgNamespace` helper.
  - bp-llm-gateway: already pod-selects on
    `cnpg.io/cluster=bp-llm-gateway-audit`.
  - bp-keycloak / bp-gitea / bp-grafana / bp-mimir: no own
    networkpolicy.yaml template (grafana/mimir pass enabled=false
    to upstream subcharts).

Validation:

  - helm template render clean for all 4 charts.
  - `kubectl apply --dry-run=server` on t31 — all 4 NetworkPolicies
    accepted by the API server.
  - Verbatim render confirms the auto-emitted cluster name matches the
    label on the existing CNPG Pod (newapi-bp-newapi-newapi-pg).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 08:02:59 +04:00
github-actions[bot]
a92ef43beb deploy: bump sandbox-controller image to f442c28 2026-05-19 04:02:49 +00:00
github-actions[bot]
be2833cfb4 deploy: bump sandbox-mcp-server image to f442c28 2026-05-19 04:01:21 +00:00
hatiyildiz
48687ef24d deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.185 -> 1.4.186 (auto, Refs TBD-A6) 2026-05-19 04:01:11 +00:00
e3mrah
dfa17c1b98
fix(catalyst-cnp): allow egress to TCP/6443 for multi-region fan-out (#1908) (#1912)
TBD-A45 — baseline-default-deny CNP world-egress block previously
allowed only 443/587/465/25, so catalyst-api fan-out to secondary
kube-apiservers on TCP/6443 (D5/D16/D20) silently timed out on the
informer reflector List() call and returned primary-only results.

A152 diagnostic on t31 (3-region fresh prov):
  kubectl -n catalyst-system exec deploy/catalyst-api -- \
    nc -zvw 3 49.12.210.78 6443
  nc: connect to 49.12.210.78 port 6443 (tcp) timed out
vs. SAME endpoint from the bastion: open.

Fix:
- Add TCP/6443 to the world toEntities egress block in
  templates/network-policies/baseline-catalyst-system.yaml. World scope
  is correct per the OpenOva ClusterMesh model — inter-region link is
  always DMZ over public IPs, secondary api-server LB FQDNs are
  per-prov and unpredictable at chart-render time. Attack surface is
  bounded by TLS client-cert auth (only secondary-region kubeconfigs
  on the catalyst-api PVC hold valid certs).
- Extend tests/baseline-cnp-allowlist.sh (new Case 5b) so any future
  narrowing of this block fails Blueprint Release publish CI before
  the OCI artifact reaches a Sovereign.
- Bump chart 1.4.185 -> 1.4.186 with full Chart.yaml header changelog.

Real-cluster validation on t31 (primary, Cilium):
- kubectl apply -f rendered-cnp.yaml -> CNP patched
- nc from catalyst-api pod to 49.12.210.78:6443 -> open (was: timeout)
- nc from catalyst-api pod to 5.223.74.173:6443 -> open (was: timeout)
- catalyst-api rolled, new pod nc -> open (sticks across restarts)

chart/tests/baseline-cnp-allowlist.sh: 13/13 cases pass (was 12).

Closes #1908
Refs #1904 (this unblocks D5/D16/D20 fan-out RED)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 08:00:27 +04:00
e3mrah
f442c28174
fix(gitea-client): use POST /api/v1/orgs not /admin/orgs for org create (TBD-A43, Closes #1906) (#1910)
Gitea 1.22+ no longer routes POST /api/v1/admin/orgs — that path is
GET-only (admin list) and returns 405 with `Allow: GET`. The supported
create endpoint is POST /api/v1/orgs (org-create-as-self): the
authenticated principal owns the new Org. Because the
organization-controller authenticates with the Gitea admin token
(catalyst-gitea-token, owner=gitea_admin), the admin user owns each
tenant Org — same semantic as the legacy admin path.

Symptom on t31: catalyst-organization-controller loops on
"gitea.EnsureOrg: create: gitea: POST .../api/v1/admin/orgs: HTTP 405",
blocking D29 Step 7 (tenant Gitea Org provisioning).

Real Gitea API proof (t31, Gitea 1.22.3):
  - BEFORE: POST /api/v1/admin/orgs → 405 Method Not Allowed (Allow: GET)
  - AFTER:  POST /api/v1/orgs       → 201 Created
  - 422 on duplicate username → unchanged (still mapped to errAlreadyExists)
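
A minimal sketch of the corrected request and status mapping. `newCreateOrgRequest` and `classifyCreateOrg` are hypothetical helpers, the base URL and token in `main` are placeholders, and the request is built but not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"strings"
)

var errAlreadyExists = errors.New("gitea: org already exists")

// newCreateOrgRequest builds POST <base>/api/v1/orgs; the authenticated
// principal (here the admin token's user) becomes the owner of the new Org.
func newCreateOrgRequest(baseURL, token, org string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"username": org})
	if err != nil {
		return nil, err
	}
	u := strings.TrimRight(baseURL, "/") + "/api/v1/orgs"
	req, err := http.NewRequest(http.MethodPost, u, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// classifyCreateOrg maps the response status: 201 is success, 422 is treated
// as "already exists", anything else (including the old 405) is an error.
func classifyCreateOrg(status int) error {
	switch status {
	case http.StatusCreated:
		return nil
	case http.StatusUnprocessableEntity:
		return errAlreadyExists
	default:
		return fmt.Errorf("gitea: POST /api/v1/orgs: HTTP %d", status)
	}
}

func main() {
	// Placeholder URL and token, for illustration only.
	req, err := newCreateOrgRequest("https://gitea.example.test/", "PLACEHOLDER", "demo")
	fmt.Println(req.Method, req.URL.Path, err)
	fmt.Println(classifyCreateOrg(http.StatusMethodNotAllowed))
}
```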

Closes #1906
Refs TBD-A43

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:59:08 +04:00
hatiyildiz
8b5cab3aae deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.24 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1) 2026-05-19 03:58:28 +00:00
hatiyildiz
11c70c6f14 deploy(bp-powerdns): bump bootstrap-kit pin -> 1.2.4 (auto, Refs TBD-A6, retry 1) 2026-05-19 03:58:05 +00:00
hatiyildiz
5e67c7c3f4 deploy(bp-keycloak): bump bootstrap-kit pin -> 1.4.6 (auto, Refs TBD-A6, retry 1) 2026-05-19 03:57:59 +00:00
hatiyildiz
8b1665a17c deploy(bp-openbao): bump bootstrap-kit pin -> 1.2.17 (auto, Refs TBD-A6, retry 2) 2026-05-19 03:57:57 +00:00
hatiyildiz
57fb4c2c23 deploy(bp-gitea): bump bootstrap-kit pin -> 1.2.8 (auto, Refs TBD-A6, retry 2) 2026-05-19 03:57:55 +00:00
hatiyildiz
03aa91eaa2 deploy(bp-grafana): bump bootstrap-kit pin -> 1.0.2 (auto, Refs TBD-A6, retry 1) 2026-05-19 03:57:53 +00:00
hatiyildiz
901fdcd635 deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.18 (auto, Refs TBD-A6, retry 1) 2026-05-19 03:57:48 +00:00
hatiyildiz
76101f621a deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.23 (auto, Refs TBD-A6, retry 1) 2026-05-19 03:57:44 +00:00
github-actions[bot]
8586fff4ac deploy: bump bp-newapi upstream v0.13.2 chart 1.4.24 2026-05-19 03:57:40 +00:00
e3mrah
0a45a790e7
fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909)
PR #1888 (TBD-A30) fixed catalyst-system HTTPRoutes for multi-zone
Sovereigns whose Cilium Gateway renames HTTPS listeners from `https` to
`https-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)
when more than one parent zone is enabled. Every public HTTPRoute pinned
to `sectionName: https` got `Accepted=False NoMatchingListener` and the
hosted service 404'd / connection-refused.

That fix only touched products/catalyst/chart. Per-blueprint HTTPRoutes
shipped the same `sectionName: https` default in values.yaml, so on a
multi-zone Sovereign every blueprint route — gitea, grafana, harbor,
keycloak, newapi, openbao, powerdns, stalwart-tenant — silently failed
to attach. TBD-A40 / issue #1902.

Sweep verbatim:

  $ git grep -nE 'sectionName:[[:space:]]+(https|"https")[[:space:]]*$' \
      platform/*/chart/ products/ clusters/ core/ 2>/dev/null \
      | grep -v 'platform/gateway-api/chart/templates'
  platform/gitea/chart/values.yaml:168:    sectionName: https
  platform/grafana/chart/values.yaml:124:    sectionName: https
  platform/harbor/chart/values.yaml:437:    sectionName: https
  platform/keycloak/chart/values.yaml:482:    sectionName: https
  platform/newapi/chart/values.yaml:721:      sectionName: https
  platform/openbao/chart/values.yaml:72:    sectionName: https
  platform/powerdns/chart/values.yaml:407:      sectionName: https
  platform/stalwart-tenant/chart/values.yaml:297:      sectionName: https
  products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go:802:        sectionName: https

Fix (Option C — omit sectionName, same as PR #1888):

  - 8 blueprint values.yaml defaults flipped from `sectionName: https` to
    `sectionName: ""`. The chart templates already guard with `{{- with
    .Values.gateway.parentRef.sectionName }}`, so a blank value drops the
    field entirely and Cilium Gateway matches by hostname filter.

  - platform/newapi/chart/templates/httproute.yaml was the outlier: it
    used `default "https" $parent.sectionName` which fell back to `https`
    even when values.yaml said empty. Rewritten to `{{- with
    $parent.sectionName }}` so empty drops the field — same pattern as
    the other 7 blueprints.

  - products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
    renders a per-tenant bp-keycloak HelmRelease and injected
    `sectionName: https` into spec.values. Flipped to `sectionName: ""`
    so the bp-keycloak chart's `{{- with }}` guard drops the field.
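
The difference between the two template idioms above can be modelled
roughly like this. It is a Python sketch of the rendering behaviour only,
not Helm itself; the function names and the `cilium-gateway` parentRef
line are illustrative assumptions.

```python
def render_with_guard(section_name):
    """Mimic `{{- with ...sectionName }}`: emit the field only if non-empty."""
    lines = ["- name: cilium-gateway"]
    if section_name:
        lines.append(f'  sectionName: "{section_name}"')
    return lines


def render_with_default(section_name):
    """Mimic `default "https" ...sectionName`: empty falls back to https."""
    value = section_name or "https"
    return ["- name: cilium-gateway", f'  sectionName: "{value}"']


# An empty value drops the field under the guard but not under `default`.
print(render_with_guard(""))
print(render_with_default(""))
print(render_with_guard("https-omani-works"))
```

Under the guard, an explicit override still renders, which is the
behaviour the override path below relies on.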

Validation (real `helm template`, default values, gateway enabled, no
sectionName override) — Principle #15:

  gitea            : sectionName lines in rendered output = 0
  grafana          : sectionName lines in rendered output = 0
  harbor           : sectionName lines in rendered output = 0
  keycloak         : sectionName lines in rendered output = 0
  openbao          : sectionName lines in rendered output = 0
  powerdns         : sectionName lines in rendered output = 0
  newapi           : sectionName lines in rendered output = 0
  stalwart-tenant  : sectionName lines in rendered output = 0

Override path preserved — `--set ...parentRef.sectionName=https-omani-works`
on each chart renders `sectionName: "https-omani-works"` correctly,
so operators on single-zone clusters or non-Cilium gateways can still
pin explicitly via bootstrap-kit overlay.

helm lint clean on all 8 blueprint charts (newapi cnpg-cluster.yaml lint
error is pre-existing on origin/main, unrelated to this fix).

Chart bumps (each blueprint also bumps blueprint.yaml spec.version per
#817 lockstep):
  bp-gitea            1.2.7  -> 1.2.8
  bp-grafana          1.0.1  -> 1.0.2
  bp-harbor           1.2.17 -> 1.2.18
  bp-keycloak         1.4.5  -> 1.4.6
  bp-newapi           1.4.22 -> 1.4.23
  bp-openbao          1.2.16 -> 1.2.17
  bp-powerdns         1.2.3  -> 1.2.4
  bp-stalwart-tenant  0.1.2  -> 0.1.3

Refs TBD-A40.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:57:12 +04:00
hatiyildiz
9657448a72 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.184 -> 1.4.185 (auto, Refs TBD-A6) 2026-05-19 03:34:36 +00:00
e3mrah
833214a5aa
fix(provisioning-rbac): grant create organizations.orgs.openova.io (TBD-A38, Closes #1900) (#1903)
A143 D29 walk on t31 caught the tenant.created Kafka consumer 403ing in
a 5s NAK-retry loop forever:

    403 Forbidden: system:serviceaccount:sme:provisioning cannot create
    resource "organizations" in API group "orgs.openova.io"

A29 PR #1860 shipped the Go consumer code that creates one Organization
CR per voucher checkout (D29 step 5) but did NOT bump the chart RBAC.
Step 5 fails -> steps 6/7/8 of the customer journey blocked.

Add to ClusterRole sme-provisioning:

  - apiGroups: ["orgs.openova.io"]
    resources: ["organizations"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
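
As a rough illustration of why the added rule flips the answer from `no`
to `yes`, the rule can be checked with a simplified matcher. This is a
Python sketch under stated assumptions: `can_i` is a hypothetical helper
that only compares apiGroups, resources and verbs, and it is not the
Kubernetes authorizer (no resourceNames, subresources or aggregation).

```python
SME_PROVISIONING_RULES = [
    {
        "apiGroups": ["orgs.openova.io"],
        "resources": ["organizations"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    },
]


def can_i(rules, api_group, resource, verb):
    """Return True if any rule lists the group, resource and verb (or "*")."""

    def has(values, wanted):
        return wanted in values or "*" in values

    return any(
        has(rule.get("apiGroups", []), api_group)
        and has(rule.get("resources", []), resource)
        and has(rule.get("verbs", []), verb)
        for rule in rules
    )


print(can_i(SME_PROVISIONING_RULES, "orgs.openova.io", "organizations", "create"))
print(can_i([], "orgs.openova.io", "organizations", "create"))
```

The empty rule list stands in for the pre-fix state, where no rule
mentioned the `orgs.openova.io` group at all.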

Bump chart 1.4.184 -> 1.4.185.

Validation per Principle #15 (real kubectl auth can-i against t31, not jq grep):

  $ kubectl --kubeconfig=/tmp/t31-primary.kubeconfig auth can-i create \
      organizations.orgs.openova.io --as=system:serviceaccount:sme:provisioning
  Warning: resource 'organizations' is not namespace scoped in group 'orgs.openova.io'
  yes

Same `yes` for get / list / watch / update / patch / delete. Pre-fix
baseline was `no`. The ClusterRole was applied via `helm template . |
yq 'select(.kind == "ClusterRole")' | kubectl apply -f -`, then can-i
re-run to confirm.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:33:58 +04:00
e3mrah
8535df6923
fix(sovereign-tls): cap Gateway annotations at 8 to satisfy gateway-api CRD (TBD-A36, Closes #1896, Refs #1897) (#1898)
PR #1889 added 10 Hetzner-LB annotations to `Gateway/cilium-gateway`
`spec.infrastructure.annotations`. The Gateway-API CRD declares
`maxProperties: 8` on that field, so Flux SSA rejected the manifest:

  spec.infrastructure.annotations: Too many: 10: must have at most 8 items

→ Gateway never reconciled → cilium-gateway-cilium-gateway Service stayed
ClusterIP → no Hetzner LB at the Service layer → public TLS at
console.<fqdn>:443 reset at the handshake. Blocked t28/t29/t30 since
2026-05-19 00:50:35Z.

Fix (Option A per A130): drop the two health-check timing annotations
(health-check-interval, health-check-timeout). hcloud-CCM defaults match
the values we were declaring (15s / 10s) so runtime health-check
behaviour is unchanged. The remaining 8 annotations are the minimum set
required to materialise a public-IP TCP-health-checked Hetzner LB on the
correct location/type with the correct backend port.
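
The size limit and the fix can be sketched as a pre-flight check. This is
a Python illustration only: the placeholder keys are hypothetical (the
real annotation names are not listed in this message), and the server-side
dry-run below remains the authoritative validation.

```python
MAX_ANNOTATIONS = 8  # limit quoted in the rejection message above


def check_annotation_count(annotations, limit=MAX_ANNOTATIONS):
    """Return None if within the limit, else an error string."""
    count = len(annotations)
    if count > limit:
        return f"Too many: {count}: must have at most {limit} items"
    return None


def drop_keys(annotations, keys):
    """Return a copy of the mapping without the given keys."""
    return {k: v for k, v in annotations.items() if k not in keys}


# Eight placeholder annotations plus the two health-check timing ones.
annotations = {f"example/placeholder-{i}": "x" for i in range(8)}
annotations["health-check-interval"] = "15s"
annotations["health-check-timeout"] = "10s"

print(check_annotation_count(annotations))
trimmed = drop_keys(annotations, {"health-check-interval", "health-check-timeout"})
print(check_annotation_count(trimmed))
```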

Validated with `kubectl apply --dry-run=server` against the mothership
cluster (Principle #15 — IaC evaluator over text grep) before merge.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 06:15:41 +04:00
e3mrah
4482428fa3
docs: add Principle 15 — validate IaC with the IaC evaluator, not Python/jq simulation (#1895)
PR #1892 (TBD-A32 listener wildcard depth) was admin-merged with
"verified via Python jsonencode() simulation" — but tofu HCL's
type-unification rule rejected the ternary at plan-time. Every new
prov failed at 23s. A128 hotfix (#1894) shipped with REAL tofu
validate evidence.

Codify the rule: for .tf/.tftpl use tofu validate / tofu plan; for
Helm use helm template piped to kubectl apply --dry-run=server; for
manifests use --dry-run=server (not client). Python json.dumps and
jq greps are theater — they accept structurally-different shapes
the IaC evaluator rejects.

Refs PR #1892, PR #1894 (A128 hotfix).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 05:37:56 +04:00
github-actions[bot]
6582bc031d deploy: update catalyst images to 20b502d 2026-05-19 01:35:32 +00:00
e3mrah
20b502d790
fix(infra/hetzner): drop tuple-shape conditional in per_prov_listeners (TBD-A35, Closes #1886) (#1894)
PR #1892 (TBD-A32 fix for shared-zone collision) introduced an HCL
"Inconsistent conditional result types" error at infra/hetzner/main.tf
line 468. Every fresh prov failed at tofu plan in 23s, e.g. A127 t29
attempt (deployment 4afd9ebceea92547) at 2026-05-19 01:08:41Z.

Root cause: `local.per_prov_listeners` was defined as

    local.parent_domains_includes_sovereign_fqdn ? [] : [HTTPS_obj, HTTP_obj]

HCL/tofu cannot unify the conditional arms: the true arm is `tuple([])`
(length 0) and the false arm is `tuple([obj_with_tls, obj_without_tls])`
(length 2). Even moving the conditional to the consumer line in
`concat()` did not fix it — the same length-0 vs length-2 tuple
unification still fails.

Fix: emit `per_prov_listeners` unconditionally as the 2-element tuple,
then suppress it at the `concat()` consumer with a for-iteration filter

    [for l in local.per_prov_listeners : l if !<collides>]

which always produces a list (length 0 or 2 — same element type), so HCL
never needs to unify two tuple types.

Validated locally with OpenTofu v1.8.5 against a minimal tfvars fixture:
- `tofu validate` → "Success! The configuration is valid."
- `tofu console` with sovereign_fqdn="t29.omani.works", parent="omani.works":
  emits 4 listeners (parent https/http for *.omani.works + per-prov
  https-t29-omani-works/http-t29-omani-works for *.t29.omani.works) —
  matches PR #1892's intent.
- `tofu console` with sovereign_fqdn="omani.works" (collision):
  emits 2 listeners (only parent https/http) — collision guard preserved.
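
The shape of the fix (always build the pair, then filter at the consumer)
can be mirrored in Python. This is a sketch of the idea only, not the HCL
in infra/hetzner/main.tf; the function names and the dict fields are
assumptions, and Python has no tuple-unification step, so it shows the
data flow rather than the type error.

```python
def per_prov_listeners(sovereign_fqdn):
    """Always build both per-prov listeners, regardless of collision."""
    suffix = sovereign_fqdn.replace(".", "-")
    host = f"*.{sovereign_fqdn}"
    return [
        {"name": f"https-{suffix}", "hostname": host, "tls": True},
        {"name": f"http-{suffix}", "hostname": host, "tls": False},
    ]


def all_listeners(parent_listeners, sovereign_fqdn, parent_domains):
    """Concatenate parent listeners with the filtered per-prov pair."""
    collides = sovereign_fqdn in parent_domains
    # Mirrors `[for l in local.per_prov_listeners : l if !<collides>]`.
    kept = [l for l in per_prov_listeners(sovereign_fqdn) if not collides]
    return parent_listeners + kept


parent = [
    {"name": "https", "hostname": "*.omani.works", "tls": True},
    {"name": "http", "hostname": "*.omani.works", "tls": False},
]
print([l["name"] for l in all_listeners(parent, "t29.omani.works", ["omani.works"])])
print([l["name"] for l in all_listeners(parent, "omani.works", ["omani.works"])])
```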

No chart bump; this is a tofu-only change. Re-closes #1886 after #1892
re-opened it via the type-mismatch regression.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 05:33:35 +04:00
github-actions[bot]
1b31b85d42 deploy: update catalyst images to 0020ef8 2026-05-19 01:25:23 +00:00
e3mrah
0020ef8129
fix(catalyst-api): seed owner UserAccess at bake-time, not at handover (TBD-A34, Closes #1891) (#1893)
D21 (owner UserAccess CR) was previously only seeded by
auth_handover.go::seedOwnerUserAccess after a live PIN-login. The
zero-touch convergence verifier cannot drive a PIN-login from CI, so
D21 stayed RED on every fresh prov until an operator manually
authenticated — even though SOVEREIGN_FQDN + OPERATOR_EMAIL + the
UserAccess CRD are all stable on the chroot from bake-time onward.

This slice adds a bake-time goroutine in main() that calls the
existing handler.EnsureOwnerUserAccess against the in-cluster
dynamic client when:
  - the dynamic client is non-nil (in-cluster mode),
  - SOVEREIGN_FQDN env is set (chroot mode), and
  - OPERATOR_EMAIL env is set (orgEmail stamped via sovereign-fqdn
    ConfigMap).

Capped backoff (0/5/10/20/40s) tolerates the UserAccess CRD being
installed after the API starts. Idempotent: EnsureOwnerUserAccess folds AlreadyExists to
nil, so the existing handover-fired path still works without
regression. Each skip / converged / error path logs at Info or Warn
so an operator can confirm bake-time seeding from stdout without
scraping the CR.
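
The retry behaviour described above can be sketched as follows. This is a
minimal Python illustration, not the goroutine in main(): the names
`seed_owner_access` and `AlreadyExists` are hypothetical, the sleep
function is injected so the schedule can be checked without waiting, and
treating any RuntimeError as retryable is an assumption.

```python
BACKOFF_SECONDS = (0, 5, 10, 20, 40)  # schedule quoted in the message


class AlreadyExists(Exception):
    """Stand-in for the Kubernetes AlreadyExists API error."""


def seed_owner_access(create, sleep, delays=BACKOFF_SECONDS):
    """Call `create` once per delay slot; return True once the CR exists."""
    for delay in delays:
        sleep(delay)
        try:
            create()
            return True
        except AlreadyExists:
            # Idempotent: an existing CR counts as converged.
            return True
        except RuntimeError:
            # Assumed retryable, e.g. the CRD is not installed yet.
            continue
    return False
```

With a real clock, `time.sleep` would be passed as `sleep`; a list's
`append` method works as a recorder in tests.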

Tests in cmd/api/main_test.go cover the happy path, all three skip
branches (nil client, empty SOVEREIGN_FQDN, empty OPERATOR_EMAIL),
and an idempotent re-run simulating Pod restart.

Refs A116 diagnostic; supersedes the handover-only seed path for
zero-touch verification.

Closes #1891

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 05:22:13 +04:00
github-actions[bot]
b34f56dd22 deploy: update catalyst images to 1da2162 2026-05-19 01:04:29 +00:00
e3mrah
1da216205a
fix(gateway): add per-prov 2-label wildcard listener for shared parent zones (Closes #1886, TBD-A32) (#1892)
The Cilium Gateway template emits `hostname: *.<parent-zone>` listeners
(e.g. `*.omani.works`). Per Gateway-API spec wildcard semantics that
matches EXACTLY one label depth, so `foo.omani.works` matches but
`console.t28.omani.works` does NOT. On every shared-parent-zone topology
(every per-prov Sovereign under omani.works) the operator-facing FQDN
is 2-label-deep — `curl -skI https://console.t28.omani.works/` reset at
TLS handshake even though `sovereign-wildcard-tls-t28-omani-works`
already contained all 13 per-prov SANs.

Fix: locals.per_prov_listeners in infra/hetzner/main.tf appends an extra
listener pair hostnamed `*.<sovereign_fqdn>` bound to the per-prov cert
`sovereign-wildcard-tls-<fqdn-dashed>` rendered by
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml. Skipped when
sovereign_fqdn equals one of the declared parent-zone names (legacy
single-zone-on-apex case) so no duplicate listener-name Conflict.

Verified by simulated jsonencode against three scenarios:

1. t28 multi-zone (sovereign_fqdn=t28.omani.works, parent_domains=
   [omani.works, omani.homes]) — emits 6 listeners:
     https-omani-works     hostname=*.omani.works     cert=sovereign-wildcard-tls-omani-works
     http-omani-works      hostname=*.omani.works
     https-omani-homes     hostname=*.omani.homes     cert=sovereign-wildcard-tls-omani-homes
     http-omani-homes      hostname=*.omani.homes
     https-t28-omani-works hostname=*.t28.omani.works cert=sovereign-wildcard-tls-t28-omani-works
     http-t28-omani-works  hostname=*.t28.omani.works

2. t28 single parent zone (sovereign_fqdn=t28.omani.works,
   parent_domains=[omani.works]) — emits 4 listeners (bare `https`/`http`
   for backward-compat with legacy sectionName HTTPRoutes + per-prov
   `https-t28-omani-works`/`http-t28-omani-works`).

3. Legacy apex (sovereign_fqdn=omani.works, parent_domains=
   [omani.works]) — collision guard active, emits only bare `https`/`http`.

All scenarios produce unique listener names.
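
The three scenarios can be reproduced with a small naming model. This is
a Python sketch, not the tofu template: `listener_names` is a
hypothetical helper, and it assumes zone names are sanitised by replacing
dots with dashes and that the suffixed names are used only when more than
one parent zone is declared.

```python
def dashed(fqdn):
    """Assumed sanitisation: dots become dashes."""
    return fqdn.replace(".", "-")


def listener_names(sovereign_fqdn, parent_domains):
    """Return listener names in emission order for one Sovereign."""
    multi_zone = len(parent_domains) > 1
    names = []
    for zone in parent_domains:
        suffix = f"-{dashed(zone)}" if multi_zone else ""
        names += [f"https{suffix}", f"http{suffix}"]
    if sovereign_fqdn not in parent_domains:  # collision guard
        prov = dashed(sovereign_fqdn)
        names += [f"https-{prov}", f"http-{prov}"]
    return names


for fqdn, parents in [
    ("t28.omani.works", ["omani.works", "omani.homes"]),
    ("t28.omani.works", ["omani.works"]),
    ("omani.works", ["omani.works"]),
]:
    names = listener_names(fqdn, parents)
    print(len(names), names, "unique:", len(set(names)) == len(names))
```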

Safe because every catalyst-system HTTPRoute now omits sectionName
(PR #1888 closing #1884) — Cilium attaches via hostname match, so the
per-prov 2-label listener catches `console.<fqdn>` / `api.<fqdn>` /
`marketplace.<fqdn>` / etc.

Refs A110 t28 scorecard, A107 D29 walk.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 05:02:36 +04:00
hatiyildiz
ae4ead480a deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.183 -> 1.4.184 (auto, Refs TBD-A6) 2026-05-19 00:55:00 +00:00
e3mrah
ed91f40d57
fix(sovereign-tls): wire Cilium Gateway listener at per-prov cert; stop parent-zone wildcard render (TBD-A29, Closes #1883) (#1890)
The Sovereign's Cilium Gateway listener `https-<parent-zone>` referenced
the parent-zone wildcard Secret `sovereign-wildcard-tls-<sanitised(parent)>`
(e.g. `sovereign-wildcard-tls-omani-works` for `*.omani.works`). That cert
is minted by `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml`
and SHARES Let's Encrypt's "5 New Certificates per Exact Set of Identifiers
per 168h" bucket with every other Sovereign on the same parent zone. After
~5 wipe+reprov cycles on `omani.works` the listener pinned to a
`Ready=False` Certificate (cert-manager spun the order forever, LE returned
`urn:ietf:params:acme:error:rateLimited`). A107 t28 evidence: per-prov cert
`sovereign-wildcard-tls-t28-omani-works` IS `Ready=True` but unused.

Fix (two parts):

1. `infra/hetzner/main.tf` — `parent_domains_listeners_yaml` now points
   each listener's `tls.certificateRefs[0].name` at the PER-PROV cert
   `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}` (rendered by
   `clusters/_template/sovereign-tls/cilium-gateway-cert.yaml` with the
   explicit SAN list `[console.<sovereign-fqdn>, auth.<sovereign-fqdn>,
   ..., sandbox.<sovereign-fqdn>]`). Per-prov identifier sets get their
   own 5/168h bucket per Sovereign so reprovs never share LE budget.
   New `local.sovereign_fqdn_dashed = replace(var.sovereign_fqdn, ".",
   "-")` is the SAME suffix `cilium-gateway-cert.yaml` /
   `cilium-envoy-tls-restart-job.yaml` already use, so the listener +
   cert + restart-job RBAC stay in lockstep.

2. `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` --
   skip-render unconditionally (`{{- if false }}` wrap around the
   `wildcardCert.enabled` guard). The parent-zone wildcards it minted
   are no longer referenced by anything and burn LE budget on every
   install. Template body kept for `git blame` / future revival under
   issue #831 (multi-listener per-zone tenant TLS with non-wildcard SAN
   lists). Removes 2 Certificate resources per multi-zone Sovereign.

Verification (helm template):

  helm template products/catalyst/chart \
      --set parentZones[0].name=omani.works --set parentZones[0].role=primary \
      --set parentZones[1].name=omani.homes --set parentZones[1].role=sme-pool \
      --set global.sovereignFQDN=t28.omani.works \
      --set wildcardCert.enabled=true \
    | grep -c 'sovereign-wildcard-cert'
  # before: 2  (two parent-zone Certificates rendered)
  # after:  0  (zero -- template skip-renders)

Chart bumped 1.4.182 -> 1.4.183 so the next Blueprint Release republishes
the OCI artifact with the skip-render change.

Hostname semantics unchanged: listener `hostname: *.<parent-zone>` still
matches any FQDN under the parent; cilium-envoy SNI dispatch serves the
per-prov cert whose SAN list covers the requested hostname (operator's
console/auth/gitea/etc. subdomains under `<sovereign-fqdn>`). Tenant
URLs under non-primary parent zones (`wp-foo.omani.homes`) remain out
of scope for A29; those need explicit per-tenant cert wiring via #831.

Closes #1883

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:54:18 +04:00