openova

Author	SHA1	Message	Date
e3mrah	177b4d74de	fix(bp-catalyst-platform): scope console HTTPRoute to console.<fqdn>, free auth.<fqdn> for Keycloak (#1925) TBD-A42 (issue #1905): the `tenant-wildcard` HTTPRoute in products/catalyst/chart/templates/sme-services/marketplace-routes.yaml claimed `*.<global.sovereignFQDN>` and routed every match to sme/console:8080. On Cilium Gateway, the wildcard route shadowed exact-match platform HTTPRoutes (auth.<sov> -> keycloak, console.<sov> -> catalyst-ui, api.<sov> -> catalyst-api, pdns.<sov> -> powerdns, grafana.<sov> -> grafana, etc.) even though Gateway API spec section 5.2.1 says exact wins over wildcard. Admission-order-dependent precedence on t31 meant `auth.t31.omani.works` returned 4836B Astro HTML (SME console SPA) instead of Keycloak's login page, blocking D4 SSO PIN-bounce (#1807). Same precedence-collision family as A30/A40/A32. Fix: replace the single `tenant-wildcard` HTTPRoute with N explicit per-slug HTTPRoutes named `tenant-<slug>` with hostname `<slug>.<global.sovereignFQDN>` EXACT - no wildcard, no shadowing possible by construction. Slug list comes from a new operator-supplied `ingress.marketplace.tenantSlugs[]` value, default empty list. With the default, ZERO catch-all routes are emitted, so platform subdomains (auth/console/api/...) can NEVER be hijacked. Per-tenant routes for Orgs created post-provision continue to be written live by the organization-controller (templates/sme-services/tenant-public-routes.yaml emits the byte-identical chart-side analogue), so the SaaS-tenant traffic path is unchanged for any Org the controller knows about. marketplace-reference-grant.yaml already covers catalyst-system -> sme/console - every new `tenant-<slug>` HTTPRoute is in catalyst-system pointing at sme/console, so no grant change is needed. Comment updated to note the wildcard->per-slug refactor. Verified on t32 2026-05-19: helm template ... 
--set ingress.marketplace.tenantSlugs={demo} \ | kubectl apply --dry-run=server -> marketplace HTTPRoute configured + tenant-demo HTTPRoute created Before fix the same template emitted `tenant-wildcard` with `hostnames: ["*.t32.omani.works"]`; after fix, no catch-all is rendered and `auth.t32.omani.works` is reachable by Keycloak's exact-match HTTPRoute only. Files changed: - products/catalyst/chart/templates/sme-services/marketplace-routes.yaml - products/catalyst/chart/values.yaml - products/catalyst/chart/templates/sme-services/marketplace-reference-grant.yaml - products/catalyst/chart/Chart.yaml (1.4.189 -> 1.4.190) - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml (pin bump) Closes #1905 Closes #1807 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 11:12:53 +04:00
github-actions[bot]	9539c03b59	deploy: update catalyst images to `174fc70`	2026-05-19 07:09:18 +00:00
e3mrah	174fc703b1	fix(catalyst-cnp): add sme + newapi NS to baseline-default-deny egress (TBD-A43) (#1923) PR #1912 was theater for the D29 customer-journey blocker. It was titled "fix catalyst-system → sme/newapi egress" but only added world TCP/6443 and never extended `.Values.security.baselineCnp.allowedPlatformNamespaces`. t32 fresh-prov walk (af1da1e7, 2026-05-19) confirmed the live CNP still listed only [keycloak gitea powerdns cnpg-system openbao harbor nats-system loki mimir tempo alloy opentelemetry external-secrets-system cert-manager]. Console → `gateway.sme.svc:8080` returned 503 `context deadline exceeded`. Fix: append `sme` + `newapi` to the values default, extend `tests/baseline-cnp-allowlist.sh` with Cases 5c + 5d so any future narrowing fails Blueprint Release CI before the OCI artifact ships, bump Chart.yaml 1.4.188 → 1.4.189, bump bootstrap-kit pin 1.4.188 → 1.4.189. 15/15 chart-tests green (was 13). kubectl --dry-run=server validation passes. Closes #1920 Refs #1912 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 11:07:05 +04:00
hatiyildiz	18f7801759	deploy(bp-newapi): bump bootstrap-kit pin 1.4.27 -> 1.4.28 (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version 1.4.27 -> 1.4.28 (Refs TBD-A20, #1856).	2026-05-19 06:58:22 +00:00
github-actions[bot]	ecb0974704	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.28	2026-05-19 06:57:45 +00:00
e3mrah	472e8c69f9	fix(bp-newapi): consume CNPG-managed app secret instead of stale DSN (TBD-A39, Closes #1834 ) (#1921 ) * fix(bp-newapi): consume CNPG-managed app secret via sync-job (TBD-A39, Closes #1834) D34 close-audit on t32 (2026-05-19) found newapi-bp-newapi in 21x CrashLoopBackOff with `SASL auth: FATAL: password authentication failed for user "newapi"`. Public probe to `newapi.t32.omani.works` returned envoy 503 "no healthy upstream". Root cause: chart's templates/cnpg-cluster.yaml rendered the DSN Secret via Helm `lookup "v1" "Secret" .Release.Namespace <cluster>-app` at template time. On every freshly-franchised Sovereign CNPG materialises the `<cluster>-app` source Secret only AFTER bp-newapi's HelmRelease applies, so the first render's lookup returns nil and the chart commits the Secret with an empty password — literally `postgres://newapi:@newapi-bp-newapi-newapi-pg-rw.../newapi?sslmode=require`. The Secret carries `helm.sh/resource-policy: keep`, so Flux NEVER overwrites the empty bytes on subsequent reconciles even after CNPG populates the source. The chart's own header comment claims "the 1-minute Flux reconcile picks it up on the next tick" — verified false in production; `resource-policy: keep` pins the empty bytes. Fix: - platform/newapi/chart/templates/cnpg-cluster.yaml: drop the Helm `lookup` + DSN composition. The DSN Secret renders as a chart-managed empty placeholder so kubelet can satisfy the Deployment's secretKeyRef on first schedule (kubelet only checks the key EXISTS). - platform/newapi/chart/templates/database-secret-sync-job.yaml (NEW): Helm post-install/post-upgrade Job + ServiceAccount + Role + Binding. The Job polls `<cluster>-app` (up to 10 min via curl + in-pod SA token), reads the `password` bytes, composes the canonical `postgres://<user>:<password>@<host>:5432/<db>?sslmode=<mode>` string, and strategic-merge PATCHes it into the placeholder. Idempotent. 
- platform/newapi/chart/Chart.yaml: version 1.4.26 → 1.4.27 with full changelog block. - clusters/_template/bootstrap-kit/80-newapi.yaml: bp-newapi pin 1.4.26 → 1.4.27. Pattern lifted from platform/gitea/chart/templates/database-secret-sync-job.yaml (canonical seam — issue #830 Bug 2, proven on otech30) and platform/wordpress-tenant/chart/templates/database-secret-sync-job.yaml (issue #1786, proven on t26). Validation: - `helm dep update && helm template newapi .` renders cleanly with the placeholder Secret + Job + SA + Role + RoleBinding. - `kubectl apply --dry-run=server` against t32 apiserver accepts all 11 rendered objects (server dry run). Refs: TBD-A39 Closes: #1834 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-newapi): bump blueprint.yaml lockstep version to 1.4.27 Sync platform/newapi/blueprint.yaml spec.version with the Chart.yaml bump in the preceding commit. TestBootstrapKit_BlueprintVersionLockstepSweep enforces these two stay aligned (TBD-A20, #1856). Refs: TBD-A39 Refs: #1834 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:57:20 +04:00
github-actions[bot]	7aa02a21b1	deploy: update catalyst images to `bf577e9`	2026-05-19 06:52:05 +00:00
e3mrah	bf577e9d7b	fix(bp-sme): allow egress from catalyst-system to gateway:8080 (TBD-A38, Closes #1917) (#1919) The baseline-default-deny CiliumNetworkPolicy in catalyst-system listed 14 platform namespaces in its egress allow-list (keycloak, gitea, powerdns, cnpg-system, openbao, harbor, nats-system, loki, mimir, tempo, alloy, opentelemetry, external-secrets-system, cert-manager) but did NOT include `sme`. The bp-sme-platform chart deploys the SME control-plane into namespace `sme`, and console in catalyst-system reaches `gateway.sme.svc.cluster.local:8080` for every voucher list / issue / redeem call (plus admin reaches the same gateway for tenant onboarding). Every such call was therefore dropped at the egress hook and timed out at 5s, surfaced at the operator as 503 `context deadline exceeded` on the voucher list / voucher issue panels. Reproduction on t32 (2026-05-19, fresh prov, READ-ONLY): $ kubectl exec -n catalyst-system catalyst-api-59d5cf5644-wrg4x \\ -- curl -m 5 http://gateway.sme.svc.cluster.local:8080/healthz 000 time=5.002937 curl: (28) Connection timed out after 5002 milliseconds Live CNP egress excerpt (kubectl get cnp -n catalyst-system baseline-default-deny -o yaml | yq '.spec.egress[3]'): toEndpoints: - matchExpressions: - key: k8s:io.kubernetes.pod.namespace operator: In values: - keycloak ... - cert-manager # (no 'sme') Fix: add `sme` to BOTH the values.yaml default (`.Values.security.baselineCnp.allowedPlatformNamespaces`) AND the template's `default (list ...)` fallback, so a Helm install with no values overrides still renders the allow. Originally masqueraded under #1748 (voucher list 503) and #1749 (voucher issue 503) — those were thought to be services-build 502 regressions, but this is a distinct CNP-misconfig bug class. Validation: - `helm template` confirms rendered CNP now lists `sme` in egress. - `kubectl apply --dry-run=server` against t32 apiserver passes ("ciliumnetworkpolicy.cilium.io/baseline-default-deny configured"). 
Chart bumped 1.4.188 → 1.4.189; bootstrap-kit pin bumped to match. No live patching on t32 — fix verified via server-side dry-run only, per Principle #15. Closes #1917 Refs #1748 Refs #1749 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 10:49:47 +04:00
e3mrah	446da60ca4	feat(catalyst-api): publish catalyst.tenant.sandbox_requested on Sandbox create (#1918) Adds a NATS-publish hook to HandleCreateSandboxSession so every successful Sandbox CR Create emits a canonical `catalyst.tenant.sandbox_requested` event. Sandbox-controller already consumes this subject (core/controllers/sandbox/internal/controller/nats_bridge.go) and tenant-service's SandboxOrchestrator publishes it from the CRM side, but the catalyst-api FE-driven create path was silently bypassing the audit stream — the symptom #1776 calls out. Surface added: - TenantEvent payload {tenant_id, sandbox_id, requested_by, timestamp, spec_hash} matching the existing audit.Event field naming convention. spec_hash is SHA-256 over the canonical JSON-serialised .spec for drift detection. - TenantEventPublisher interface on the Handler (nil-tolerant: when unset the publish-side is a no-op so CI without CATALYST_NATS_URL still passes; production wiring binds a real publisher). - SetTenantEventPublisher setter mirroring SetAuditBus. - Constant SandboxRequestedSubject = "catalyst.tenant.sandbox_requested" so producer + consumer + tests share one symbol. Wiring: - main.go: newTenantEventPublisherFromEnv placeholder identical in shape to newRBACAuditPublisherFromEnv. Returns nil today because catalyst-api ships without nats.go in go.mod; the real publisher lands in the same follow-up slice that swaps the RBAC stub. CATALYST_NATS_URL gates the wiring; CATALYST_TENANT_NATS_SUBJECT_PREFIX lets operators override the canonical prefix per INVIOLABLE-PRINCIPLES.md #4. Tests (6 new in sandbox_sessions_nats_test.go): - PublishesSandboxRequested: happy-path — exactly one publish on the canonical subject with all fields populated. - NoPublisher_DoesNotFail: nil-tolerant — Sandbox Create still 201s when no publisher is wired (CI, chroot). - PublishError_DoesNotFailRequest: a NATS outage logs + continues; the HTTP response stays 201 since the CR write already succeeded. 
- PublishUsesNamespaceWhenOrgEmpty: single-tenant chroot fallback — tenant_id falls back to the namespace (NOT the orgSlug, which collapses to "default" and would conflate every chroot). - PublishUsesSubWhenEmailEmpty: requested_by falls back to claims.Sub so the field is never blank. - SpecHash_DeterministicAcrossMapOrder: spec_hash stable across map iteration; changes when spec changes. - Subject_MatchesIssueContract: pins the exact subject string per #1776 against accidental drift. Sandbox-controller's consumer list (nats_bridge.go) already includes this subject — no controller-side change required. Closes #1776 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 10:48:18 +04:00
e3mrah	f6334cd023	fix(bp-gitea+bp-harbor): shorten mirror interval to 5m for post-cutover freshness (TBD-A37, Closes #1899) (#1916) * fix(bp-self-sovereign-cutover): post-cutover mirror re-sync CronJob (TBD-A37, Closes #1899) Step-01 (gitea-mirror) only runs ONCE at cutover and produces a STANDALONE local Gitea repo (PR #1029 — pull-mirror semantics block Step-06's HelmRepository URL rewrite push). Without an ongoing re-sync, upstream chart bumps merged AFTER cutover never reach the Sovereign. Live regression on t31 2026-05-19 (A145 verifier): sandbox-controller stuck at image :8017700 from 2026-05-16 even though PR #1862 had merged 2 days earlier with the NATS consume-leg — the upstream values.yaml bump never crossed the seam. This chart bump adds a gitea-mirror-resync CronJob (default schedule "*/5 * * * *") that fires the same idempotent bare-clone + push --mirror --force as Step-01 step (3) every 5 minutes. Pre-cutover fires are no-ops (the script detects the local repo is missing / empty and exits 0); post-cutover fires close the upstream → local Gitea loop. Why CronJob, not Gitea pull-mirror revival? PR #1029 documented why Gitea pull-mirror was abandoned: pull-mirror repos are read-only, blocking Step-06's HelmRepository URL rewrite push. We need a writable local repo that ALSO refreshes from upstream — the natural shape is a periodic force-push from a separate Job. Why CronJob, not push-from-upstream webhook? Slower to implement (requires GitHub App + webhook receiver on each Sovereign + DNS for the webhook URL). Tracked as a future evolution once stable; the CronJob is the minimal correct fix today. Default 5m cadence covers the chart-bump → upstream-merge → Sovereign-reconcile loop in ~10 min end-to-end while staying well under GitHub anonymous-clone rate limits (300 req/hr per IP; one Sovereign = 12 clones/hr). 
Per-Sovereign overlay knobs: .Values.mirrorResync.schedule (cron string) .Values.mirrorResync.suspended (bool, default false) .Values.mirrorResync.jobTimeoutSeconds (default 900) No new RBAC — the CronJob re-uses the existing cutover runner SA and the reflector-mirrored gitea-admin-secret that Step-01 already mounts. concurrencyPolicy: Forbid + startingDeadlineSeconds: 60 keep parallel runs / replay storms harmless. Verification: - helm template test . renders cleanly (2509 lines, +52 from 0.1.32) - tests/cutover-contract.sh all 20 gates GREEN (CronJob doesn't carry the cutover-step labels so the "exactly 9 step ConfigMaps" assertion still passes) - scripts/check-bootstrap-kit-pin-sync.sh PASS (50 chart→pin pairs) Chart 0.1.32 → 0.1.33; bootstrap-kit pin in clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml bumped to match. Closes #1899 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(bp-self-sovereign-cutover): bump blueprint.yaml lockstep to 0.1.33 TBD-A20 BlueprintVersionLockstepSweep CI gate caught the missing blueprint.yaml bump on PR #1916 (the chart Chart.yaml was bumped to 0.1.33 but blueprint.yaml still pinned 0.1.32). Bringing the two in lockstep so the test passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:42:11 +04:00
hatiyildiz	ba4c2687f5	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.187 -> 1.4.188 (auto, Refs TBD-A6)	2026-05-19 06:40:15 +00:00
github-actions[bot]	1bb2e4b481	deploy: update sme service images to `cbfb3ad` + bump chart to 1.4.188	2026-05-19 06:39:37 +00:00
e3mrah	84ebcbeacf	fix(catalyst-chart): propagate SMTP_USER/SMTP_PASS into notification Pod (TBD-X1) (#1915) * fix(catalyst-chart): propagate SMTP_USER/SMTP_PASS into notification Pod (TBD-X1, Refs #1793) Wave 35 SMTP diagnostic root cause: notification.yaml only mounted SMTP_HOST / SMTP_PORT / SMTP_FROM from sme-secrets, so the Go net/smtp client dialed Stalwart without authentication. Stalwart's submission listener rejected every message with 503 5.5.1 "You must authenticate first" -> the (pre-companion-PR) fixed-60s retry storm slammed the relay 3x per message x 5 tenants and tripped Stalwart's [5 requests, 1000ms] rate-limiter for every tenant on the same relay. Fix is a one-symmetry-line with auth.yaml, which has consumed SMTP_USER and SMTP_PASS from sme-secrets since chart 1.4.20 (issue #934). This template was an oversight from the same change-set. The canonical SMTP-credentials propagation chain is already in place and unchanged here: mothership catalyst-openova-kc-credentials (key: smtp-user/smtp-pass) -> sovereign_smtp_seed.go SeedSovereignSMTPCredentials creates catalyst-system/sovereign-smtp-credentials on the new Sovereign (Phase-1, idempotent) -> sme-secrets.yaml lookup with source-wins precedence reads smtp-user / smtp-pass and emits SMTP_USER / SMTP_PASS keys in the per-tenant sme-secrets Secret -> auth.yaml AND (now, this PR) notification.yaml mount those two keys via secretKeyRef -> services-notification main.go reads SMTP_USER + SMTP_PASS via getEnv() -> buildAuth wires smtp.PlainAuth on every Send (companion PR services-notification smtp.go). Chart version bump 1.4.186 -> 1.4.187 per chart-release discipline. helm template test-render products/catalyst/chart \ --set ingress.marketplace.enabled=true | grep SMTP_USER -A2 ... shows both auth.yaml AND notification.yaml mount SMTP_USER from sme-secrets keyed SMTP_USER (verified). 
Companion PR: services-notification smtp.go upgrade to exponential backoff + 3-in-90s circuit breaker so a future credential gap surfaces loudly via ErrCircuitOpen and never restarts a rate-limiter storm. Refs #1793 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-kit): bump bp-catalyst-platform pin 1.4.186 -> 1.4.187 (TBD-X1, Refs #1793) Chart bump in the previous commit changed Chart.yaml version: 1.4.186 -> 1.4.187 (TBD-X1 SMTP_USER/SMTP_PASS wiring). The pin-sync-audit CI step caught the lockstep drift -- bootstrap-kit HelmRelease.spec.chart.spec.version MUST match the chart's Chart.yaml version exactly (see clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml header comment + feedback_21_principles). Refs #1793 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <claude@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:38:29 +04:00
e3mrah	cbfb3adfbe	fix(notification): exponential backoff + circuit breaker on 503 5.5.1 SMTP rate-limit (TBD-X1, Refs #1793 ) (#1914 ) Wave 35 SMTP diagnostic root cause: sme-secrets lost SMTP_USERNAME / SMTP_PASSWORD after sme stack redeploy. Notification pod's net/smtp falls back to no-auth (Mailer.Auth was always nil, and main.go never read SMTP_USER/SMTP_PASS from env) -> Stalwart returns 503 5.5.1 "You must authenticate first" -> the prior fixed-60s retry loop slammed the relay 3x per message x 5 tenants and tripped Stalwart's [5 requests, 1000ms] rate-limiter for the whole submission listener. This PR fixes the retry behaviour and surfaces auth state loudly: 1. Mailer.Auth now wired via smtp.PlainAuth(SMTP_USER, SMTP_PASS, host) read from env in NewMailer. Either-or-neither is a slog.Warn + fall back to no-auth (so the next 503 5.5.1 is the LOUD error path instead of a silent half-broken creds). 2. Retry backoff is now exponential with a 30s floor (per issue spec TBD-X1) and a 5-minute cap: 30s -> 60s -> 120s -> 240s -> 300s (cap). Replaces the prior fixed 60s wait. 3. Circuit breaker (issue spec): 3 consecutive 503 5.5.1 responses inside a 90s sliding window open the breaker. While open, Send() short-circuits to ErrCircuitOpen for 120s cooldown -> the notification consumer NACKs / dead-letters instead of slamming a known-rate-limited relay. Window-aging means slow drips never trip; a single 250 OK between storms resets the consecutive counter via breakerResetOnSuccess. All paths are test-seamed (sendMail / sleep / now). 
Tests cover: - single-retry success keeps base backoff - exponential doubling 30s -> 60s - MaxBackoff cap on long storms - breaker trips at exactly trip-th hit and aborts the in-flight retry - short-circuit on subsequent Send while open - cooldown elapses -> breaker re-closes via fakeNow advance - slow-drip 503s age out of window and never trip - non-rate-limit errors still pass through immediately (no retry) - env-var parsing 30s floor preserved - buildAuth half-config / both / neither matrix go test ./core/services/notification/...: ok Deployment-side wiring (the notification.yaml chart template gaining SMTP_USER + SMTP_PASS env from sme-secrets) ships in a separate PR. Refs #1793 Co-authored-by: hatiyildiz <claude@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:38:22 +04:00
github-actions[bot]	298d404632	deploy: update catalyst images to `618273c`	2026-05-19 04:40:37 +00:00
e3mrah	618273c484	fix(catalyst-api): bake-time top-up of canonical .omani.X sme-pool (TBD-A44, Closes #1907 ) (#1913 ) PR #1861 widened LoadSMETenantParentDomainsFromEnv to seed all four canonical .omani.X TLDs (homes, rest, trade, works), but on a real Sovereign that env-stub fallback path is BYPASSED. The mothership imports a full deployment record with only the operator-selected sme-pool entry, and GET /api/v1/sovereign/parent-domains reads from the imported record (dep.Request.ParentDomains), not the env stub. Result on t31 (2026-05-19, c703247a0de12508): the on-disk record holds 1 primary (omani.works) + 1 sme-pool (omani.homes) = 2 rows. /parent-domains?role=sme-pool returns 1 entry instead of 4. A customer picking .omani.rest or .omani.trade on the marketplace /addons subdomain picker — both options the UI hard-codes — fails SME tenant signup with 422 invalid-parent-domain. Fix shape (same pattern as PR #1893 / D21 owner UserAccess bake-time seed): on every chroot-mode catalyst-api startup AND on every fresh handover import, top up Request.ParentDomains with any missing canonical TLD as role=sme-pool. Idempotent (a re-run is a no-op when the pool is already full); mothership mode (SOVEREIGN_FQDN unset) is a hard no-op; persists to disk so a Pod restart sees the topped-up shape. Dedup is against existing role=sme-pool rows only — a role=primary row on the same name does NOT count, because the customer-facing /addons picker validates against role=sme-pool entries via FindParentDomain. The t31 shape (primary=omani.works AND sme-pool=omani.works needed) is the real-world case. Wired into two seams so a fresh prov AND a Pod restart both converge: HandleDeploymentImport (post-import, fresh prov) and restoreFromStore (per-record rehydration, Pod restart). Five guards in chroot_parent_domains_seed_test.go: AllowedTLDs lockstep, top-up shape (mirrors t31), idempotence, mothership no-op, nil-dep. 
Drive-by: fixed a pre-existing build break in sme_tenant_gitops.go's smeTenantBPKeycloak raw-string constant (PR #1909 introduced literal backticks + a Go template action inside a YAML comment; the action confused text/template at render time → bp-keycloak.yaml render returned `unexpected EOF`). Replaced with prose that describes the chart template behaviour without inlining the template literal. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 08:38:24 +04:00
hatiyildiz	5d8a9c2a4f	deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.26 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1)	2026-05-19 04:04:07 +00:00
hatiyildiz	a100f82d27	deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.25 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1)	2026-05-19 04:03:48 +00:00
hatiyildiz	d1bb5758da	deploy(bp-openmeter): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1)	2026-05-19 04:03:39 +00:00
hatiyildiz	6d38089895	deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.19 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 2)	2026-05-19 04:03:38 +00:00
hatiyildiz	707563bc52	deploy(bp-matrix): lockstep blueprint.yaml spec.version -> 1.0.1 (auto, Refs TBD-A20, retry 1)	2026-05-19 04:03:36 +00:00
github-actions[bot]	dee7703413	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.26	2026-05-19 04:03:27 +00:00
e3mrah	59980125ed	fix(networkpolicy): egress to CNPG data-plane Pods, not cnpg-system operator NS (TBD-A39, Closes #1901) (#1911) The CNPG operator runs in the `cnpg-system` namespace, but the actual Postgres workload Pods reconcile into the same namespace as the CNPG `Cluster` CR — for the auto-provisioned-DB blueprints that's `.Release.Namespace` (e.g. `newapi`, `harbor`). A NetworkPolicy egress rule that namespace-selects on `cnpg-system` reaches the operator pods only, NOT the Postgres workloads — every 5432 connection times out. Verified live on t31: `newapi-bp-newapi-newapi-pg-1` runs in `newapi` ns with label `cnpg.io/cluster=newapi-bp-newapi-newapi-pg`, while `newapi-bp-newapi-…` is stuck 1/2 Ready with 20 restarts because its egress NP allows 5432 only to `cnpg-system`. Fix: every affected NP now selects the Postgres workload Pods by the operator-emitted `cnpg.io/cluster=<clusterName>` Pod label — namespace-agnostic, survives the operator namespace being different from the data-plane namespace. Charts fixed (4): - bp-newapi (1.4.22 → 1.4.23) — auto-provisions CNPG Cluster in `.Release.Namespace`. Removed the bogus `namespaceLabel: cnpg-system` egress entry from values.yaml; added a podSelector-based rule (cnpg.io/cluster=<release>-bp-newapi-newapi-pg) directly in the template, gated by `.Values.cnpg.enabled`. - bp-harbor (1.2.17 → 1.2.18) — Cluster CR in `postgres.cluster.namespace | default .Release.Namespace` (default `harbor`). Changed egress from namespaceSelector=cnpg to podSelector cnpg.io/cluster=<postgres.cluster.name|default harbor-pg>. - bp-matrix (1.0.0 → 1.0.1) — chart points at matrix-postgres-rw.matrix.svc.cluster.local (Cluster CR in `.Release.Namespace`). Replaced `cnpgNamespace` value with `cnpgClusterName` (default `matrix-postgres`) and switched egress rule to podSelector. - bp-openmeter (1.0.0 → 1.0.1) — operator-supplied CNPG endpoint pattern. 
Replaced `cnpgNamespace` with `cnpgClusterName` (default `openmeter-pg`) and switched egress rule to podSelector. Same pattern as matrix. Audited and clean: - bp-cnpg-pair: already uses podSelectors throughout. - bp-wordpress-tenant: cnpgNamespaceLabel="" path resolves to `.Release.Namespace` via the `cnpgNamespace` helper. - bp-llm-gateway: already pod-selects on `cnpg.io/cluster=bp-llm-gateway-audit`. - bp-keycloak / bp-gitea / bp-grafana / bp-mimir: no own networkpolicy.yaml template (grafana/mimir pass enabled=false to upstream subcharts). Validation: - helm template render clean for all 4 charts. - `kubectl apply --dry-run=server` on t31 — all 4 NetworkPolicies accepted by the API server. - Verbatim render confirms the auto-emitted cluster name matches the label on the existing CNPG Pod (newapi-bp-newapi-newapi-pg). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 08:02:59 +04:00
github-actions[bot]	a92ef43beb	deploy: bump sandbox-controller image to `f442c28`	2026-05-19 04:02:49 +00:00
github-actions[bot]	be2833cfb4	deploy: bump sandbox-mcp-server image to `f442c28`	2026-05-19 04:01:21 +00:00
hatiyildiz	48687ef24d	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.185 -> 1.4.186 (auto, Refs TBD-A6)	2026-05-19 04:01:11 +00:00
e3mrah	dfa17c1b98	fix(catalyst-cnp): allow egress to TCP/6443 for multi-region fan-out (#1908 ) (#1912 ) TBD-A45 — baseline-default-deny CNP world-egress block previously allowed only 443/587/465/25, so catalyst-api fan-out to secondary kube-apiservers on TCP/6443 (D5/D16/D20) silently timed out on the informer reflector List() call and returned primary-only results. A152 diagnostic on t31 (3-region fresh prov): kubectl -n catalyst-system exec deploy/catalyst-api -- \ nc -zvw 3 49.12.210.78 6443 nc: connect to 49.12.210.78 port 6443 (tcp) timed out vs. SAME endpoint from the bastion: open. Fix: - Add TCP/6443 to the world toEntities egress block in templates/network-policies/baseline-catalyst-system.yaml. World scope is correct per the OpenOva ClusterMesh model — inter-region link is always DMZ over public IPs, secondary api-server LB FQDNs are per-prov and unpredictable at chart-render time. Attack surface is bounded by TLS client-cert auth (only secondary-region kubeconfigs on the catalyst-api PVC hold valid certs). - Extend tests/baseline-cnp-allowlist.sh (new Case 5b) so any future narrowing of this block fails Blueprint Release publish CI before the OCI artifact reaches a Sovereign. - Bump chart 1.4.185 -> 1.4.186 with full Chart.yaml header changelog. Real-cluster validation on t31 (primary, Cilium): - kubectl apply -f rendered-cnp.yaml -> CNP patched - nc from catalyst-api pod to 49.12.210.78:6443 -> open (was: timeout) - nc from catalyst-api pod to 5.223.74.173:6443 -> open (was: timeout) - catalyst-api rolled, new pod nc -> open (sticks across restarts) chart/tests/baseline-cnp-allowlist.sh: 13/13 cases pass (was 12). Closes #1908 Refs #1904 (this unblocks D5/D16/D20 fan-out RED) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 08:00:27 +04:00
e3mrah	f442c28174	fix(gitea-client): use POST /api/v1/orgs not /admin/orgs for org create (TBD-A43, Closes #1906) (#1910) Gitea 1.22+ no longer routes POST /api/v1/admin/orgs — that path is GET-only (admin list) and returns 405 with `Allow: GET`. The supported create endpoint is POST /api/v1/orgs (org-create-as-self): the authenticated principal owns the new Org. Because the organization-controller authenticates with the Gitea admin token (catalyst-gitea-token, owner=gitea_admin), the admin user owns each tenant Org — same semantic as the legacy admin path. Symptom on t31: catalyst-organization-controller loops on "gitea.EnsureOrg: create: gitea: POST .../api/v1/admin/orgs: HTTP 405", blocking D29 Step 7 (tenant Gitea Org provisioning). Real Gitea API proof (t31, Gitea 1.22.3): - BEFORE: POST /api/v1/admin/orgs → 405 Method Not Allowed (Allow: GET) - AFTER: POST /api/v1/orgs → 201 Created - 422 on duplicate username → unchanged (still mapped to errAlreadyExists) Closes #1906 Refs TBD-A43 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:59:08 +04:00
hatiyildiz	8b5cab3aae	deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.24 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 1)	2026-05-19 03:58:28 +00:00
hatiyildiz	11c70c6f14	deploy(bp-powerdns): bump bootstrap-kit pin -> 1.2.4 (auto, Refs TBD-A6, retry 1)	2026-05-19 03:58:05 +00:00
hatiyildiz	5e67c7c3f4	deploy(bp-keycloak): bump bootstrap-kit pin -> 1.4.6 (auto, Refs TBD-A6, retry 1)	2026-05-19 03:57:59 +00:00
hatiyildiz	8b1665a17c	deploy(bp-openbao): bump bootstrap-kit pin -> 1.2.17 (auto, Refs TBD-A6, retry 2)	2026-05-19 03:57:57 +00:00
hatiyildiz	57fb4c2c23	deploy(bp-gitea): bump bootstrap-kit pin -> 1.2.8 (auto, Refs TBD-A6, retry 2)	2026-05-19 03:57:55 +00:00
hatiyildiz	03aa91eaa2	deploy(bp-grafana): bump bootstrap-kit pin -> 1.0.2 (auto, Refs TBD-A6, retry 1)	2026-05-19 03:57:53 +00:00
hatiyildiz	901fdcd635	deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.18 (auto, Refs TBD-A6, retry 1)	2026-05-19 03:57:48 +00:00
hatiyildiz	76101f621a	deploy(bp-newapi): bump bootstrap-kit pin -> 1.4.23 (auto, Refs TBD-A6, retry 1)	2026-05-19 03:57:44 +00:00
github-actions[bot]	8586fff4ac	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.24	2026-05-19 03:57:40 +00:00
e3mrah	0a45a790e7	fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902 ) (#1909 ) PR #1888 (TBD-A30) fixed catalyst-system HTTPRoutes for multi-zone Sovereigns whose Cilium Gateway renames HTTPS listeners from `https` to `https-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`) when more than one parent zone is enabled. Every public HTTPRoute pinned to `sectionName: https` got `Accepted=False NoMatchingListener` and the hosted service 404'd / connection-refused. That fix only touched products/catalyst/chart. Per-blueprint HTTPRoutes shipped the same `sectionName: https` default in values.yaml, so on a multi-zone Sovereign every blueprint route — gitea, grafana, harbor, keycloak, newapi, openbao, powerdns, stalwart-tenant — silently failed to attach. TBD-A40 / issue #1902. Sweep verbatim: $ git grep -nE 'sectionName:[[:space:]]+(https|"https")[[:space:]]*$' \ platform/*/chart/ products/ clusters/ core/ 2>/dev/null \ | grep -v 'platform/gateway-api/chart/templates' platform/gitea/chart/values.yaml:168: sectionName: https platform/grafana/chart/values.yaml:124: sectionName: https platform/harbor/chart/values.yaml:437: sectionName: https platform/keycloak/chart/values.yaml:482: sectionName: https platform/newapi/chart/values.yaml:721: sectionName: https platform/openbao/chart/values.yaml:72: sectionName: https platform/powerdns/chart/values.yaml:407: sectionName: https platform/stalwart-tenant/chart/values.yaml:297: sectionName: https products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go:802: sectionName: https Fix (Option C — omit sectionName, same as PR #1888): - 8 blueprint values.yaml defaults flipped from `sectionName: https` to `sectionName: ""`. The chart templates already guard with `{{- with .Values.gateway.parentRef.sectionName }}`, so a blank value drops the field entirely and Cilium Gateway matches by hostname filter.
- platform/newapi/chart/templates/httproute.yaml was the outlier: it used `default "https" $parent.sectionName` which fell back to `https` even when values.yaml said empty. Rewritten to `{{- with $parent.sectionName }}` so empty drops the field — same pattern as the other 7 blueprints. - products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go renders a per-tenant bp-keycloak HelmRelease and injected `sectionName: https` into spec.values. Flipped to `sectionName: ""` so the bp-keycloak chart's `{{- with }}` guard drops the field. Validation (real `helm template`, default values, gateway enabled, no sectionName override) — Principle #15: gitea : sectionName lines in rendered output = 0 grafana : sectionName lines in rendered output = 0 harbor : sectionName lines in rendered output = 0 keycloak : sectionName lines in rendered output = 0 openbao : sectionName lines in rendered output = 0 powerdns : sectionName lines in rendered output = 0 newapi : sectionName lines in rendered output = 0 stalwart-tenant : sectionName lines in rendered output = 0 Override path preserved — `--set ...parentRef.sectionName=https-omani-works` on each chart renders `sectionName: "https-omani-works"` correctly, so operators on single-zone clusters or non-Cilium gateways can still pin explicitly via bootstrap-kit overlay. helm lint clean on all 8 blueprint charts (newapi cnpg-cluster.yaml lint error is pre-existing on origin/main, unrelated to this fix). Chart bumps (each blueprint also bumps blueprint.yaml spec.version per #817 lockstep): bp-gitea 1.2.7 -> 1.2.8 bp-grafana 1.0.1 -> 1.0.2 bp-harbor 1.2.17 -> 1.2.18 bp-keycloak 1.4.5 -> 1.4.6 bp-newapi 1.4.22 -> 1.4.23 bp-openbao 1.2.16 -> 1.2.17 bp-powerdns 1.2.3 -> 1.2.4 bp-stalwart-tenant 0.1.2 -> 0.1.3 Refs TBD-A40. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:57:12 +04:00
hatiyildiz	9657448a72	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.184 -> 1.4.185 (auto, Refs TBD-A6)	2026-05-19 03:34:36 +00:00
e3mrah	833214a5aa	fix(provisioning-rbac): grant create organizations.orgs.openova.io (TBD-A38, Closes #1900 ) (#1903 ) A143 D29 walk on t31 caught the tenant.created Kafka consumer 403ing in a 5s NAK-retry loop forever: 403 Forbidden: system:serviceaccount:sme:provisioning cannot create resource "organizations" in API group "orgs.openova.io" A29 PR #1860 shipped the Go consumer code that creates one Organization CR per voucher checkout (D29 step 5) but did NOT bump the chart RBAC. Step 5 fails -> steps 6/7/8 of the customer journey blocked. Add to ClusterRole sme-provisioning: - apiGroups: ["orgs.openova.io"] resources: ["organizations"] verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] Bump chart 1.4.184 -> 1.4.185. Validation per Principle #15 (real kubectl auth can-i against t31, not jq grep): $ kubectl --kubeconfig=/tmp/t31-primary.kubeconfig auth can-i create \ organizations.orgs.openova.io --as=system:serviceaccount:sme:provisioning Warning: resource 'organizations' is not namespace scoped in group 'orgs.openova.io' yes Same `yes` for get / list / watch / update / patch / delete. Pre-fix baseline was `no`. The ClusterRole was applied via `helm template . | yq 'select(.kind==ClusterRole)' | kubectl apply -f -`, then can-i re-run to confirm. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:33:58 +04:00
e3mrah	8535df6923	fix(sovereign-tls): cap Gateway annotations at 8 to satisfy gateway-api CRD (TBD-A36, Closes #1896 , Refs #1897 ) (#1898 ) PR #1889 added 10 Hetzner-LB annotations to `Gateway/cilium-gateway` `spec.infrastructure.annotations`. The Gateway-API CRD declares `maxProperties: 8` on that field, so Flux SSA rejected the manifest: spec.infrastructure.annotations: Too many: 10: must have at most 8 items → Gateway never reconciled → cilium-gateway-cilium-gateway Service stayed ClusterIP → no Hetzner LB at the Service layer → public TLS at console.<fqdn>:443 reset at the handshake. Blocked t28/t29/t30 since 2026-05-19 00:50:35Z. Fix (Option A per A130): drop the two health-check timing annotations (health-check-interval, health-check-timeout). hcloud-CCM defaults match the values we were declaring (15s / 10s) so runtime health-check behaviour is unchanged. The remaining 8 annotations are the minimum set required to materialise a public-IP TCP-health-checked Hetzner LB on the correct location/type with the correct backend port. Validated with `kubectl apply --dry-run=server` against the mothership cluster (Principle #15 — IaC evaluator over text grep) before merge. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 06:15:41 +04:00
e3mrah	4482428fa3	docs: add Principle 15 — validate IaC with the IaC evaluator, not Python/jq simulation (#1895 ) PR #1892 (TBD-A32 listener wildcard depth) was admin-merged with "verified via Python jsonencode() simulation" — but tofu HCL's type-unification rule rejected the ternary at plan-time. Every new prov failed at 23s. A128 hotfix (#1894) shipped with REAL tofu validate evidence. Codify the rule: for .tf/.tftpl use tofu validate / tofu plan; for Helm use helm template piped to kubectl apply --dry-run=server; for manifests use --dry-run=server (not client). Python json.dumps and jq greps are theater — they accept structurally-different shapes the IaC evaluator rejects. Refs PR #1892, PR #1894 (A128 hotfix). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 05:37:56 +04:00
github-actions[bot]	6582bc031d	deploy: update catalyst images to `20b502d`	2026-05-19 01:35:32 +00:00
e3mrah	20b502d790	fix(infra/hetzner): drop tuple-shape conditional in per_prov_listeners (TBD-A35, Closes #1886 ) (#1894 ) PR #1892 (TBD-A32 fix for shared-zone collision) introduced an HCL "Inconsistent conditional result types" error at infra/hetzner/main.tf line 468. Every fresh prov failed at tofu plan in 23s, e.g. A127 t29 attempt (deployment 4afd9ebceea92547) at 2026-05-19 01:08:41Z. Root cause: `local.per_prov_listeners` was defined as local.parent_domains_includes_sovereign_fqdn ? [] : [HTTPS_obj, HTTP_obj] HCL/tofu cannot unify the conditional arms: the true arm is `tuple([])` (length 0) and the false arm is `tuple([obj_with_tls, obj_without_tls])` (length 2). Even moving the conditional to the consumer line in `concat()` did not fix it — the same length-0 vs length-2 tuple unification still fails. Fix: emit `per_prov_listeners` unconditionally as the 2-element tuple, then suppress it at the `concat()` consumer with a for-iteration filter [for l in local.per_prov_listeners : l if !<collides>] which always produces a list (length 0 or 2 — same element type), so HCL never needs to unify two tuple types. Validated locally with OpenTofu v1.8.5 against a minimal tfvars fixture: - `tofu validate` → "Success! The configuration is valid." - `tofu console` with sovereign_fqdn="t29.omani.works", parent="omani.works": emits 4 listeners (parent https/http for *.omani.works + per-prov https-t29-omani-works/http-t29-omani-works for *.t29.omani.works) — matches PR #1892's intent. - `tofu console` with sovereign_fqdn="omani.works" (collision): emits 2 listeners (only parent https/http) — collision guard preserved. No chart bump; this is a tofu-only change. Re-closes #1886 after #1892 re-opened it via the type-mismatch regression. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 05:33:35 +04:00
github-actions[bot]	1b31b85d42	deploy: update catalyst images to `0020ef8`	2026-05-19 01:25:23 +00:00
e3mrah	0020ef8129	fix(catalyst-api): seed owner UserAccess at bake-time, not at handover (TBD-A34, Closes #1891 ) (#1893 ) D21 (owner UserAccess CR) was previously only seeded by auth_handover.go::seedOwnerUserAccess after a live PIN-login. The zero-touch convergence verifier cannot drive a PIN-login from CI, so D21 stayed RED on every fresh prov until an operator manually authenticated — even though SOVEREIGN_FQDN + OPERATOR_EMAIL + the UserAccess CRD are all stable on the chroot from bake-time onward. This slice adds a bake-time goroutine in main() that calls the existing handler.EnsureOwnerUserAccess against the in-cluster dynamic client when: - the dynamic client is non-nil (in-cluster mode), - SOVEREIGN_FQDN env is set (chroot mode), and - OPERATOR_EMAIL env is set (orgEmail stamped via sovereign-fqdn ConfigMap). Capped backoff (0/5/10/20/40s) tolerates the UserAccess CRD rolling behind us. Idempotent — EnsureOwnerUserAccess folds AlreadyExists to nil, so the existing handover-fired path still works without regression. Each skip / converged / error path logs at Info or Warn so an operator can confirm bake-time seeding from stdout without scraping the CR. Tests in cmd/api/main_test.go cover the happy path, all three skip branches (nil client, empty SOVEREIGN_FQDN, empty OPERATOR_EMAIL), and an idempotent re-run simulating Pod restart. Refs A116 diagnostic; supersedes the handover-only seed path for zero-touch verification. Closes #1891 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 05:22:13 +04:00
github-actions[bot]	b34f56dd22	deploy: update catalyst images to `1da2162`	2026-05-19 01:04:29 +00:00
e3mrah	1da216205a	fix(gateway): add per-prov 2-label wildcard listener for shared parent zones (Closes #1886 , TBD-A32) (#1892 ) The Cilium Gateway template emits `hostname: *.<parent-zone>` listeners (e.g. `*.omani.works`). Per Gateway-API spec wildcard semantics that matches EXACTLY one label depth, so `foo.omani.works` matches but `console.t28.omani.works` does NOT. On every shared-parent-zone topology (every per-prov Sovereign under omani.works) the operator-facing FQDN is 2-label-deep — `curl -skI https://console.t28.omani.works/` reset at TLS handshake even though `sovereign-wildcard-tls-t28-omani-works` already contained all 13 per-prov SANs. Fix: locals.per_prov_listeners in infra/hetzner/main.tf appends an extra listener pair hostnamed `*.<sovereign_fqdn>` bound to the per-prov cert `sovereign-wildcard-tls-<fqdn-dashed>` rendered by clusters/_template/sovereign-tls/cilium-gateway-cert.yaml. Skipped when sovereign_fqdn equals one of the declared parent-zone names (legacy single-zone-on-apex case) so no duplicate listener-name Conflict. Verified by simulated jsonencode against three scenarios: 1. t28 multi-zone (sovereign_fqdn=t28.omani.works, parent_domains= [omani.works, omani.homes]) — emits 6 listeners: https-omani-works hostname=*.omani.works cert=sovereign-wildcard-tls-omani-works http-omani-works hostname=*.omani.works https-omani-homes hostname=*.omani.homes cert=sovereign-wildcard-tls-omani-homes http-omani-homes hostname=*.omani.homes https-t28-omani-works hostname=*.t28.omani.works cert=sovereign-wildcard-tls-t28-omani-works http-t28-omani-works hostname=*.t28.omani.works 2. t28 single parent zone (sovereign_fqdn=t28.omani.works, parent_domains=[omani.works]) — emits 4 listeners (bare `https`/`http` for backward-compat with legacy sectionName HTTPRoutes + per-prov `https-t28-omani-works`/`http-t28-omani-works`). 3. Legacy apex (sovereign_fqdn=omani.works, parent_domains= [omani.works]) — collision guard active, emits only bare `https`/`http`.
All scenarios produce unique listener names. Safe because every catalyst-system HTTPRoute now omits sectionName (PR #1888 closing #1884) — Cilium attaches via hostname match, so the per-prov 2-label listener catches `console.<fqdn>` / `api.<fqdn>` / `marketplace.<fqdn>` / etc. Refs A110 t28 scorecard, A107 D29 walk. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 05:02:36 +04:00
hatiyildiz	ae4ead480a	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.183 -> 1.4.184 (auto, Refs TBD-A6)	2026-05-19 00:55:00 +00:00
e3mrah	ed91f40d57	fix(sovereign-tls): wire Cilium Gateway listener at per-prov cert; stop parent-zone wildcard render (TBD-A29, Closes #1883 ) (#1890 ) The Sovereign's Cilium Gateway listener `https-<parent-zone>` referenced the parent-zone wildcard Secret `sovereign-wildcard-tls-<sanitised(parent)>` (e.g. `sovereign-wildcard-tls-omani-works` for `*.omani.works`). That cert is minted by `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` and SHARES Let's Encrypt's "5 New Certificates per Exact Set of Identifiers per 168h" bucket with every other Sovereign on the same parent zone. After ~5 wipe+reprov cycles on `omani.works` the listener pinned to a `Ready=False` Certificate (cert-manager spun the order forever, LE returned `urn:ietf:params:acme:error:rateLimited`). A107 t28 evidence: per-prov cert `sovereign-wildcard-tls-t28-omani-works` IS `Ready=True` but unused. Fix (two parts): 1. `infra/hetzner/main.tf` — `parent_domains_listeners_yaml` now points each listener's `tls.certificateRefs[0].name` at the PER-PROV cert `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}` (rendered by `clusters/_template/sovereign-tls/cilium-gateway-cert.yaml` with the explicit SAN list `[console.<sovereign-fqdn>, auth.<sovereign-fqdn>, ..., sandbox.<sovereign-fqdn>]`). Per-prov identifier sets get their own 5/168h bucket per Sovereign so reprovs never share LE budget. New `local.sovereign_fqdn_dashed = replace(var.sovereign_fqdn, ".", "-")` is the SAME suffix `cilium-gateway-cert.yaml` / `cilium-envoy-tls-restart-job.yaml` already use, so the listener + cert + restart-job RBAC stay in lockstep. 2. `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` -- skip-render unconditionally (`{{- if false }}` wrap around the `wildcardCert.enabled` guard). The parent-zone wildcards it minted are no longer referenced by anything and burn LE budget on every install.
Template body kept for `git blame` / future revival under issue #831 (multi-listener per-zone tenant TLS with non-wildcard SAN lists). Removes 2 Certificate resources per multi-zone Sovereign. Verification (helm template): helm template products/catalyst/chart \ --set parentZones[0].name=omani.works --set parentZones[0].role=primary \ --set parentZones[1].name=omani.homes --set parentZones[1].role=sme-pool \ --set global.sovereignFQDN=t28.omani.works \ --set wildcardCert.enabled=true \ | grep -c 'sovereign-wildcard-cert' # before: 2 (two parent-zone Certificates rendered) # after: 0 (zero -- template skip-renders) Chart bumped 1.4.182 -> 1.4.183 so the next Blueprint Release republishes the OCI artifact with the skip-render change. Hostname semantics unchanged: listener `hostname: *.<parent-zone>` still matches any FQDN under the parent; cilium-envoy SNI dispatch serves the per-prov cert whose SAN list covers the requested hostname (operator's console/auth/gitea/etc. subdomains under `<sovereign-fqdn>`). Tenant URLs under non-primary parent zones (`wp-foo.omani.homes`) remain out of scope for A29; those need explicit per-tenant cert wiring via #831. Closes #1883 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:54:18 +04:00
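
The listener and certificate commits above share one naming convention: dots in the Sovereign FQDN are replaced with dashes, as in `replace(var.sovereign_fqdn, ".", "-")`. The Python sketch below is an illustration only, not code from the repository. The function names are hypothetical, and the example values are the ones quoted in the commit messages for `t28.omani.works`. It is not a substitute for the `tofu validate` / `helm template` checks the log calls for.

```python
def fqdn_dashed(sovereign_fqdn: str) -> str:
    """Mirror the HCL expression replace(var.sovereign_fqdn, ".", "-")."""
    return sovereign_fqdn.replace(".", "-")


def per_prov_cert_secret(sovereign_fqdn: str) -> str:
    """Per-prov TLS Secret name, following the pattern quoted in the commits."""
    return f"sovereign-wildcard-tls-{fqdn_dashed(sovereign_fqdn)}"


def per_prov_listener_names(sovereign_fqdn: str) -> list[str]:
    """HTTPS/HTTP listener names for the per-prov listener pair (hypothetical helper)."""
    suffix = fqdn_dashed(sovereign_fqdn)
    return [f"https-{suffix}", f"http-{suffix}"]


if __name__ == "__main__":
    print(per_prov_cert_secret("t28.omani.works"))
    print(per_prov_listener_names("t28.omani.works"))
```

Deriving the listener names and the Secret name from the same dashed suffix is what keeps the listener, cert, and restart-job references in lockstep, as the ed91f40d57 message describes.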

2648 Commits