openova

Author	SHA1	Message	Date
e3mrah	2050e72c69	fix(infra): refactor L3 ExternalIP reconciler to write_files + bump CP guardrail to 32256 (Closes #1981 , Refs #1979 #1941 ) (#1985 ) PR #1979 (TBD-A50 layer 3, merged 18:00Z 2026-05-19) added the idempotent ExternalIP reconciler as inline runcmd heredocs and bumped the rendered cloud-init guardrail from 30720 to 31744. The ~3 KiB of inline bash + systemd unit heredocs overshot the new headroom: t36 fresh-prov tofu plan FAILED with rendered control-plane cloud-init at ~32498 B vs the 31744 B guardrail (754 B over). Issue #1981. This PR repackages PR #1979 using the PR #1978 pattern that fixed the analogous #1977 / TBD-A52 incident: - Adds an `l3` subcommand to /usr/local/bin/openova-externalip-bootstrap.sh (the same write_files script that hosts `l1` + `l2`). Same reconciler logic — read /etc/openova/cp-public-ipv4, compare to Node ExternalIP, restart k3s on mismatch, log to /var/log/openova-externalip.log. - Adds two new write_files entries for the systemd .service + .timer unit files (replaces the 3× cat-heredoc runcmd block). - The runcmd L3 step collapses from 77 lines of inline heredocs to a single token: `systemctl daemon-reload && systemctl enable --now openova-extip-reconcile.timer`. - Bumps the CP cloud-init guardrail from 31744 to 32256 (Hetzner hard cap 32768 minus 512 B safety buffer), applied to both primary + secondary CP preconditions in main.tf. The +512 B headroom buys room for the next legitimate addition without re-tripping the gate. ## Behavior Behavior identical to PR #1979 — same reconciler script, same exit codes (0=ok, 2=no-file, 3=apiserver-unreachable, 4=unrecovered), same systemd .service `SuccessExitStatus=0 2 3 4`, same .timer `OnBootSec=2min / OnUnitActiveSec=5min`. Diagnostic strings trimmed (~150 B saved) but key tokens preserved (`OK`, `MISMATCH`, `RECOVERED`, `FATAL nofile`, `FATAL apiserver`, `FATAL unrec`, `#1941` reference). 
## Validation (Principle #15) - `tofu validate infra/hetzner/` → Success - Templatefile() measurement harness (`/tmp/measure-cloudinit/`, same fixture PR #1978 used): - pre-fix rendered: 31865 B (over fixture 30720 by 1145 B) - post-fix rendered: 31130 B (under new 32256 guardrail with 1126 B headroom) - savings: ~735 B vs PR #1979 baseline - Production headroom (after +633 B fixture↔prod variance offset): estimated 31763 B in prod, 493 B headroom under new 32256 guardrail. - `shellcheck` on rendered bootstrap script: clean (only one pre-existing SC2034 for loop counter `i`, present before this PR). - Mock test 3-case battery (matching/missing-file/mismatch-recovers): rc=0/2/0 with expected log tokens. ## Hard rules - `Closes #1981` because acceptance is code-level (size proof + tofu validate). The functional Refs #1941 closure still depends on fresh-prov walk demonstrating timer fires + log accumulates. - READ-ONLY on cluster. No Secrets touched. No emrah.baysal email / Stalwart admin API touched. Refs #1941, #1979, #1978, #1977, #1958, #966. Co-authored-by: hatiyildiz <alierenbaysal@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:58:57 +04:00
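The size figures quoted in commit 2050e72c69 can be re-derived with a few lines of arithmetic. This is a minimal Python sketch, not code from the repository: the function names `guardrail` and `headroom` are hypothetical, and the byte counts are the ones stated in the commit message.

```python
# Illustrative re-derivation of the cloud-init size arithmetic.
# All byte counts are taken from the commit message; the helper
# names are hypothetical and do not exist in the repository.
HETZNER_HARD_CAP = 32768   # hard cap cited in the message
SAFETY_BUFFER = 512        # safety buffer cited in the message


def guardrail(cap: int = HETZNER_HARD_CAP, buffer: int = SAFETY_BUFFER) -> int:
    """Return the precondition limit: hard cap minus safety buffer."""
    return cap - buffer


def headroom(rendered_bytes: int, limit: int) -> int:
    """Return bytes left under the limit (negative means over the limit)."""
    return limit - rendered_bytes


if __name__ == "__main__":
    limit = guardrail()
    fixture_post_fix = 31130                 # post-fix fixture measurement
    prod_estimate = fixture_post_fix + 633   # fixture-to-prod variance offset
    print("guardrail:", limit)
    print("fixture headroom:", headroom(fixture_post_fix, limit))
    print("prod headroom:", headroom(prod_estimate, limit))
    print("t36 vs old 31744 limit:", headroom(32498, 31744))
```

A negative result from `headroom` corresponds to the failing precondition, as in the t36 plan that rendered at about 32498 B against the 31744 B limit.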
e3mrah	b96d731fcd	fix(infra): idempotent ExternalIP reconciler (TBD-A50 layer 3, Refs #1941 ) (#1979 ) Layer 3 of the three-layer Hetzner ExternalIP guard. Layers 1 (fail-fast on empty metadata curl) + 2 (post-install ExternalIP assertion) shipped in PR #1958; this PR adds the periodic reconciler so a node that somehow loses its ExternalIP post-boot (operator-initiated k3s restart without the env var, kubelet flag drift after an in-place upgrade, cloud-init partial-replay) can recover WITHOUT a re-provision. ## What lands A new runcmd item in cloudinit-control-plane.tftpl writes three files on first boot via heredocs: - `/usr/local/bin/openova-extip-reconcile.sh` — script that reads `/etc/openova/cp-public-ipv4` (persisted by Layer 1), compares against `kubectl get node $hostname -o jsonpath=...ExternalIP`, restarts k3s on mismatch, re-verifies, appends every run to `/var/log/openova-externalip.log` - `/etc/systemd/system/openova-extip-reconcile.service` — `Type=oneshot`, `SuccessExitStatus=0 2 3 4` so the timer doesn't back off on diagnostic exit codes - `/etc/systemd/system/openova-extip-reconcile.timer` — `OnBootSec=2min`, `OnUnitActiveSec=5min`, `AccuracySec=30s` The runcmd ends with `systemctl daemon-reload && systemctl enable --now`. Recovery path is INDEPENDENT of cloud-init: an operator can manually `printf '%s' <ip> > /etc/openova/cp-public-ipv4` and the next timer fire reconciles. No external dependency — pure systemd unit. ## Size guardrail The 30720-byte rendered cloud-init guardrail (issue #966) on the primary + secondary CP `hcloud_server` resources bumped to 31744 to absorb the ~2 KiB Layer 3 payload (still 1 KiB under the Hetzner hard 32768 cap). Worker variants stay at 30720 — cloudinit-worker.tftpl is untouched. 
## Validation - `tofu validate infra/hetzner/` → Success (Principle #15) - `shellcheck` on the rendered script body → 0 warnings - Mock-test of all branches (matching IP no-op; empty IP recovers via restart; missing expected-file exit 2) → 3/3 pass ## Hard rule Refs #1941 not Closes. Closure requires the fresh 3-region prov walk + in-cluster verification of the timer firing (`systemctl status openova-extip-reconcile.timer`) and the log file accumulating entries (`tail /var/log/openova-externalip.log`). Refs #1941 Co-authored-by: hatiyildiz <alierenbaysal@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:00:51 +04:00
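The reconciler's control flow described in commit b96d731fcd can be modelled as a small pure function. This is a sketch under stated assumptions, not the shipped bash script: the function and parameter names are hypothetical, the exit-code meanings (0=ok, 2=no-file, 3=apiserver-unreachable, 4=unrecovered) and log tokens are those listed in the follow-up commit 2050e72c69 above, and treating an unreachable apiserver after the restart as exit 3 is an assumption.

```python
# Hypothetical model of the Layer 3 reconcile decision. The real
# implementation is a bash script driven by a systemd timer; the
# callables here stand in for file reads, kubectl and systemctl.
from typing import Callable

OK, NO_FILE, APISERVER_UNREACHABLE, UNRECOVERED = 0, 2, 3, 4


def reconcile(
    read_expected_ip: Callable[[], str],
    get_external_ip: Callable[[], str],
    restart_k3s: Callable[[], None],
    log: Callable[[str], None],
) -> int:
    """Compare the persisted IP with the Node ExternalIP; restart once on mismatch."""
    expected = read_expected_ip()
    if not expected:
        log("FATAL nofile")
        return NO_FILE
    try:
        current = get_external_ip()
    except ConnectionError:
        log("FATAL apiserver")
        return APISERVER_UNREACHABLE
    if current == expected:
        log("OK")
        return OK
    log("MISMATCH")
    restart_k3s()
    try:
        current = get_external_ip()  # re-verify after the restart
    except ConnectionError:
        log("FATAL apiserver")  # assumption: same code as the pre-restart case
        return APISERVER_UNREACHABLE
    if current == expected:
        log("RECOVERED")
        return OK
    log("FATAL unrec")
    return UNRECOVERED
```

Injecting the side effects as callables keeps the three mock cases from the validation section (matching, missing file, mismatch that recovers) easy to exercise without a cluster.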
e3mrah	6b428b1304	fix(infra): move Layer 1+2 bash to write_files to fit cloud-init under 30720 (Closes #1977 , Refs #1958 , #1941 ) (#1978 ) PR #1958 (TBD-A50, merged 14:45Z 2026-05-19) inlined Layer 1 (fail-fast on empty Hetzner public-ipv4) and Layer 2 (post-install ExternalIP assertion) as runcmd: heredocs in cloudinit-control-plane.tftpl. The combined ~2.6 KB of bash pushed the rendered control-plane cloud-init PAST the 30 720 B Hetzner guardrail enforced by the precondition at infra/hetzner/main.tf:1036: condition = length(local.control_plane_cloud_init) <= 30720 t35 fresh provision (2026-05-19 17:12Z, 3-region cpx52) FAILED at tofu apply plan-validation with that precondition firing for the primary CP AND both secondary regions (nbg1-2 + hel1-1). Every fresh provision since #1958 merged is blocked by this regression — Issue #1977, TBD-A52. Fix: move the bash bodies into a write_files entry as /usr/local/bin/openova-externalip-bootstrap.sh, exposed as two subcommands `l1` and `l2`. The runcmd: items now just invoke the script via single-token calls: - /usr/local/bin/openova-externalip-bootstrap.sh l1 - <k3s install line - unchanged> - <wait /healthz - unchanged> - /usr/local/bin/openova-externalip-bootstrap.sh l2 Behavior is identical to PR #1958: - L1 still fail-fasts with exit 87 when Hetzner metadata returns empty body for public-ipv4. Validated IP persists to /etc/openova/cp-public-ipv4 so the next runcmd reads it from disk. - L2 still polls Node ExternalIP up to 60s, restarts k3s once if empty, polls another 60s post-restart, exits 88 if still empty. - Same DoD A2 invariant guard, same Issue #1941 / TBD-A50 coverage. Side effects: - Verbose diagnostic echo strings trimmed (saves ~600 B). Exit codes 87/88 + in-script identifier (l1-fatal/l2-fatal) + Issue #1941 ref are enough for the cloud-init.log root-cause lookup. Operator runbooks reference the exit codes — those are preserved. - Stripped template size: 25 443 B (#1958) → 24 315 B (this PR). 
- Rendered cloud-init (post-substitution, with t35-shape vars): ~33 600 B → ~29 800 B in t35-equivalent model — back under the 30 720 B guardrail. - Layer 3 (idempotent reconciler) is being worked on in parallel by agent ac0b077a — this refactor leaves headroom (~2.7 KB) for a third subcommand `l3` on the same script (no new write_files envelope cost). Validation: - `tofu validate infra/hetzner/` → "Success! The configuration is valid." (OpenTofu v1.8.5) - Mock templatefile() + strip-regex measurement: rendered size with realistic t35-shape placeholders = 29 816 B, 904 B headroom under the 30 720 B guardrail. - Heredoc body content preserved verbatim (kubectl invocations, polling loops, restart-once flow, exit codes). diff against PR #1958 shows pure repackaging — no semantic change to the runtime bash. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:57:00 +04:00
e3mrah	c0b61541c4	fix: default MARKETPLACE_ENABLED=true at source (TBD-V4) — Closes #1968 , Refs #1966 (#1971 ) * fix: default MARKETPLACE_ENABLED=true at source (provisioner + tofu + wizard) — Closes #1968, Refs #1966 PR #1967 changed only the bootstrap-kit slot fallback to `${MARKETPLACE_ENABLED:-true}`, but provisioner.go:1213 was still writing `MARKETPLACE_ENABLED: "false"` literal to tfvars (req.MarketplaceEnabled bool zero=false), substituting through the envsubst-replaced default and leaving franchised Sovereigns marketplace-disabled despite the slot flip. This commit pairs the source-side default flip across all three layers: 1. handler/deployments.go CreateDeployment — pre-initialise the provisioner.Request with `MarketplaceEnabled: true` BEFORE json.Decode. encoding/json only assigns fields present in the body, so a POST that OMITS marketplaceEnabled keeps the pre-init true while the wizard's explicit `marketplaceEnabled: false` (StepMarketplace opt-OUT) still wins. Canonical Go pattern for default-true bool fields without changing the struct shape. 2. infra/hetzner/variables.tf — flip the `marketplace_enabled` tofu var default from `"false"` to `"true"` so a `tofu plan` outside catalyst-api (CI mocks, manual replays) matches the new semantics. 3. UI store.test.ts — update the stale assertion that expected `marketplaceEnabled === false`; INITIAL_WIZARD_STATE.marketplaceEnabled has been true since the D27 zero-touch ruling on 2026-05-16, and the persist-rehydrate path already defaults missing values to true (store.ts:789). The test was the last remnant of the pre-D27 default. Bumps bp-catalyst-platform Chart.yaml 1.4.206 → 1.4.207 and the matching bootstrap-kit pin so the chart-pin-versus-GHCR CI gate accepts the new release. 
Unit test TestCreateDeployment_MarketplaceEnabledDefaultsTrue covers all three semantics: - omitted-defaults-true → MarketplaceEnabled=true - explicit-true-passes-through → MarketplaceEnabled=true - explicit-false-wizard-opt-out → MarketplaceEnabled=false Closes #1968 Refs #1966 #1741 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra/hetzner): escape $${MARKETPLACE_ENABLED:-true} in variable description OpenTofu interpreted the unescaped `${MARKETPLACE_ENABLED:-true}` inside the description string as a template interpolation and rejected the module init with "Variables not allowed" + "Extra characters after interpolation expression". The `${...}` shell-style envsubst syntax must be doubled to `$${...}` for OpenTofu to treat it as a literal. Caught by `infra/hetzner — OpenTofu validate + test` CI on PR #1971. Refs #1968 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 20:21:55 +04:00
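The default-true semantics from item 1 of commit c0b61541c4 (set the default before decoding, so only fields present in the body overwrite it) can be illustrated with a Python analogue. This is not the Go handler: `decode_request` is a hypothetical name, and skipping `null` values is included only to mirror how Go's `encoding/json` leaves a non-pointer field unchanged on JSON `null`.

```python
# Python analogue of "pre-initialise, then decode". The real code is a
# Go handler using encoding/json; this function is purely illustrative.
import json


def decode_request(body: str) -> dict:
    """Decode a JSON body on top of a request pre-seeded with defaults."""
    request = {"marketplaceEnabled": True}  # default set BEFORE decoding
    decoded = json.loads(body)
    # Only keys present in the body (and not null) override the default.
    request.update({k: v for k, v in decoded.items() if v is not None})
    return request


if __name__ == "__main__":
    for body in ("{}", '{"marketplaceEnabled": true}', '{"marketplaceEnabled": false}'):
        print(body, "->", decode_request(body)["marketplaceEnabled"])
```

The three bodies in the demo correspond to the three cases the unit test in the commit names: omitted, explicit true, and explicit false (wizard opt-out).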
e3mrah	bf3fa91be3	fix(infra): fail-fast on missing Hetzner public IP + post-install ExternalIP assertion (Refs #1941 , A2 invariant) (#1958 ) * fix(infra): fail-fast on missing Hetzner public IP + post-install ExternalIP assertion (Refs #1941, A2 invariant) PR #1715 added `--node-external-ip=$CP_PUBLIC_IPV4` to the k3s server install line, but the metadata curl was chained with `&&` to the install command. If Hetzner metadata returns HTTP 200 with EMPTY body (observed on t34, 2026-05-19), `curl -fsSL` exits 0, `CP_PUBLIC_IPV4=""`, and the chain proceeds to install k3s with `--node-external-ip=` (empty). k3s happily enrolls the node with InternalIP=10.0.1.2 and NO ExternalIP → Cilium tunnel endpoint stays on the locally-scoped private IP → every cross-region VXLAN tunnel resolves to 10.0.1.2 on the peer side → inter-region pod traffic blackholes. DoD A2 invariant ("inter-region link = DMZ WireGuard over PUBLIC IPs ALWAYS") VIOLATED. Blocks D31 (CNPG hot-standby), G5 (Hubble inter-region), all multi-region pod-to-pod. Issue #1941 / TBD-A50. Layer 1 — fail-fast guard in cloud-init: - Split the metadata curl into its own runcmd item with `|| true` so we can inspect the result without failing the whole script. - Validate the returned value is non-empty; if empty, dump curl -v diagnostics and exit 87 — cloud-init.log surfaces the FATAL immediately instead of a silent ClusterMesh blackhole hours later. - Persist the validated IP to /etc/openova/cp-public-ipv4 so the next runcmd item (the k3s install) and downstream items can read it without re-curl'ing. Layer 2 — post-install ExternalIP assertion: - After `until kubectl get --raw /healthz`, poll node.status.addresses[type=ExternalIP] for 60s. - If empty, restart k3s ONCE (the systemd unit on disk already carries --node-external-ip from the install) and recheck for another 60s. 
- If still empty after restart, exit 88 with the full node YAML in stderr — cloud-init.log surfaces the regression and the operator knows D11/D31/G5 will fail BEFORE any application workload tries to schedule. Layer 3 (idempotent periodic reconciler that re-asserts ExternalIP post-boot) is filed as a separate follow-up issue — bigger scope, needs a systemd timer + image roll. Not blocking #1941 closure. Validation: - `tofu validate` against infra/hetzner/ → "Success! The configuration is valid." - Inline bash tests for both fail-fast paths: * mock curl returns empty body, exit 0 → script exits 87 ✓ * mock curl returns "49.13.123.45", exit 0 → script persists IP and continues ✓ - Rendered cloud-init size (after comment-strip in main.tf:997) = 25 443 bytes, well under the 30 720 byte guardrail (line 1037). DO NOT close #1941 with this PR — closure requires a fresh 3-region provision walk + cross-region pod-to-pod ping. PR ships the cloud-init guards; convergence walk validates end-to-end. Refs #1941 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(infra): tofu fmt main.tf (pre-existing whitespace drift unblocking CI) The infra-hetzner-tofu.yaml workflow runs `tofu fmt -check -recursive` before validate. main.tf has accumulated whitespace alignment drift on two locals blocks (lines ~867-880 and ~1417-1455 — secondary-region templatefile() arg lists) that has caused that workflow to fail RED on every push and PR for 2+ days. This PR cannot reach a green check without unblocking it. This commit is whitespace-only (`tofu fmt`) — no semantic change. Kept in a separate commit from the load-bearing #1941 fix in the previous commit so reviewers can audit the data-plane change independently. Refs #1941 --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 18:45:19 +04:00
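The two guards introduced in commit bf3fa91be3 can be sketched as plain functions. This is an illustrative model rather than the cloud-init bash: the names `layer1` and `layer2` are hypothetical, the one-second poll interval is an assumption (the message only gives the 60 s windows), and the exit codes 87 and 88 are those stated in the commit.

```python
# Hypothetical model of Layer 1 (fail-fast on empty metadata) and
# Layer 2 (poll, restart k3s once, poll again).
from typing import Callable, Tuple

L1_FATAL, L2_FATAL = 87, 88


def layer1(metadata_body: str) -> Tuple[int, str]:
    """Reject an empty public-ipv4 body even when curl itself exited 0."""
    ip = metadata_body.strip()
    if not ip:
        return L1_FATAL, ""
    return 0, ip  # the real script persists this value to disk


def layer2(
    get_external_ip: Callable[[], str],
    restart_k3s: Callable[[], None],
    polls_per_phase: int = 60,                     # assumption: one poll per second
    sleep: Callable[[float], None] = lambda s: None,
) -> int:
    """Poll for ExternalIP, restart k3s once if still empty, then poll again."""
    for phase in range(2):
        for _ in range(polls_per_phase):
            if get_external_ip():
                return 0
            sleep(1)
        if phase == 0:
            restart_k3s()  # restart happens at most once
    return L2_FATAL
```

Passing `sleep` in as a parameter lets a test run both 60-poll phases instantly; a real caller would pass `time.sleep`.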
e3mrah	20b502d790	fix(infra/hetzner): drop tuple-shape conditional in per_prov_listeners (TBD-A35, Closes #1886 ) (#1894 ) PR #1892 (TBD-A32 fix for shared-zone collision) introduced an HCL "Inconsistent conditional result types" error at infra/hetzner/main.tf line 468. Every fresh prov failed at tofu plan in 23s, e.g. A127 t29 attempt (deployment 4afd9ebceea92547) at 2026-05-19 01:08:41Z. Root cause: `local.per_prov_listeners` was defined as local.parent_domains_includes_sovereign_fqdn ? [] : [HTTPS_obj, HTTP_obj] HCL/tofu cannot unify the conditional arms: the true arm is `tuple([])` (length 0) and the false arm is `tuple([obj_with_tls, obj_without_tls])` (length 2). Even moving the conditional to the consumer line in `concat()` did not fix it — the same length-0 vs length-2 tuple unification still fails. Fix: emit `per_prov_listeners` unconditionally as the 2-element tuple, then suppress it at the `concat()` consumer with a for-iteration filter [for l in local.per_prov_listeners : l if !<collides>] which always produces a list (length 0 or 2 — same element type), so HCL never needs to unify two tuple types. Validated locally with OpenTofu v1.8.5 against a minimal tfvars fixture: - `tofu validate` → "Success! The configuration is valid." - `tofu console` with sovereign_fqdn="t29.omani.works", parent="omani.works": emits 4 listeners (parent https/http for *.omani.works + per-prov https-t29-omani-works/http-t29-omani-works for *.t29.omani.works) — matches PR #1892's intent. - `tofu console` with sovereign_fqdn="omani.works" (collision): emits 2 listeners (only parent https/http) — collision guard preserved. No chart bump; this is a tofu-only change. Re-closes #1886 after #1892 re-opened it via the type-mismatch regression. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 05:33:35 +04:00
e3mrah	1da216205a	fix(gateway): add per-prov 2-label wildcard listener for shared parent zones (Closes #1886 , TBD-A32) (#1892 ) The Cilium Gateway template emits `hostname: *.<parent-zone>` listeners (e.g. `*.omani.works`). Per Gateway-API spec wildcard semantics that matches EXACTLY one label depth, so `foo.omani.works` matches but `console.t28.omani.works` does NOT. On every shared-parent-zone topology (every per-prov Sovereign under omani.works) the operator-facing FQDN is 2-label-deep — `curl -skI https://console.t28.omani.works/` reset at TLS handshake even though `sovereign-wildcard-tls-t28-omani-works` already contained all 13 per-prov SANs. Fix: locals.per_prov_listeners in infra/hetzner/main.tf appends an extra listener pair hostnamed `*.<sovereign_fqdn>` bound to the per-prov cert `sovereign-wildcard-tls-<fqdn-dashed>` rendered by clusters/_template/sovereign-tls/cilium-gateway-cert.yaml. Skipped when sovereign_fqdn equals one of the declared parent-zone names (legacy single-zone-on-apex case) so no duplicate listener-name Conflict. Verified by simulated jsonencode against three scenarios: 1. t28 multi-zone (sovereign_fqdn=t28.omani.works, parent_domains= [omani.works, omani.homes]) — emits 6 listeners: https-omani-works hostname=*.omani.works cert=sovereign-wildcard-tls-omani-works http-omani-works hostname=*.omani.works https-omani-homes hostname=*.omani.homes cert=sovereign-wildcard-tls-omani-homes http-omani-homes hostname=*.omani.homes https-t28-omani-works hostname=*.t28.omani.works cert=sovereign-wildcard-tls-t28-omani-works http-t28-omani-works hostname=*.t28.omani.works 2. t28 single parent zone (sovereign_fqdn=t28.omani.works, parent_domains=[omani.works]) — emits 4 listeners (bare `https`/`http` for backward-compat with legacy sectionName HTTPRoutes + per-prov `https-t28-omani-works`/`http-t28-omani-works`). 3. Legacy apex (sovereign_fqdn=omani.works, parent_domains= [omani.works]) — collision guard active, emits only bare `https`/`http`. 
All scenarios produce unique listener names. Safe because every catalyst-system HTTPRoute now omits sectionName (PR #1888 closing #1884) — Cilium attaches via hostname match, so the per-prov 2-label listener catches `console.<fqdn>` / `api.<fqdn>` / `marketplace.<fqdn>` / etc. Refs A110 t28 scorecard, A107 D29 walk. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 05:02:36 +04:00
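The listener-naming rules exercised by the three scenarios in commit 1da216205a can be modelled in a few lines. This is a Python sketch of the naming logic only, not the HCL in `infra/hetzner/main.tf`: `listener_names` and `dashed` are hypothetical names, and the sketch ignores hostnames, ports and certificate refs.

```python
# Hypothetical model of the Gateway listener names: bare https/http for a
# single parent zone, zone-suffixed names for several, plus a per-prov pair
# that is filtered out when the sovereign FQDN is itself a parent zone.
from typing import List


def dashed(name: str) -> str:
    """Turn a DNS name into the dash-separated suffix used in listener names."""
    return name.replace(".", "-")


def listener_names(sovereign_fqdn: str, parent_domains: List[str]) -> List[str]:
    single_zone = len(parent_domains) == 1
    parent: List[str] = []
    for zone in parent_domains:
        suffix = "" if single_zone else "-" + dashed(zone)
        parent += ["https" + suffix, "http" + suffix]
    # Always build the per-prov pair, then filter it (rather than branching)
    # so the result is a list of the same element type in every case.
    per_prov = ["https-" + dashed(sovereign_fqdn), "http-" + dashed(sovereign_fqdn)]
    collides = sovereign_fqdn in parent_domains
    return parent + [name for name in per_prov if not collides]


if __name__ == "__main__":
    print(listener_names("t28.omani.works", ["omani.works", "omani.homes"]))
    print(listener_names("t28.omani.works", ["omani.works"]))
    print(listener_names("omani.works", ["omani.works"]))
```

Building the per-prov pair unconditionally and filtering it afterwards follows the same shape as the for-iteration filter that the later fix in commit 20b502d790 adopts.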
e3mrah	ed91f40d57	fix(sovereign-tls): wire Cilium Gateway listener at per-prov cert; stop parent-zone wildcard render (TBD-A29, Closes #1883 ) (#1890 ) The Sovereign's Cilium Gateway listener `https-<parent-zone>` referenced the parent-zone wildcard Secret `sovereign-wildcard-tls-<sanitised(parent)>` (e.g. `sovereign-wildcard-tls-omani-works` for `*.omani.works`). That cert is minted by `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` and SHARES Let's Encrypt's "5 New Certificates per Exact Set of Identifiers per 168h" bucket with every other Sovereign on the same parent zone. After ~5 wipe+reprov cycles on `omani.works` the listener pinned to a `Ready=False` Certificate (cert-manager spun the order forever, LE returned `urn:ietf:params:acme:error:rateLimited`). A107 t28 evidence: per-prov cert `sovereign-wildcard-tls-t28-omani-works` IS `Ready=True` but unused. Fix (two parts): 1. `infra/hetzner/main.tf` — `parent_domains_listeners_yaml` now points each listener's `tls.certificateRefs[0].name` at the PER-PROV cert `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}` (rendered by `clusters/_template/sovereign-tls/cilium-gateway-cert.yaml` with the explicit SAN list `[console.<sovereign-fqdn>, auth.<sovereign-fqdn>, ..., sandbox.<sovereign-fqdn>]`). Per-prov identifier sets get their own 5/168h bucket per Sovereign so reprovs never share LE budget. New `local.sovereign_fqdn_dashed = replace(var.sovereign_fqdn, ".", "-")` is the SAME suffix `cilium-gateway-cert.yaml` / `cilium-envoy-tls-restart-job.yaml` already use, so the listener + cert + restart-job RBAC stay in lockstep. 2. `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` -- skip-render unconditionally (`{{- if false }}` wrap around the `wildcardCert.enabled` guard). The parent-zone wildcards it minted are no longer referenced by anything and burn LE budget on every install. 
Template body kept for `git blame` / future revival under issue #831 (multi-listener per-zone tenant TLS with non-wildcard SAN lists). Removes 2 Certificate resources per multi-zone Sovereign. Verification (helm template): helm template products/catalyst/chart \ --set parentZones[0].name=omani.works --set parentZones[0].role=primary \ --set parentZones[1].name=omani.homes --set parentZones[1].role=sme-pool \ --set global.sovereignFQDN=t28.omani.works \ --set wildcardCert.enabled=true \ | grep -c 'sovereign-wildcard-cert' # before: 2 (two parent-zone Certificates rendered) # after: 0 (zero -- template skip-renders) Chart bumped 1.4.182 -> 1.4.183 so the next Blueprint Release republishes the OCI artifact with the skip-render change. Hostname semantics unchanged: listener `hostname: *.<parent-zone>` still matches any FQDN under the parent; cilium-envoy SNI dispatch serves the per-prov cert whose SAN list covers the requested hostname (operator's console/auth/gitea/etc. subdomains under `<sovereign-fqdn>`). Tenant URLs under non-primary parent zones (`wp-foo.omani.homes`) remain out of scope for A29; those need explicit per-tenant cert wiring via #831. Closes #1883 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:54:18 +04:00
e3mrah	139a620ea7	fix(sovereign-tls): cilium-gateway propagates Hetzner LB annotations via spec.infrastructure (#1889 ) Closes #1885 (TBD-A31). Problem (t28 evidence — A98 + A107 reports, 2026-05-19 00:30Z): `console.t28.omani.works:443` accepts TCP but TLS resets. Inspection: `kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned `hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying 443→30443 service-port at the infra layer, the cluster-side hcloud-CCM has no signal to materialise a parallel Service-level LB for the auto-generated gateway Service — so operators inspecting kubectl see a non-LoadBalancer Service and conclude the LB chain is broken. Fix: Add `spec.infrastructure.annotations` to the Gateway resource. The Gateway-API spec mandates that controllers propagate these annotations to any infrastructure resources they create — in Cilium 1.16+ this means the auto-generated `cilium-gateway-cilium-gateway` Service in kube-system. hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) then picks the annotations up at Service reconcile time and provisions a Hetzner LB. 
Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml): - load-balancer.hetzner.cloud/name = <slug>-<region>-gateway - load-balancer.hetzner.cloud/location = <Hetzner DC> - load-balancer.hetzner.cloud/type = lb11 - load-balancer.hetzner.cloud/use-private-ip = "false" (DoD A2 — public IPs always) - load-balancer.hetzner.cloud/disable-private-ingress = "true" - load-balancer.hetzner.cloud/health-check-protocol = tcp - load-balancer.hetzner.cloud/health-check-port = "30443" - load-balancer.hetzner.cloud/health-check-interval = 15s - load-balancer.hetzner.cloud/health-check-timeout = 10s - load-balancer.hetzner.cloud/health-check-retries = "3" Per-region segmentation: SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in the LB name so each multi-region peer's cilium-gateway gets its own public LB (Hetzner LBs are unique-by-name; duplicate-name allocations collapse to the first-created instance, hiding the LB for every subsequent region). Wiring: 3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY, HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's postBuild.substitute block. These mirror the same vars already passed to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block in 01-cilium.yaml apiserver.service.annotations, so the configuration boundary is symmetric across the gateway LB and the clustermesh LB. 
Memory rules respected: - A2 (PUBLIC IPs for inter-region) — use-private-ip=false - feedback_overlap_provs_dont_serialize_wait (no provisioning gate) - feedback_subagents_inherit_design_system (no new architectural seam, reuses existing Gateway-API + hcloud-CCM contracts) Validation: $ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway' → renders all 10 Hetzner LB annotations under spec.infrastructure → ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION} substituted at Flux apply time Acceptance criteria (per issue): - kubectl get svc -n kube-system cilium-gateway-cilium-gateway shows type=LoadBalancer with external IP (after fresh prov + handover) - curl -skI https://console.<fqdn>/ returns HTTP 200 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:50:35 +04:00
e3mrah	f07312c5ae	fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852 ) Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179 + bootstrap-kit pin bump + cloud-init substitute extension, because each fix is small and they share the same fresh-prov verification cycle. TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the chroot in-cluster fallback's k8sCache.Factory reflector emitted continuous `networkpolicies is forbidden` errors at the cluster scope because only update/patch/delete were granted (existing mutation block) — the read path was never wired. Mirrors the existing cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7). TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields configuredRegions / controlPlaneIP / primaryRegion / replicaRegion / selfDeploymentId / enableHotStandby / qaApplications empty on every fresh prov. Pre-fix the envsubst placeholders resolved to empty because nothing wrote them into the bootstrap-kit Kustomization postBuild substitute map → the chart rendered empty strings → Dashboard SovereignCard configured-regions chips, Settings page operator-identity, /api/v1/sovereign/self, and the D31 active-hot-standby gating ALL silently fell through to default behaviour. 
Wired via three coordinated changes: - Chart values.yaml gains global.sovereignSelfDeploymentId default - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId, sovereign.configuredRegions, sovereign.qaApplications mappings (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`) - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION / SOVEREIGN_REPLICA_REGION (canonical 4-segment labels), SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty), SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list), QA_APPLICATIONS_YAML (reserved, default `[]`) - main.tf: new template inputs sovereign_configured_regions_yaml + replica_region_canonical_label (derived from local.secondary_regions), threaded into both primary CP and per-secondary-region cloud-init templatefile calls TBD-A10b (issue #1845) — GET /api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409 kubeconfig-file-missing on fresh prov for every region. Pre-fix the handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly call with the bare `cloudRegion` (`?region=hel1`) because that's the matrix-doc-friendly form. Fall-back resolution order added to GetKubeconfig: exact-name first (legacy + manual operator PUT), then `<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test covers all three paths: exact match, slot-suffix glob, unknown-region still 409. Closes the regression introduced when PR #1763 (mothership→chroot kubeconfig handover hook) started using the cloud-init naming convention for fan-out exports. 
Closes #1843, Closes #1844, Closes #1845 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:21:38 +04:00
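The kubeconfig fallback order described for TBD-A10b in commit f07312c5ae (exact filename first, then a `<id>-<region>-*.yaml` glob with sorted results) can be sketched as follows. This is an illustrative Python model, not the Go `GetKubeconfig` handler: `resolve_kubeconfig` is a hypothetical name, returning `None` stands in for the 409 response, and picking the first sorted match is an assumption.

```python
# Hypothetical model of the kubeconfig filename resolution order.
import fnmatch
from typing import Iterable, Optional


def resolve_kubeconfig(
    deployment_id: str, region: str, filenames: Iterable[str]
) -> Optional[str]:
    """Return the kubeconfig filename for a region, or None if nothing matches."""
    names = list(filenames)
    exact = f"{deployment_id}-{region}.yaml"
    if exact in names:                       # 1. exact name (legacy / manual PUT)
        return exact
    pattern = f"{deployment_id}-{region}-*.yaml"
    matches = sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))
    if matches:                              # 2. slot-suffixed name, e.g. hel1-1
        return matches[0]                    # assumption: first sorted match wins
    return None                              # 3. unknown region (409 in the handler)


if __name__ == "__main__":
    files = ["d1-hel1-1.yaml", "d1-nbg1-2.yaml", "d1-fsn1.yaml"]
    for region in ("fsn1", "hel1", "sin"):
        print(region, "->", resolve_kubeconfig("d1", region, files))
```

Sorting the glob matches keeps the result deterministic when several slot-suffixed files exist for the same cloud region.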
e3mrah	0538f6ee68	fix(infra): advertise public IP as k3s node-external-ip so Cilium inter-region tunnel works (Refs TBD-A7) (#1715 ) Add --node-external-ip=$${CP_PUBLIC_IPV4} to the k3s server install in infra/hetzner/cloudinit-control-plane.tftpl so every CP publishes BOTH node.status.addresses[InternalIP=10.0.1.2] AND ExternalIP=<public ipv4>. Bug evidence (Wave 28-E, t22-omantel-biz 2026-05-18): hel/fsn/sin all advertise InternalIP=10.0.1.2 with NO ExternalIP. After the 2026-05-15 per-region-network refactor every region's CP sits in its OWN isolated hcloud_network, so 10.0.1.2 is locally scoped on each VPS and NOT routable cross-region. Cilium picks the InternalIP as its tunnel endpoint by default → cross-region VXLAN tunnels resolve to 10.0.1.2 on every peer → inter-region pod traffic blackholes (pod-to-pod 0/6 across regions). docs/SOVEREIGN-MULTI-REGION-DOD.md A2 mandate: "inter-region link = DMZ WireGuard over PUBLIC IPs ALWAYS (never any provider's private network)". Publishing the public IPv4 as ExternalIP lets Cilium promote it to the tunnel endpoint when peer addresses include External + Internal, which restores cross-region pod reachability without breaking intra-cluster paths — InternalIP stays primary for kube-apiserver advertise + pod-to-CP dial (the original reason --node-ip was pinned to private in PR-#62-era; the comment at line 1370-1378 still holds and is preserved). Effect: - Only takes effect on FRESH provisions (t23+). t22 already deployed cannot be remediated by a cloudinit change. - Both primary CP and secondary CPs go through this same template (main.tf templatefile() calls for primary at line 636 and per secondary at line 1187), so a single template edit covers all regions. - Approach A (smaller / immediate). Approach B (DMZ WireGuard overlay DaemonSet per platform/bp-dmz-vcluster/) follows as architectural follow-up if A alone doesn't fully resolve cross-region pod traffic on t23+. 
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:58:14 +04:00
e3mrah	cc13aec980	fix(sovereign-tls): bare https/http listener names when single parent zone (collision with chart HTTPRoutes sectionName) (#1682 ) PR #1640 renamed Cilium Gateway listeners to `https-<sanitised-zone>` / `http-<sanitised-zone>` to support multi-zone Sovereigns (primary + SME pool). That broke single-zone Sovereigns because every platform chart's HTTPRoute (harbor, keycloak, grafana, gitea, openbao, powerdns, stalwart-tenant) hardcodes `parentRefs[0].sectionName: https`. Result: every HTTPRoute reports `Accepted=False NoMatchingListener`, Sovereign Console / Harbor / Keycloak etc. unreachable through the Gateway. Fix: when `len(parent_domains_decoded) == 1` (the common case), render listener names as the bare strings `https` / `http`. When > 1 (SME pool present), keep the unique `https-<zone>` / `http-<zone>` naming so the Gateway controller doesn't hit a duplicate-name Conflicting condition. Multi-zone tenants whose HTTPRoutes must attach under a non-primary zone override `sectionName` via values.yaml — out of scope here. The per-zone certificateRefs.name (`sovereign-wildcard-tls-<sanitised-zone>`) is unchanged — independent of the listener name. Verified: kubectl kustomize clusters/_template/sovereign-tls/ clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:51:42 +04:00
e3mrah	422da46360	fix(sovereign-tls): cilium-gateway listeners per parentZone (#1640 ) Issue #831 follow-on to #827. Previously the Cilium Gateway declared a single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under non-primary parent zones (e.g. wp-foo.omani.homes when the operator brings omani.homes as the SME pool) hit cilium-envoy's default fallback cert and TLS-handshake-mismatched. The per-zone wildcard Secret rendered by products/catalyst/chart/templates/sovereign-wildcard-certs.yaml (PR #827) existed but had no Gateway listener claiming its hostname. Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent zone. Materialised at Terraform plan time as a JSON-flow array (infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode of the listener objects iterating decoded parent_domains_yaml), threaded through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}` in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay in lockstep. Scalar placeholder (not multi-line block) because kustomize-build parses the YAML before Flux runs envsubst — a placeholder on its own line at column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then swaps it for the JSON-flow array string, which the apiserver parses as the real listener list. Single-zone fallback preserved (var.parent_domains_yaml empty → [{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions (e.g. primary omani.works + sme-pool omani.homes) render 4 listeners. Verification: - kubectl kustomize clusters/_template/sovereign-tls/ → clean - End-to-end simulation (single-zone, two-zone) renders correct listener counts (2 / 4) with correct certificateRefs per zone. 
- Listener naming `https-<sanitised>` / `http-<sanitised>` is unique per listener so Gateway controller programs them all (duplicate names produce Conflicting status condition). Files: - clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar listeners placeholder + comment block explaining the why) - infra/hetzner/main.tf (locals.parent_domains_decoded + locals.parent_domains_listeners_yaml; threaded into primary CP and secondary regions' templatefile() calls) - infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML substitute var in sovereign-tls Kustomization block) Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:09:26 +04:00
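A rough sketch of the per-zone listener rendering from #1640: one HTTPS plus one HTTP listener per parent zone, serialised as a single-line JSON array that can stand in for a scalar placeholder. The function name `render_listeners`, the `*.<zone>` hostname and the dot-to-dash sanitisation are assumptions for illustration; the field layout follows the Gateway API listener schema, and the real rendering is done with `jsonencode` in main.tf.

```python
import json


def render_listeners(parent_zones: list[str]) -> str:
    """Render one HTTPS + one HTTP listener per zone as a one-line JSON array."""
    listeners = []
    for zone in parent_zones:
        slug = zone.replace(".", "-")  # assumed sanitisation rule
        listeners.append({
            "name": f"https-{slug}",
            "hostname": f"*.{zone}",  # assumed wildcard hostname per zone
            "port": 30443,
            "protocol": "HTTPS",
            "tls": {
                "mode": "Terminate",
                # Listener and certificate Secret share the same zone slug.
                "certificateRefs": [{"name": f"sovereign-wildcard-tls-{slug}"}],
            },
        })
        listeners.append({
            "name": f"http-{slug}",
            "hostname": f"*.{zone}",
            "port": 30080,
            "protocol": "HTTP",
        })
    # json.dumps without indent yields a single line, which is also valid YAML flow syntax.
    return json.dumps(listeners)


if __name__ == "__main__":
    print(render_listeners(["omani.works", "omani.homes"]))
```

Because the output has no newlines, it can replace a scalar `${VAR}` placeholder without disturbing the surrounding YAML structure.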
e3mrah	0242be5c49	fix(infra): PR O — cilium-gateway TLS references per-zone wildcard cert (#1595 ) t143 hit LE PROD rate limit (50 certs/week on omani.works exhausted) because TWO cert templates compete for the same parent-domain quota: 1. clusters/_template/sovereign-tls/cilium-gateway-cert.yaml — legacy SAN cert named `sovereign-wildcard-tls` 2. products/catalyst/chart/templates/sovereign-wildcard-certs.yaml — chart per-zone cert named `sovereign-wildcard-tls-<sanitised-zone>` The Cilium Gateway listener hardcoded the legacy name, so when LE 429s the legacy cert (as happened on t143), HTTPS to console.<fqdn> breaks even though the per-zone cert is Ready. Fix: gateway listener now references `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}`. Cloud-init substitutes SOVEREIGN_FQDN_DASHED = replace(fqdn, ".", "-") in the sovereign-tls Kustomization postBuild.substitute. The per-zone cert from the chart provides the Ready Secret with this exact name. The legacy cilium-gateway-cert.yaml SAN cert still renders for backward-compat (some consumers may still reference it), but the gateway listener no longer depends on it for TLS termination. Bumps no chart version — the change is at the Flux/Kustomize layer. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 13:19:10 +04:00
e3mrah	c148ec6a34	fix(cloudinit): escape $${ORG_EMAIL:-}/$${ORG_NAME:-} in comment (D22) (#1575 ) PR #1571 added a comment mentioning the ${ORG_EMAIL:-}/${ORG_NAME:-} slot-file placeholders WITHOUT the $$ escape. tofu's templatefile() parses comments and tried to interpolate ${ORG_EMAIL:-} as a tofu expression — failing with "Extra characters after interpolation expression; Template interpolation doesn't expect a colon". Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s. The escape pattern is documented at main.tf:1029 (the same warning that caught t127 last week). $$ prefix tells tofu's templatefile to emit literal ${...} to cloud-init for Flux envsubst. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:31:26 +04:00
e3mrah	57939585c0	feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) (#1571 ) * feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22) Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment reads to populate the deployment record so Sovereign Console Settings page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL, orgName (instead of `—` placeholders). Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName, controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with empty defaults. Per-Sovereign overlays wire actual values from cloud- init substitute placeholders (mirrors regionsJson pattern). Catalyst-api Pod now reads them via valueFrom configMapKeyRef + optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap so env stays empty there — correct, mothership is signer not validator). Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP post-#1568. This PR fills the remaining 3 D22 fields when operator wires the values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(slot-13): add D22 sovereign-side identity placeholders Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} + ${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569) → catalyst-api env → chrootEnsureDeployment populates the deployment record → Settings page renders real values instead of `—`. This PR alone is a no-op (placeholders default to empty, same as today). The cloud-init substitute lines + provisioner.go tfvars need to land in a companion PR to actually populate the values on next-prov. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit Kustomization's postBuild.substitute env, which the slot-13 placeholders (#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}. Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment populates the deployment record (#1567 + #1568 fallback). SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes the dependency cycle (hcloud_server.cp doesn't exist at cloudinit render time). Separate PR will source it via metadata-service or post-create ConfigMap patch. Next-prov (t133+) Sovereign Console Settings page now renders real ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 01:47:04 +04:00
e3mrah	1c988b9a4b	fix(firewall): open NodePort range 30000-32767 for clustermesh LB (D11) (#1538 ) PR #1537's use-private-ip approach was not viable: the per-region Hetzner LB has no private-network attachment by default (LB private_net is empty) and our DoD A2 architecture pins one private /24 per region that does NOT span across regions. The LB->backend hop has to transit the public path. The actual blocker is the Sovereign firewall: it permits 80/443/6443/53 and blocks the NodePort range. Hetzner LB TCP health-check probes `<node-public-ip>:<NodePort>` and gets dropped → all targets marked unhealthy → external clients see "unexpected eof while reading" at TLS handshake → cilium clustermesh agent stays `0/N remote clusters ready, Waiting for initial connection`. Security: clustermesh-apiserver requires mTLS. Peer agents must present a client cert signed by the peer cluster's cilium-ca (PR #1530). Anonymous connections rejected at handshake. mTLS is the security boundary, NOT the firewall — opening NodePorts is safe here. Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — completes the D11 incident chain (#1525 → #1528 → #1530 → #1536 → this). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:44:02 +04:00
e3mrah	1f30a08ae3	fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534 ) The Sovereign-side catalyst-api runs in "chroot" mode — it has no parent prov record, so chrootEnsureDeployment synthesises a minimal in-memory Deployment with only SovereignFQDN set. The /infrastructure/topology loader then sees empty Request.Regions[] and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes) which only sees THIS cluster's Node(s) → emits exactly 1 Region even on a 3-region Sovereign. /cloud?view=graph renders as "1 cluster 1 region" — DoD D5 failure. Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported `console.t126.omani.works/cloud?view=graph` showed 1 region despite mothership openova-flow snapshot holding all 3 regions correctly. This PR threads the canonical multi-region RegionSpec[] from the mothership prov body all the way to the Sovereign-side catalyst-api: tofu var.regions → jsonencode → sovereign_regions_json tftpl var → cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON → bp-catalyst-platform slot 13 sovereign.regionsJson value → sovereign-fqdn ConfigMap key `regionsJson` → catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom) → chrootEnsureDeployment parses JSON, populates Request.Regions[] → topology loader emits one Region per spec entry Single-region Sovereigns: var.regions has length 1; chart writes the array literal; chroot synth still produces 1 Region — no regression. Empty env: chroot falls back to live-Nodes path (legacy behavior preserved). Refs DoD D5. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:45:24 +04:00
e3mrah	357feb0843	fix(tofu): escape ${...} in comment that broke templatefile() (t127) (#1533 ) Unescaped `${DMZ_VCLUSTER_ENABLED:=true}` Flux envsubst expression inside a tftpl comment was being parsed by tofu's templatefile() as a tftpl interpolation. tofu's `:=` is not a valid tftpl operator, so tofu plan failed with: ./cloudinit-control-plane.tftpl:1021,71-72: Extra characters after interpolation expression; Template interpolation doesn't expect a colon at this location. Every other `${...}` reference in tftpl comments in this file is properly escaped as `$${...}` (e.g. lines 12, 850, 893, 971, 996, 1039, 1138). Mine slipped through PR #1531. Fix: rewrite the comment to NOT include any `${...}` expression (since the expression was just illustrative), avoiding the escape gymnastics entirely. Caught on t127 (b7942a70f7516e9e, 2026-05-16) — first prov after PR #1531 landed FAILED in tofu plan stage within 60s. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:39:43 +04:00
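The escape rule behind #1533 (and the later #1575) is that a `${NAME:-default}` or `${NAME:=default}` envsubst expression inside a .tftpl file must be written as `$${...}` so templatefile() emits it literally. A minimal lint sketch follows; the function names are hypothetical, and the regex deliberately matches only shell-style default expressions so that genuine template interpolations such as `${region}` are left alone.

```python
import re

# `${NAME:-` or `${NAME:=` that is NOT already preceded by `$` (i.e. not `$${`).
UNESCAPED_ENVSUBST = re.compile(r"(?<!\$)\$\{[A-Za-z_][A-Za-z0-9_]*:[-=]")


def find_unescaped(template: str) -> list[int]:
    """Return 1-based line numbers holding an unescaped envsubst default expression."""
    return [
        number
        for number, line in enumerate(template.splitlines(), start=1)
        if UNESCAPED_ENVSUBST.search(line)
    ]


def escape_for_templatefile(text: str) -> str:
    """Prefix each unescaped envsubst default expression with an extra `$`."""
    return UNESCAPED_ENVSUBST.sub(lambda m: "$" + m.group(0), text)


if __name__ == "__main__":
    sample = "# default is ${DMZ_VCLUSTER_ENABLED:=true}\nlabel=${region}\n"
    print(find_unescaped(sample))
    print(escape_for_templatefile(sample))
```

A check like this could run before `tofu plan`, since comments are parsed by templatefile() just like any other template text.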
e3mrah	904686ff0d	fix(vcluster): canonical region label substitute + per-role enable flags (#1531 ) Caught on t126 (84c0848406dd6fdd, 2026-05-16): bp-{dmz,mgmt,rtz}-vcluster charts installed but DMZ Pods Pending on every region with FailedScheduling. Pod nodeSelector was `openova.io/region=hel1` (from `${SOVEREIGN_REGION_KEY}` substitute = Hetzner region key "hel1"/"nbg1-1"/"sin-2"), but the k3s node-label is `openova.io/region=hz-hel-rtz-prod` (canonical 4-segment label written by cloud-init from `region_canonical_label` per PR #1512). Mismatch meant every vCluster Pod across every region sat Pending. MGMT + RTZ slot 58/59 charts also default-OFF with no substitute flipping them on per the DoD A4 topology (primary=MGMT+DMZ; secondary=DMZ+RTZ). This PR: 1. Adds `SOVEREIGN_REGION_CANONICAL_LABEL` substitute to tofu cloud-init `bootstrap-kit` postBuild block, sourced from per-region `region_canonical_label` tftpl var. 2. Adds `MGMT_VCLUSTER_ENABLED` + `RTZ_VCLUSTER_ENABLED` substitutes — primary CP renders true/false, secondary CP renders false/true. 3. Updates bootstrap-kit slots 54/58/59 to use the canonical label substitute. Slots 58/59 also read the per-role enable flag. Expected post-deploy state on a fresh 3-region prov: primary: DMZ + MGMT vCluster Pods Running (RTZ rendered zero) secondary: DMZ + RTZ vCluster Pods Running (MGMT rendered zero) Refs DoD A4 (vCluster topology). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 17:28:06 +04:00
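The per-role enable flags from #1531 reduce to a two-way switch on whether the control plane is the primary. The sketch below only restates that mapping; `vcluster_flags` is a hypothetical helper, and the real values are rendered by the tofu cloud-init template.

```python
def vcluster_flags(is_primary: bool) -> dict[str, str]:
    """Map the CP role to the substitute values described in the commit message."""
    return {
        # Primary hosts MGMT + DMZ; secondaries host DMZ + RTZ (DoD A4 topology).
        "MGMT_VCLUSTER_ENABLED": "true" if is_primary else "false",
        "RTZ_VCLUSTER_ENABLED": "false" if is_primary else "true",
    }


if __name__ == "__main__":
    print(vcluster_flags(is_primary=True))
    print(vcluster_flags(is_primary=False))
```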
e3mrah	ed19bb3f8d	fix(k3s): --disable-cloud-controller so providerID stays empty for our patch (#1524 ) Caught on t123 (a3bfa56adbcfb049, 2026-05-16): Gap A v3.1's patch loop hit k8s validation error: The Node "catalyst-t123-omani-works-cp1" is invalid: spec.providerID: Forbidden: node updates may not change providerID except from "" to valid k8s allows setting providerID from empty → valid, but NOT changing it. k3s's embedded cloud controller sets providerID=k3s://<hostname> BEFORE our cloud-init runcmd patch fires (race window). Once set, the patch is rejected. Fix: --disable-cloud-controller (alone, NOT with the cloud-provider= external kubelet arg that caused the chicken-and-egg taint in reverted PR #1513). This disables the k3s embedded cloud controller so it never sets providerID; the kubelet leaves providerID empty; our runcmd patch successfully sets hcloud://<id>. hcloud-ccm (installed later via Flux) sees the correct providerID and allocates per-region LBs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 15:25:54 +04:00
e3mrah	0ebd137547	fix(cloud-init): retry providerID patch up to 30× when Node not yet registered (#1523 ) Caught on t122 (7e519eb997af236c, 2026-05-16): primary + sin patched fine, but nbg1's kubectl patch failed because the Node object hadn't yet appeared in the apiserver between healthz OK and Node registration. Result: nbg1 stuck at providerID=k3s://... → CCM rejected its LB allocation → clustermesh-apiserver external_ip stayed <pending> on nbg1 → AutoEstablishClusterMesh couldn't fully mesh. Add a 30-iter loop (150s budget): get node first; if found, patch; else sleep 5. Hetzner apiserver registers Nodes within ~10-30s of k3s install on healthy clusters. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 14:58:59 +04:00
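The retry shape in #1523 (check for the Node, patch if present, otherwise sleep 5 s, up to 30 iterations for a 150 s budget) can be sketched as below. The real implementation is a shell loop around kubectl in cloud-init; here `node_exists` and `patch` are hypothetical callables standing in for those kubectl calls, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def patch_when_registered(
    node_exists: Callable[[], bool],
    patch: Callable[[], None],
    attempts: int = 30,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Patch once the Node is registered; give up after attempts * delay_s seconds."""
    for _ in range(attempts):
        if node_exists():
            patch()
            return True
        sleep(delay_s)
    # Budget exhausted (30 * 5 s = 150 s by default); the caller decides what to log.
    return False


if __name__ == "__main__":
    answers = iter([False, False, True])
    ok = patch_when_registered(
        node_exists=lambda: next(answers),
        patch=lambda: print("patched providerID"),
        sleep=lambda s: print(f"node not registered yet, sleeping {s}s"),
    )
    print("result:", ok)
```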
e3mrah	ef93a2cdbe	feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520 ) Architecturally-clean replacement for the reverted PRs #1513 (k3s flag) and #1516 (pre-install hcloud-ccm). Both prior approaches broke cold-start (chicken-and-egg with the uninitialized taint). This patch instead lets k3s boot normally with its default embedded cloud controller (which sets `providerID=k3s://<hostname>` — the problem), then immediately patches the local Node's `spec.providerID` to `hcloud://<id>` using the Hetzner instance metadata endpoint (169.254.169.254). The patch runs ONCE per CP node, right after k3s apiserver healthz becomes reachable, BEFORE flux-bootstrap.yaml applies the bootstrap-kit Kustomization. Once providerID has the canonical `hcloud://` prefix, bp-hcloud-ccm (installed by Flux later in the bootstrap-kit chain) accepts the node as a Hetzner-managed instance and allocates LBs for Service type=LoadBalancer normally. That unblocks: - D12: clustermesh-apiserver Service gets a real external IP instead of <pending> - D10: AutoEstablishClusterMesh (PR #1508) can read each region's LB IP and write peer entries into cilium-clustermesh Secret - D11: inter-region pod-to-pod traffic flows via Cilium WG over the per-region LB IPs - D5: child catalyst-api can reach secondary regions via mesh, so /cloud view aggregates all 3 regions instead of 1/1 Failure is non-fatal: if metadata lookup or patch fails, we log and continue (bp-hcloud-ccm has a chance to set providerID later via its own node-list-and-match logic). Cold-start is never blocked. Canonical topology (1 cpx52 per region, workerCount=0) means every node is a CP — covered by this patch. Operator-added workers (workerCount>0) would also need providerID patched; a follow-up Job in bp-providerid-patcher can iterate all nodes post-Flux. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 14:12:26 +04:00
e3mrah	766890510b	Revert PR #1516 + #1517 — Gap A hcloud-ccm pre-install hangs cloud-init (#1518 ) * Revert "fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)" This reverts commit `05c6edb4fe`. * Revert "fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)" This reverts commit `b7140b9069`. --------- Co-authored-by: claude <claude@anthropic.com>	2026-05-16 13:32:18 +04:00
e3mrah	05c6edb4fe	fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517 ) PR #1516 added ~3KB of hcloud-ccm bootstrap manifests inline (Secret + ServiceAccount + ClusterRoleBinding + Deployment with full toleration list + container args). Rendered cloud-init now exceeds the 30720 precondition on every primary + secondary CP: Error: Resource precondition failed on main.tf line 716: length(local.control_plane_cloud_init) <= 30720 Caught on t118 prov (0619287065fb58c8, 2026-05-16): apply failed at both primary AND nbg1-1 + sin-2 simultaneously. Hetzner hard cap is 32768 bytes. Bump guardrail to 32000 (97.7% of hard cap) — leaves a 768-byte safety margin while admitting the bytes the hcloud-ccm pre-install legitimately needs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 13:15:21 +04:00
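The guardrail arithmetic in #1517 is simple enough to check mechanically. The sketch below mirrors the tofu precondition `length(...) <= guardrail` and reports the two margins involved; `guardrail_report` is a hypothetical helper and the 31500-byte rendered size in the demo is an invented input, not a measurement.

```python
HETZNER_USER_DATA_CAP = 32768  # hard cap in bytes, as stated in the commit message


def guardrail_report(rendered_bytes: int, guardrail: int) -> dict[str, object]:
    """Compare a rendered cloud-init size with the guardrail and the hard cap."""
    return {
        "passes": rendered_bytes <= guardrail,  # same shape as the precondition
        "headroom_under_guardrail": guardrail - rendered_bytes,
        "margin_to_hard_cap": HETZNER_USER_DATA_CAP - guardrail,
    }


if __name__ == "__main__":
    # 31500 is a made-up rendered size used only to exercise the function.
    print(guardrail_report(rendered_bytes=31500, guardrail=30720))
    print(guardrail_report(rendered_bytes=31500, guardrail=32000))
```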
e3mrah	b7140b9069	fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516 ) DoD multi-region gates D5/D10/D11/D12-LB-pending all trace to one root cause: k3s sets node.spec.providerID=k3s://<hostname>. hcloud-ccm rejects every LoadBalancer-Service allocation because the prefix isn't hcloud://, so clustermesh-apiserver Service stays <pending> → AutoEstablishClusterMesh (PR #1508) hard-fails → no peer entries → no inter-region pod traffic → openova-flow-emitter on secondaries can't reach openova-flow-server on primary → /cloud view sees only 1 region. PR #1513 attempted the kubelet-flag-only fix (--cloud-provider=external + --disable-cloud-controller) banking on Flux's bp-hcloud-ccm slot 55 to install the CCM. Reverted in PR #1514 because Flux pods themselves cannot land on a node tainted node.cloudprovider.kubernetes.io/uninitialized=NoSchedule — chicken-and-egg, 0 HRs after 30 min. Architecturally correct fix: pre-install hcloud-ccm via raw manifests in cloud-init, BEFORE flux-bootstrap.yaml apply. Once the Deployment runs (with uninitialized-taint toleration), CCM matches the node to its Hetzner server, writes providerID=hcloud://<id>, kubelet lifts the taint, Flux proceeds normally. Flux later "adopts" this Deployment via bp-hcloud-ccm HelmRelease (release name collides cleanly with `helm upgrade --install`). Changes: - cloudinit-control-plane.tftpl: - Re-add k3s install flags --disable-cloud-controller + --kubelet-arg=cloud-provider=external (same flags as reverted #1513). - New write_files entry /var/lib/catalyst/hcloud-ccm-bootstrap.yaml containing Secret kube-system/hcloud (token + network keys), ServiceAccount, ClusterRoleBinding, and Deployment with full toleration set (uninitialized + CriticalAddonsOnly + control-plane + master + not-ready). 
Image pulled via harbor.openova.io proxy-cache of hetznercloud/hcloud-cloud-controller-manager:v1.20.0 (mirrors platform/hcloud-ccm/chart/Chart.yaml appVersion pin, per MIRROR-EVERYTHING rule). - New runcmd steps inserted AFTER the local-path StorageClass setup and BEFORE the kubeconfig postback: kubectl apply the manifest, then poll node.spec.providerID for up to 300s waiting for hcloud:// prefix. On timeout, dump CCM pod + logs and exit 1. - cloudinit-worker.tftpl: - Add --kubelet-arg=cloud-provider=external to agent install. Workers join the cluster after the primary CP's CCM is up; worker kubelet will wait for the same external CCM to set its providerID. Secondary regions (local.secondary_region_cloud_init in main.tf) call the SAME cloudinit-control-plane.tftpl, so the fix inherits to every secondary CP automatically. No main.tf changes needed — hcloud_token and hcloud_network_name were already threaded into both primary and secondary templatefile() calls. DoD impact: unblocks D5 (/cloud 3-regions), D10 (Cilium peer entries), D11 (inter-region pod-to-pod via WG), D12 (LB external IPs no longer <pending>). After this lands plus a fresh prov, those four DoD gates flip green; expected 13-14/14 on next t118 cycle. Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md, session_2026_05_16_t117_dod_partial.md Reverts: tail of PR #1513 left the worker tftpl untouched, but #1514's revert restored it to no-flag state. This PR re-applies the flag intent correctly because the CCM is now present at the moment kubelet starts. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 13:06:49 +04:00
e3mrah	f30a49fba5	Revert "fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513 )" (#1514 ) This reverts commit `7f0de7fa82`. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-16 12:12:38 +04:00
e3mrah	7f0de7fa82	fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513 ) DoD gate D12-LB-allocation root cause: k3s registers nodes with providerID=k3s://<hostname> instead of hcloud://<server-id>. hcloud-ccm rejects every LB allocation: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the expected prefixes (hcloud://, hrobot://, hcloud://bm-): k3s://catalyst-t115-omani-works-nbg1-1-cp1 This blocked clustermesh-apiserver Service from getting an external IP on every secondary region → AutoEstablishClusterMesh (PR #1508) couldn't write peer entries → D10/D11 fail. Caught on t115.omani.works (577be15281be2587, 2026-05-16) after PR #1509 flipped clustermesh-apiserver Service to LoadBalancer. The NodePort default in the old chart masked this k3s-vs-hcloud-ccm incompatibility until the LoadBalancer flip exposed it. Fix (k3s server install line in cloudinit-control-plane.tftpl): + --disable-cloud-controller + --kubelet-arg=cloud-provider=external Fix (k3s agent install line in cloudinit-worker.tftpl): + --kubelet-arg=cloud-provider=external The k3s server flag tells the embedded cloud controller to stay out. The kubelet flag tells kubelet to wait for an external CCM to set providerID. hcloud-ccm (bootstrap-kit slot 36) then matches each node to its Hetzner server by name and sets providerID=hcloud://<id>, unblocking LB allocation, Volume CSI, and node-external-ip. The node is briefly tainted node.cloudprovider.kubernetes.io/uninitialized=NoSchedule until the CCM removes it — Flux's bootstrap-kit Kustomization tolerates this taint via SOPs. Co-authored-by: claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 11:34:25 +04:00
e3mrah	dc590855a1	fix(tofu): per-region cloud-init renders with secondary's own values, not primary's (#1512 ) * fix(tofu): per-region cloud-init renders with secondary's own values, not primary's Root cause: cloudinit-control-plane.tftpl hardcoded the literal `openova.io/region=hz-fsn-rtz-prod` on the k3s install line. Every CP node — primary AND every secondary — labeled itself with that fixed string regardless of the cluster's real region. The template variables `region` and `sovereign_region_key` were already wired per-region in main.tf, but this one node-label flag was written as a constant. Concrete impact on prov t114.omani.works (a1448e0b9e471f5d, 2026-05-16): - Primary cluster (hel1) k3s nodes carried `hz-fsn-rtz-prod` even though Sovereign primary = hel1. qa-fixtures Pods targeted `openova.io/region in [hz-fsn-rtz-prod]` and silently landed on the wrong-named nodes — the scheduler accepted but the cluster name didn't match the label, breaking the OpenovaFlow canvas's per-region grouping and any downstream selector reading the label. - Secondary clusters (nbg1, sin) carried the same hardcoded label so their k3s nodes never reported their own region, again breaking the canvas (D13) and the Continuum DR region awareness. - clusters/_template/bootstrap-kit/01-cilium.yaml further masked the bug with a `${HCLOUD_LB_LOCATION:=hel1}` default fallback on the clustermesh-apiserver Service annotation — for a Sovereign with primary=hel1 the fallback APPEARED correct but silently masked any rendering failure path where the substitute might be missing. Fix shape: 1. Introduce locals.region_canonical_label in main.tf, keyed by region key ("primary" + every secondary key). Each value is computed as `hz-<region-prefix-no-digits>-rtz-prod` per NAMING-CONVENTION §2.1. 2. Thread `region_canonical_label` into BOTH the primary CP templatefile() call (from locals.region_canonical_label["primary"]) and the secondary CP templatefile() call (from locals.region_canonical_label[k]). 
3. Replace the hardcoded literal in cloudinit-control-plane.tftpl line 1364 with `${region_canonical_label}` — each CP now labels its k3s node with ITS OWN canonical region tag. 4. Thread `QA_PRIMARY_REGION` substitute into the bootstrap-kit Kustomization's postBuild.substitute block so the chart's qaFixtures.primaryRegion seam (`${QA_PRIMARY_REGION:-hz-fsn-rtz-prod}`) is set to the Sovereign-wide primary region label, never the hardcoded `hz-fsn-rtz-prod` chart default. Identical value on every cluster's bootstrap-kit because qaFixtures.primaryRegion is Sovereign-wide singular. 5. Remove the `${HCLOUD_LB_LOCATION:=hel1}` fallback default in 01-cilium.yaml — the cloud-init substitute ALWAYS provides a value, so a missing substitute is a tofu rendering bug that should surface at chart admission, not silently render hel1. Provider-agnostic per DoD A6: the `hz` prefix is correct only because this file lives under infra/hetzner/; future infra/aws/ and infra/huawei/ modules will derive `aw` / `hw` in their own per-module locals using the same pattern. DoD impact unblocked: - D10 (cilium clustermesh peer entries): clustermesh-apiserver Service now annotates the correct region for hcloud-ccm LB allocation on every peer, not just primary=hel1. - D12 (clustermesh LB external IP allocated): no longer pending on non-hel1 primary or any secondary because the location annotation now reflects each peer's real region. - D13 (canvas per-region bubble grouping): k3s nodes report their actual region label so FlowNode.region values differentiate across clusters. Tests added (infra/hetzner/tests/multi_region.tftest.hcl, run "per_region_cloud_init_carries_secondarys_own_region"): - SOVEREIGN_REGION_KEY / HCLOUD_LB_LOCATION render per-region (regression test for the templatefile contract). - openova.io/region= node-label is the per-region canonical label (`hz-nbg-rtz-prod` on nbg1-1, `hz-sin-rtz-prod` on sin-2, `hz-hel-rtz-prod` on primary hel1). 
- QA_PRIMARY_REGION substitute carries the Sovereign's primary region label on every cluster's bootstrap-kit substitute. - Negative assertions catch any regression that re-introduces `hz-fsn-rtz-prod` on a non-fsn1 Sovereign. Test result: 7 passed, 2 pre-existing failures (qa_mode SKU override tests — unrelated, present on origin/main, separate contract from Fix #183 body-first coalesce). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(tofu): align qa_mode SKU tests with Fix #183 body-first coalesce contract Pre-existing test failures on origin/main since Fix #183 (PR #1386, 2026-05-11) inverted the coalesce direction in `local.effective_cp_size = local.qa_mode ? coalesce(var.control_plane_size, var.qa_control_plane_size) : var.control_plane_size`. The pre-Fix-#183 tests asserted that qa_control_plane_size wins when qa_fixtures_enabled='true', but the new contract is the OPPOSITE: body wins (variables.tf default `cpx22` for control_plane_size is non-empty so coalesce always picks it first; qa-default only activates when the body is empty, which provisioner.go achieves by CONDITIONALLY omitting the var in writeTfvars when the operator's body has no override — see provisioner.go:1280-1289). Inside tofu test we can't conditionally omit a variable, so the variables.tf default ALWAYS wins. Updated assertions: - qa_mode_on_flips_to_bigger_skus → asserts variables.tf default `cpx22` wins (the auto-flip is exercised at the provisioner-side boundary, not tofu-side). - qa_mode_on_respects_explicit_overrides → asserts the body-first behavior when only qa_control_plane_size is set (no control_plane_size override). - NEW qa_mode_on_body_overrides_win → asserts the operator's explicit control_plane_size/worker_size wins verbatim — the canonical "body wins" lane Fix #183 codified. Tests result: 10 passed, 0 failed (was 7 passed, 2 failed on origin/main since Fix #183). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 10:57:48 +04:00
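The canonical label rule from #1512, `hz-<region-prefix-no-digits>-rtz-prod`, can be illustrated with a short function. This is a sketch rather than the locals expression in main.tf: `region_canonical_label` as a Python function and the "leading run of lowercase letters" interpretation of "region prefix without digits" are assumptions, chosen because they reproduce the label values listed in the commit's test notes.

```python
import re


def region_canonical_label(region_key: str, provider_prefix: str = "hz") -> str:
    """Derive the node label value from a region key such as hel1, nbg1-1 or sin-2."""
    match = re.match(r"[a-z]+", region_key)
    if match is None:
        raise ValueError(f"region key has no alphabetic prefix: {region_key!r}")
    # Only the leading letters survive; digits and any "-N" suffix are dropped.
    return f"{provider_prefix}-{match.group(0)}-rtz-prod"


if __name__ == "__main__":
    for key in ("hel1", "nbg1-1", "sin-2"):
        print(key, "->", region_canonical_label(key))
```

The `provider_prefix` parameter reflects the commit's note that other provider modules would derive their own prefix with the same pattern.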
e3mrah	0c9e391d59	fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile (#1511 ) * fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh apiserver Service MUST be LoadBalancer (NEVER NodePort). Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted ${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign landed with clustermesh-apiserver as NodePort, in direct violation of A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go, PR #1508) which hard-fails on Service.type != LoadBalancer. Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): - 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True - clustermesh-apiserver Service = NodePort on all 3 regions - cilium-clustermesh peer Secret empty (0 peers) — orchestrator never wrote them because of the type-check - D10 + D12 both failed silently Fix flips the chart default to LoadBalancer and threads Hetzner CCM LB annotations (location, type, name) from the bootstrap-kit substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE + HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init postBuild substitute block alongside the existing CLUSTER_MESH_NAME + CLUSTER_MESH_ID. Operator escape hatch preserved: bare-metal / non-cloud Sovereigns override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign bootstrap-kit overlay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile PR #1509 added ${sovereign_fqdn_slug} reference to cloudinit-control-plane.tftpl (for the Hetzner CCM LB name annotation on clustermesh-apiserver) and wired it into the PRIMARY templatefile() invocation in main.tf, but missed the SECONDARY-regions templatefile() at line ~990. 
Every multi-region prov now fails at `tofu plan`: Invalid value for "vars" parameter: vars map does not contain key "sovereign_fqdn_slug", referenced at ./cloudinit-control-plane.tftpl:991,37-56. Caught on prov t113.omani.works (82c3587b97156a08, 2026-05-15) — first multi-region prov against #1509's chart fix. Phase-0 failed at plan before any servers spun up. Fix is trivial: thread the same replace(var.sovereign_fqdn, ".", "-") through the for_each secondary block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 00:00:19 +04:00
e3mrah	5f8ba85dc5	fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) (#1509 ) DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh apiserver Service MUST be LoadBalancer (NEVER NodePort). Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted ${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign landed with clustermesh-apiserver as NodePort, in direct violation of A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go, PR #1508) which hard-fails on Service.type != LoadBalancer. Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): - 3 cpx52 region cluster (hel1+nbg1+sin) converged HRs Ready=True - clustermesh-apiserver Service = NodePort on all 3 regions - cilium-clustermesh peer Secret empty (0 peers) — orchestrator never wrote them because of the type-check - D10 + D12 both failed silently Fix flips the chart default to LoadBalancer and threads Hetzner CCM LB annotations (location, type, name) from the bootstrap-kit substitute env. provisioner now emits CLUSTERMESH_SERVICE_TYPE + HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init postBuild substitute block alongside the existing CLUSTER_MESH_NAME + CLUSTER_MESH_ID. Operator escape hatch preserved: bare-metal / non-cloud Sovereigns override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign bootstrap-kit overlay. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 23:40:04 +04:00
e3mrah	93f699326a	infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net (#1507 ) * docs(sovereign): pin multi-region DoD contract — never divert from D1-D14 Founder ruling 2026-05-15: every silent compromise from the multi-region target-state architecture is a quality violation. This file locks the convergence contract so future Claude sessions cannot drift. Architecture invariants A1-A6: - 3 regions minimum (never drop to 2 to dodge provider capacity) - Inter-region link = DMZ WireGuard over PUBLIC IPs, ALWAYS (no hcloud_network cross-region, no VPC peering, no Huawei VPC) - Cilium ClusterMesh apiserver = LoadBalancer (NEVER NodePort) - vCluster topology: primary = MGMT+DMZ, secondary = DMZ+RTZ - Zero public exposure of K8s control-plane endpoints - Provider-mix is canonical (assume 1 Hetzner + 1 AWS + 1 Huawei) DoD gates D1-D14 enforced via Playwright MCP + kubectl + cilium CLI on every fresh prov. No partial credit, no "deferred", no "matrix-drift". Mirrored to auto-memory at ~/.claude/projects/-home-openova-repos-openova-private/memory/sovereign_multiregion_dod.md so it loads at every session start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net Implements A1+A2+A6 from docs/SOVEREIGN-MULTI-REGION-DOD.md. Each region gets its own hcloud_network (10.0.0.0/16 INSIDE each, not shared across). Inter-region link is exclusively Cilium WireGuard over PUBLIC IPs through the DMZ — no provider's internal network ever spans regions. - Replaces hcloud_network.main + hcloud_network_subnet.{main,secondary} with hcloud_network.region[] + hcloud_network_subnet.region[] (for_each over toset(local.all_region_keys); primary key = "primary", secondary keys = slice-G1 "{cloudRegion}-{index}" shape). 
- Per-region cluster-cidr (10.42+i.0/16) + service-cidr (10.96+i.0/16) threaded through cloud-init so ClusterMesh peers don't collide on pod/service CIDRs (DoD gate D11). - Firewall: open UDP 51871 from 0.0.0.0/0 (Cilium WG inter-region encryption) — without this the WG mesh between regions cannot form. - Each CP's local private IP is now uniformly 10.0.1.2 per region (every region has its own /24 inside its own /16 — no cross-region IP collision class possible by construction). - Hetzner resource names threaded to cluster-autoscaler now use hcloud_network.region["primary"|<k>].name so autoscaler-spawned workers land in the same isolated /16 as their region's CP. - Pre-2026-05-15 state will plan a network-recreate on next apply; per DoD cycle protocol this is consciously accepted (no tofu state mv runbook, every wipe-and-create is a fresh provision). - tofu tests cover: per-region network count + uniform 10.0.0.0/16 + uniform 10.0.1.0/24 subnet + per-region cluster/service CIDRs + Cilium WG firewall rule existence. - README "Network" section adds the 3-region DMZ-WG ASCII topology. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tofu): apply tofu fmt — fixes CI fmt-check on PR #1507 Apply OpenTofu's canonical formatting to main.tf. No semantic changes; only whitespace alignment under template substitute blocks where my refactor added 2-char fields (`cluster_cidr` and `service_cidr`) that perturbed the prior column alignment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: claude <claude@anthropic.com>	2026-05-15 22:04:32 +04:00
e3mrah	3a19bb161f	fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml (#1503 ) * fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate on first event (no /refresh-watch needed). But the openova-flow snapshot composer (flow_snapshot_local.go) emits finish-to-start relationships where fromId = jobs.JobID(deploymentID, dep). Without the "install-" prefix on each dep entry, fromId came out as: <dep>:hel1-2:seaweedfs (secondary, missing "install-") <dep>:gitea (primary, missing "install-") But the FlowNode ids in the snapshot are: <dep>:install-hel1-2:seaweedfs <dep>:install-gitea The FE canvas adapter matches by exact id → every finish-to-start rel points at a non-existent node → 224 rels emitted, 0 edges rendered. Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15): curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start every finish-to-start fromId malformed canvas: sibling edges invisible across all 135 install Jobs Fix in two places: internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit): Region-prefix each dep AND inject the "install-" prefix so ev.DependsOn = ["install-<region>:<chart>"] before the bridge receives the event. Symmetric with how ev.Component is constructed. internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent): Canonicalise every dep entry: if it doesn't already start with JobNamePrefix ("install-"), prepend it. Idempotent on entries that already are canonical (set by the phase1_watch.go path). Covers the primary-region path (bare chart names like "gitea") too — Job.DependsOn now stores "install-gitea", which matches the composer's emitted FromId exactly. Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.) 
* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values Follow-up to PR #1500. The canon block ran on the event-carried dependsOn arg, but the 3-tier resolve preferred existing-store value when non-empty — which for any Job written BEFORE PR #1500 rolled out was malformed (no "install-" prefix). t103.omani.works snapshot kept emitting 224 finish-to-start rels with malformed fromIds because the existing Job rows held "hel1-2:gitea" entries that the resolve preserved verbatim. Fix: after the 3-tier resolve, run a final canonicalisation pass on resolvedDeps so every persisted entry is canonical regardless of whether it came from event-carried (already canon by my prior block) or from existing-store (potentially malformed legacy). Note: this fix only takes effect on the NEXT HR state transition for a given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs) will keep their malformed deps until a new event fires. The loop's next cycle (t104+) writes canonical from event 1. * fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator submitted a multi-region body (3 regions cpx52) but omitted ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0. Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux postBuild.substitute rendered cilium-config with cluster.name=default + cluster.id=0. Cilium kvstoremesh refused to start: "ClusterID 0 is reserved" clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed. Cross-region observability + east-west routing permanently broken. Auto-derivation: ClusterMeshName: <first-fqdn-label>-mesh e.g. t105.omani.works → "t105-mesh" ClusterMeshID: (sha256(deploymentID)[:4] as uint32) mod 252 + 1 Range [1, 252]; main.tf increments for secondaries so the max id any region sees is primary + (regions - 1) ≤ 254. ID 255 is intentionally avoided (Cilium sentinel). 
Operator override still respected — auto-derive only kicks in when both fields are zero/empty AND len(Regions) > 1. Single-region provs stay at "" / 0 (no mesh needed). Tested derive helpers against the last 4 prov IDs — all land in valid range: 98395b3d9bd9c1aa → 74 (secondaries 75, 76) 005080699326a7ac → 29 (secondaries 30, 31) 22af2b1120158239 → 139 c9df5eed1c1ba6cf → 180 Build + provisioner unit tests green. * fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99) correctly reached cilium-config — but only AFTER Flux helm-upgraded the release. The pre-Flux Cilium install (cloud-init line 1473) used /var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or cluster.id, so cilium-agent started with the chart defaults ("default", 0). The Flux upgrade then changed cilium-config but the already-running cilium-agent kept its in-memory cluster.name="default" because it reads ConfigMap once at startup. Downstream consequences observed live on t105: hubble-relay CrashLoopBackOff: "tls: failed to verify certificate: x509: certificate is valid for *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1 .default.hubble-grpc.cilium.io" clustermesh peer announcements use stale "default" identity → cross-region mesh handshakes x509-fail. Fix: include cluster.name + cluster.id in the pre-Flux helm install's values file, sourced from the templatefile() vars cluster_mesh_name + cluster_mesh_id (already threaded per-region by main.tf:381-382 and :900-901). Now the first cilium-agent process announces with the correct identity, no helm-upgrade race. --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-15 19:48:58 +04:00
e3mrah	1dc21bfd51	fix(cloud-init): accept Hetzner DHCP routes on private NIC (use-routes: true) (#1489 ) The netplan stanza for the hot-attached private NIC had `dhcp4-overrides.use-routes: false`, which discards Hetzner DHCP's classless static routes. Result: the interface gets `10.0.1.2/32` (host route only) with NO route for the 10.0.0.0/8 private network. The kernel routes all return traffic (including SYN-ACK to the Hetzner LB at 10.0.1.254) via eth0's default route — the public NIC. Hetzner LB's health check on private network gets the SYN forwarded, but the SYN-ACK arrives via the wrong NIC; Hetzner drops it as asymmetric. Target stays `unhealthy` forever on every service port. Caught live on prov 6dfade27 (omani.works, 2026-05-14): all 3 region LBs marked unhealthy on 53/80/443 — public surface blackholed despite 3-region × 45/45 HRs Ready + valid PROD cert + envoy listening on 0.0.0.0:30443. Confirmed via tcpdump on the host: enp7s0 In 10.0.1.254.X > 10.0.1.2:30443 [S] ← SYN arrives on private; eth0 Out 10.0.1.2:30443 > 10.0.1.254.X [S.] ← SYN-ACK on wrong NIC. Fix: change to `use-routes: true`. Hetzner DHCP-provided routes have higher metric than eth0's default (metric 100), so the public default stays intact; we only gain the per-subnet 10.0.0.0/N route needed for symmetric routing on the private NIC. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 22:52:01 +04:00
e3mrah	cebc9542d7	fix(cloudinit): escape ${WILDCARD_CERT_ISSUER} reference in comment so templatefile() doesn't try to interpolate it (#1485 ) OpenTofu's `templatefile()` parses `${...}` expressions everywhere in the template body — including comments. A comment on line 1072 of cloudinit-control-plane.tftpl referenced the Kustomization-time variable `${WILDCARD_CERT_ISSUER}` as documentation, but tofu reads it as a template var lookup → fails with `vars map does not contain key "WILDCARD_CERT_ISSUER"` → `tofu plan` exit 1. Fix: escape the documentation reference with `$${WILDCARD_CERT_ISSUER}` so it survives as literal text in the rendered file. The actual variable binding `WILDCARD_CERT_ISSUER: "${wildcard_cert_issuer}"` two lines below is unchanged (it correctly maps the lowercase tofu local to the uppercase Kustomization postBuild key). Caught live on prov #81 (omani.works), the first provision after #1481 landed the WILDCARD_CERT_ISSUER threading. omantel.biz had been provisioned BEFORE #1481 merged so it never exercised the new tftpl path. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 20:20:51 +04:00
e3mrah	a88e132be9	fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu (#1481 ) clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign- tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 18:25:45 +04:00
e3mrah	a75463f76a	fix(cloud-init): wait for private NIC before k3s install (prov #71 ) (#1464 ) * fix(flow_snapshot): region-scope dep edges (no cross-region wiring) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-init): wait for private NIC before k3s install (prov #71) Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks BEFORE the NIC is ready, renders netplan with only eth0, and the private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN. Effect on secondary CPs: k3s server starts with --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2 and fatals on "listen tcp 10.0.11.2:2380: bind: cannot assign requested address" then crashloops. 
Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service restart counter reached 5394, kubeconfig never PUT back to mothership, canvas showed secondary region as a permanent black hole. Diagnosed via Hetzner rescue mode SSH 2026-05-14. Primary CP works by luck of faster fsn1 zone NIC attach. Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for the expected private IP (control plane) or a route to it (worker). If the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true and `netplan apply`. Bail loudly if the IP/route never appears — failures surface in cloud-init.log instead of disguising as a slow boot. Symmetric fix in worker template covers autoscaler-spawned secondary workers when worker_count > 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 07:39:25 +04:00
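The region-scoping of dependency names described in the flow_snapshot part of the commit above (jobRegion "hel1-2" plus dep "install-cilium" becomes "<depID>:install-hel1-2/cilium") might look roughly like this. It is a sketch under assumptions: `regionScopeDep` and `jobNamePrefix` are illustrative names, the "/" separator follows this commit's wording (a later commit in this log writes region-prefixed ids with ":"), and the real composer in flow_snapshot_local.go may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" prefix named in the commit messages.
const jobNamePrefix = "install-"

// regionScopeDep turns a bare dependency name into a region-scoped job id.
// Hypothetical helper: an empty jobRegion means the primary region, whose
// job ids carry no region segment.
func regionScopeDep(depID, jobRegion, dep string) string {
	// Accept both "cilium" and "install-cilium" as input.
	chart := strings.TrimPrefix(dep, jobNamePrefix)
	// Add the region segment only when it is not already present, so that
	// applying the function twice does not double the region.
	if jobRegion != "" && !strings.HasPrefix(chart, jobRegion+"/") {
		chart = jobRegion + "/" + chart
	}
	return depID + ":" + jobNamePrefix + chart
}

func main() {
	fmt.Println(regionScopeDep("dep1", "hel1-2", "install-cilium"))
	fmt.Println(regionScopeDep("dep1", "", "install-cilium"))
}
```

Keeping the primary path unchanged when the region is empty means single-region snapshots are unaffected, while a secondary job can only ever point at a node in its own region.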
e3mrah	32e0b408bf	fix(k3s): add public IP --tls-san + openova.io/region node label (#1459 ) Two related fixes for multi-region + qa-fixtures DoD on prov #64: 1. k3s TLS cert needs the public IPv4 in SAN. Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s auto-generates the server cert with SANs from --tls-san flags. We only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2 + cluster-ip + 127.0.0.1 only. Bridge connection from contabo rejected with: "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1, ::1, not 204.168.212.113" → silent watcher failure → 0 secondary HRs observed → canvas missing region sub-groups. Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before k3s install, add it as --tls-san=$CP_PUBLIC_IPV4. 2. openova.io/region=hz-fsn-rtz-prod node label. qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs, qa-wp Application) carry hard nodeAffinity for `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion default in products/catalyst/chart/templates/qa-fixtures/*.yaml). Without the label every fixture pod FailedScheduling → bp-catalyst- platform post-install hook waits forever → bootstrap-kit chain hangs at 44/45 with bp-catalyst-platform Running. Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP (qa-fixtures pin to primary by design). Both shipped in same commit since both are inside the same k3s server install line. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:38:25 +04:00
e3mrah	44913d8a6a	fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458 ) prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook timed out because the catalyst-api Helm-released pod stayed Pending with "Too many pods. 0/1 nodes are available". k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/ flux/cnpg sidecars saturate the slot cleanly. With workers NotReady on prov #63 the CP carried everything alone and dropped scheduling at 110. Bump to 220 on both CP and worker so the saturation point doesn't gate the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU + 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit weight. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 18:37:42 +04:00
e3mrah	5f4f9f2cb5	fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457 ) prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects its node IP from the primary interface, which on Hetzner cpx52 binds to the public IPv4 (49.x.x.x) instead of the private network IP (10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there; nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the private IP from cilium-config k8sServiceHost — times out, CrashLoop. Worked by luck on cpx42 (earlier kernel + Hetzner network attach timing). cpx52 reproduces 100%. Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip} in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443 (cilium-config substitute) find the API server every time. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:30 +04:00
e3mrah	68372d700b	fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:11:23 +04:00
e3mrah	be47815ddf	fix(infra): pass cp_private_ip to primary CP templatefile too (#1447 ) PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:01:43 +04:00
e3mrah	cdcc50a213	fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446 ) Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3 "no stretched fault domain". Cilium on each region MUST talk to its OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites hardcoded the primary's IP: 1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665): `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2). 2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}` so each region's k3s API cert validates against the LOCAL CP's IP. 3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml): add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR values so Flux postBuild.substitute can override per region. The cloud-init Kustomization renders the substitute var to `${cp_private_ip}`. Single-region (primary-only) provisions fall back to the default `10.0.1.2` and stay byte-identical to today. Live evidence of the bug — prov #52 (3-region) on 2026-05-12: cilium-operator on nbg1 secondary: "Establishing connection to apiserver" host="https://10.0.1.2:6443" "failed to start: ... tls: failed to verify certificate: x509: certificate signed by unknown authority" Each region's k3s has its OWN self-signed CA (cluster-init per CP). The primary's API cert isn't signed by the secondary's CA → cilium crash- loops → no CNI → flux controllers Pending → no HRs → canvas shows only primary's HRs. This fix points each region's cilium at the LOCAL CP, whose API server presents the matching CA from this cluster. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:56:18 +04:00
e3mrah	19a847e514	fix(infra): restore \n escape in secondary CP templatefile regex (#1445 ) The conflict-resolution Python script in PR #1444 wrote a literal newline where the regex string needed the two-char "\n" escape. tofu init rejected with "Invalid multi-line string / Unterminated template string" on main.tf:925. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:10 +04:00
e3mrah	4923938c2b	feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444 ) Operator mandate (2026-05-12): the mothership canvas must surface install-* HRs from EVERY region of a multi-region provision, not just the primary CP's. Today catalyst-api stores ONE kubeconfig per deployment (the primary CP's) and spawns ONE helmwatch.Bridge against it. Result: secondary regions are invisible on the canvas even though their k3s clusters are fully reconciling. End-to-end change across infra + handler: 1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL appends `?region=<kubeconfig_postback_region>` when the var is set. main.tf templatefile call passes empty for primary CP, `each.key` (e.g. "nbg1-1", "hel1-2") for each secondary region. 2) PutKubeconfig handler: reads ?region= query param. Empty → primary path (unchanged: stores at <dir>/<id>.yaml, sets Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty → secondary path: stores at <dir>/<id>-<region>.yaml, populates Deployment.secondaryKubeconfigPaths[region]. Single-use guard is per-region (the same bearer secures every CP's PUT — secondaries reuse it for their own slot). NO Phase-1 watch re-launch from a secondary PUT. 3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s, spawns one helmwatch.NewWatcher per kubeconfig discovered, stores the Watcher on Deployment.secondaryWatchers[region]. Per-region watchers emit ordinary helmwatch events with region-prefixed Component names so the wizard's per-component view doesn't collide primary vs secondary bp-cilium events. They do NOT contribute to markPhase1Done — outcome remains the primary's classification. 4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group bubbles + install-* nodes from each secondary watcher's SnapshotComponents. Node id: <depID>:<region>:install-<chart>. 
FlowNode.region set so the canvas can colour-group. Intra-region finish-to-start deps emitted from cs.DependsOn — same-region only, never cross-region (per NAMING-CONVENTION §1.3 independent fault domains, no stretched cluster). 5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary kubeconfig file on Sovereign wipe. Storage model is uniform across SME and corporate Sovereigns. No hardcoding of provider, region count, or building block. Caught after operator pointed out that 3-region prov #50 was showing only 52 install-* nodes (all from fsn1) on the canvas — the architectural gap. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:12:38 +04:00
e3mrah	c5d891ad0b	fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443 ) The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name / hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster autoscaler could attach scale-up VMs to the private network. The primary CP's templatefile call at main.tf:483-485 was updated, but the matching call for secondary regions at main.tf:899 was missed. Result: any provision with regions[] of length > 1 fails at tofu plan with "vars map does not contain key hcloud_network_name" referenced in cloudinit-control-plane.tftpl:478. Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash) at T+0:47. Forward the same three resource refs to every secondary region's templatefile call. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:23:53 +04:00
e3mrah	b743b646ac	fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427 ) Root cause (autoscaler pod log, prov #43 chroot): W orchestrator.go:626 Node group workers is not ready for scaleup - backoff with status: Scale-up timed out for node group workers after 15m2.273255226s Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY: workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[] workers-a6410e81b24cced public_net.ipv4=178.105.73.210 private_net=[] The worker cloud-init (identical to Phase-0 user_data) issues curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh - against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment that URL is unreachable → k3s agent install silent-fails → node never registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst- platform Pending Pods never schedulable → chroot canvas tests blocked. Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on the cluster-autoscaler deployment so the Hetzner provider attaches every scale-up VM to the SAME private network + firewall + ssh-key the Phase-0 Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net / -fw / catalyst-<sov-fqdn-with-dashes>). Names flow: Tofu (hcloud_network.main.name + hcloud_firewall.main.name + hcloud_ssh_key.main.name) → cloudinit-control-plane.tftpl (3 new template vars) → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys) → flux-system/cloud-credentials Secret → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*) → upstream chart's deployment env Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent regression of the three env-var slots in chart values.yaml. Reaffirms canonical seam: values flow through Tofu → cloud-init → flux-system Secret → Flux valuesFrom → chart values → upstream env. 
Never via kubectl patch, never via bespoke Go API calls. Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 06:11:30 +04:00
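The naming pattern in b743b646ac (`catalyst-<sov-fqdn-with-dashes>-net` / `-fw` / bare) and the three env-var slots can be sketched as follows. This is an illustration under assumptions: the type and function names are hypothetical, and the sketch assumes "with dashes" means only that dots become dashes, which the commit message does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// HcloudNames holds the three resource names the autoscaler needs.
// The type is hypothetical; only the naming pattern comes from the commit.
type HcloudNames struct {
	Network  string
	Firewall string
	SSHKey   string
}

// namesForSovereign derives the names from a Sovereign FQDN.
// Assumption: dots map to dashes and no other normalisation is applied.
func namesForSovereign(fqdn string) HcloudNames {
	base := "catalyst-" + strings.ReplaceAll(fqdn, ".", "-")
	return HcloudNames{
		Network:  base + "-net",
		Firewall: base + "-fw",
		SSHKey:   base,
	}
}

// extraEnv maps the names onto the env vars the commit wires into the
// cluster-autoscaler deployment.
func extraEnv(n HcloudNames) map[string]string {
	return map[string]string{
		"HCLOUD_NETWORK":  n.Network,
		"HCLOUD_FIREWALL": n.Firewall,
		"HCLOUD_SSH_KEY":  n.SSHKey,
	}
}

func main() {
	env := extraEnv(namesForSovereign("omantel.biz"))
	for _, k := range []string{"HCLOUD_NETWORK", "HCLOUD_FIREWALL", "HCLOUD_SSH_KEY"} {
		fmt.Printf("%s=%s\n", k, env[k])
	}
}
```

In the actual change these values are not computed in Go; per the commit they travel Tofu → cloud-init → Secret → Flux valuesFrom → chart env. The sketch only shows what ends up in each slot.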
e3mrah	22855e62d8	feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396) Final integration piece for OpenovaFlow infrastructure path — catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits distinct region tags on every FlowNode and the snapshot returns 2× per HR on a multi-region Sovereign. Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-ui temporary revert until npm workspaces land), PR #1395 (chart no-op). ## Scope vs original Agent #3 brief The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire + runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred: PR #1394 reverted Agent #1's UI wiring because the Docker UI build has no node_modules for the cross-workspace canvas source. Founder note on #1394: "Agent #3 (or a follow-up) will re-wire them properly once npm workspaces are configured at repo root." This PR ships the infrastructure half (proxy + cloud-init + runbook). The canvas-side rewire is a separate follow-up PR that needs npm workspaces, not surgical edits to FlowPage. ## What ships ### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events} products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go: - GET /snapshot — JSON pass-through, headers + status forwarded - GET /stream — unbuffered SSE pass-through using http.Flusher (NOT httputil.ReverseProxy; that buffers and breaks text/event-stream) - POST /events — body forwarded byte-for-byte - Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign in-cluster Service DNS) Routes registered in cmd/api/main.go inside the auth-gated chi.Group. 
11 table-driven tests cover snapshot/events/stream pass-through, upstream 404/400/unreachable propagation, empty-deploymentId guard, SSE frames arrive AS EMITTED, and env-default fallback. ### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY - infra/hetzner/cloudinit-control-plane.tftpl — two new postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP - infra/hetzner/main.tf — primary CP renders var.region as region key; secondary CP renders each.key (e.g. "hel1-1") from for_each over local.secondary_regions - infra/hetzner/variables.tf — new sovereign_deployment_id var (string, default "" for tofu mocks) - provisioner.go writeTfvars — writes vars["sovereign_deployment_id"] = req.DeploymentID - bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY} envsubst keys ### 3. Deployment record flag handler/deployments.go State() — emits `openovaFlowEnabled: true` on every deployment. The catalyst-ui rewire (follow-up PR) will read this to enable the openova-flow-server adapter; legacy provisions without the flag will keep the bridge once the rewire lands. ### 4. Verification runbook docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body (multi-region cpx42 fsn1+hel1, qaTestEnabled=true, sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual canvas checks (gated on the follow-up UI rewire), and a failure-class triage table. ## Canonical-seam citations 1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/deployments.go:1244-1287 (StreamLogs): identical Content-Type + Cache-Control + X-Accel-Buffering header set; identical http.Flusher.Flush() after each write; identical r.Context().Done() cancel path. 2. 
postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893 (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var} form, dual emission at primary + secondary CP for_each in main.tf. ## Verification ``` $ go build ./... (clean) $ go vet ./... (clean) $ go test ./internal/handler/ -run TestFlowProxy -count=1 -race ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler 1.410s $ go test ./internal/provisioner/... -count=1 ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner 0.025s ``` 3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields, TestHandleWhoami_PinSessionRBACClaims, TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on main HEAD without this PR — unrelated baseline state. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:01:09 +04:00
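The unbuffered SSE pass-through described in 22855e62d8 (flush after each write, cancel via the request context) can be sketched with the standard library alone. This is a minimal illustration, not the contents of openova_flow_proxy.go: `proxySSE` and `fetchThroughProxy` are hypothetical names, and the header values `no-cache` and `no` are assumptions, since the commit names the headers but not their values.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// proxySSE copies an upstream event stream to the client and flushes after
// every chunk so frames are forwarded as they arrive instead of being buffered.
func proxySSE(upstreamURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		// Binding the upstream request to r.Context() cancels it when the
		// client disconnects.
		req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, upstreamURL, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, "upstream unreachable", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// Header values are assumptions for this sketch.
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		w.Header().Set("X-Accel-Buffering", "no")
		w.WriteHeader(resp.StatusCode)

		buf := make([]byte, 4096)
		for {
			n, readErr := resp.Body.Read(buf)
			if n > 0 {
				if _, err := w.Write(buf[:n]); err != nil {
					return // client went away
				}
				flusher.Flush()
			}
			if readErr != nil {
				return // io.EOF or context cancellation
			}
		}
	}
}

// fetchThroughProxy wires a fake upstream behind the proxy and returns what
// a client receives. It exists only to exercise the sketch.
func fetchThroughProxy() string {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		fmt.Fprint(w, "data: hello\n\n")
	}))
	defer upstream.Close()

	proxy := httptest.NewServer(proxySSE(upstream.URL))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	fmt.Printf("%q\n", fetchThroughProxy())
}
```

The manual read/write/flush loop is the point of the design: each upstream chunk is pushed to the client immediately, which is the property the commit says a buffering reverse proxy would break.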
e3mrah	4e6bec7022	fix(infra): body-supplied SKUs win over QA defaults (Fix #183) (#1386) * fix(catalyst-ui): delete malformed `import type from react` line (Fix #181) Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`. Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers). Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision — this PR resolves both. 
Multi-region apply now lands cleanly: 10.0.1.2 -> primary CP (cp1) 10.0.1.254 -> primary LB anchor 10.0.10.2 -> secondary CP (hel1-1) 10.0.10.254 -> secondary LB anchor (hel1-1) Refs: openova-private prov-loop session 2026-05-11 Wave 26 * fix(infra): body-supplied SKUs win over QA defaults (Fix #183) Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size, var.control_plane_size)` when qa_fixtures_enabled='true'. Because qa_control_plane_size has a non-empty default (cpx32), coalesce always returned the QA default and silently overrode whatever the body supplied in `controlPlaneSize`. Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"` explicitly (cheapest viable for the founder's collapsed-CP+worker single-node-per-region topology with workerCount=0). The QA-default override downgraded that to cpx32 at plan time — the explicit choice never made it onto the hardware. Fix #183 — invert the coalesce so body wins: effective_cp_size = local.qa_mode ? coalesce(var.control_plane_size, var.qa_control_plane_size) : var.control_plane_size `provisioner.go` writeTfvars already emits control_plane_size / worker_size only when the body's field is non-empty (so `var.control_plane_size` inherits variables.tf's cost-optimised default when the body left it blank). That means `coalesce(var.control_plane_size, var.qa_*)` always has a non-empty first arg in normal flow; the QA-default fallback only fires on a zero-override QA call that intentionally leaves the SKU empty. No change to customer-Sovereign behaviour (qa_fixtures_enabled='false' branch already used `var.control_plane_size` verbatim). Refs: openova-private prov-loop session 2026-05-11 Wave 26 --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:04:41 +04:00
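The coalesce inversion in Fix #183 can be traced with a small model. This is a hedged sketch in Go rather than HCL: `coalesce` here imitates how the Tofu function skips empty strings (the real function raises an error when every argument is empty, whereas this one returns ""), and `effectiveBefore` / `effectiveAfter` are hypothetical names for the pre-fix and post-fix expressions.

```go
package main

import "fmt"

// coalesce returns the first non-empty string, or "" if all are empty.
func coalesce(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// effectiveBefore models the Fix #157 expression: in QA mode the QA default
// is listed first, so a non-empty QA default always wins.
func effectiveBefore(qaMode bool, body, qaDefault string) string {
	if qaMode {
		return coalesce(qaDefault, body)
	}
	return body
}

// effectiveAfter models the Fix #183 expression: the body-supplied size is
// listed first, so the QA default only applies when the body left it empty.
func effectiveAfter(qaMode bool, body, qaDefault string) string {
	if qaMode {
		return coalesce(body, qaDefault)
	}
	return body
}

func main() {
	// Values from the commit message: body asks for cpx42, QA default is cpx32.
	fmt.Println("before:", effectiveBefore(true, "cpx42", "cpx32"))
	fmt.Println("after: ", effectiveAfter(true, "cpx42", "cpx32"))
	fmt.Println("after, empty body:", effectiveAfter(true, "", "cpx32"))
}
```

With the prov #32 inputs, the pre-fix ordering yields cpx32 and the post-fix ordering yields cpx42; the QA default is only reached when the body value is empty.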
e3mrah	515c3cf38d	fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) (#1385) * fix(catalyst-ui): delete malformed `import type from react` line (Fix #181) Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`. Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers). Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision — this PR resolves both. 
Multi-region apply now lands cleanly: 10.0.1.2 -> primary CP (cp1) 10.0.1.254 -> primary LB anchor 10.0.10.2 -> secondary CP (hel1-1) 10.0.10.254 -> secondary LB anchor (hel1-1) Refs: openova-private prov-loop session 2026-05-11 Wave 26 --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:00:50 +04:00

122 Commits