122 Commits

Author SHA1 Message Date
e3mrah
2050e72c69
fix(infra): refactor L3 ExternalIP reconciler to write_files + bump CP guardrail to 32256 (Closes #1981, Refs #1979 #1941) (#1985)
PR #1979 (TBD-A50 layer 3, merged 18:00Z 2026-05-19) added the
idempotent ExternalIP reconciler as inline runcmd heredocs and bumped
the rendered cloud-init guardrail from 30720 to 31744. The ~3 KiB of
inline bash + systemd unit heredocs overshot the new headroom: t36
fresh-prov tofu plan FAILED with rendered control-plane cloud-init
at ~32498 B vs the 31744 B guardrail (754 B over). Issue #1981.

This PR repackages PR #1979 using the PR #1978 pattern that fixed the
analogous #1977 / TBD-A52 incident:

- Adds an `l3` subcommand to /usr/local/bin/openova-externalip-bootstrap.sh
  (the same write_files script that hosts `l1` + `l2`). Same reconciler
  logic — read /etc/openova/cp-public-ipv4, compare to Node ExternalIP,
  restart k3s on mismatch, log to /var/log/openova-externalip.log.
- Adds two new write_files entries for the systemd .service + .timer
  unit files (replaces the 3× cat-heredoc runcmd block).
- The runcmd L3 step collapses from 77 lines of inline heredocs to
  a single command line: `systemctl daemon-reload && systemctl enable --now
  openova-extip-reconcile.timer`.
- Bumps the CP cloud-init guardrail from 31744 to 32256 (Hetzner hard
  cap 32768 minus 512 B safety buffer), applied to both primary +
  secondary CP preconditions in main.tf. The +512 B headroom buys
  room for the next legitimate addition without re-tripping the gate.

## Behavior

Behavior identical to PR #1979 — same reconciler script, same exit
codes (0=ok, 2=no-file, 3=apiserver-unreachable, 4=unrecovered), same
systemd .service `SuccessExitStatus=0 2 3 4`, same .timer `OnBootSec=2min
/ OnUnitActiveSec=5min`. Diagnostic strings trimmed (~150 B saved) but
key tokens preserved (`OK`, `MISMATCH`, `RECOVERED`, `FATAL nofile`,
`FATAL apiserver`, `FATAL unrec`, `#1941` reference).
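
A minimal sketch of the reconcile flow and its exit codes, assuming two caller-supplied stand-ins that are not in the original script: `get_external_ip` (prints the Node ExternalIP, returns non-zero when the apiserver is unreachable) and `restart_k3s`. The function name, its arguments, and the choice to treat an empty file as missing are illustrative assumptions; only the exit codes and log tokens come from the text above.

```shell
#!/bin/sh
# Hypothetical sketch, not the shipped script.
# get_external_ip and restart_k3s must be defined by the caller.

reconcile_external_ip() {
  ip_file=$1    # file holding the expected public IPv4
  log_file=$2   # append-only run log

  # Assumption: an empty file is treated the same as a missing one.
  if [ ! -s "$ip_file" ]; then
    echo "FATAL nofile" >> "$log_file"
    return 2
  fi
  expected=$(cat "$ip_file")

  if ! actual=$(get_external_ip); then
    echo "FATAL apiserver" >> "$log_file"
    return 3
  fi
  if [ "$actual" = "$expected" ]; then
    echo "OK $expected" >> "$log_file"
    return 0
  fi

  echo "MISMATCH want=$expected got=$actual" >> "$log_file"
  restart_k3s
  if actual=$(get_external_ip) && [ "$actual" = "$expected" ]; then
    echo "RECOVERED $expected" >> "$log_file"
    return 0
  fi
  echo "FATAL unrec" >> "$log_file"
  return 4
}
```

Keeping the node lookup and the restart behind stand-ins is what makes a mock battery like the one under Validation possible without touching a cluster.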

## Validation (Principle #15)

- `tofu validate infra/hetzner/` → Success
- Templatefile() measurement harness (`/tmp/measure-cloudinit/`,
  same fixture PR #1978 used):
    - pre-fix rendered: 31865 B (over fixture 30720 by 1145 B)
    - post-fix rendered: 31130 B (under new 32256 guardrail with
      1126 B headroom)
    - savings: ~735 B vs PR #1979 baseline
- Production headroom (after +633 B fixture↔prod variance offset):
  estimated 31763 B in prod, 493 B headroom under new 32256 guardrail.
- `shellcheck` on rendered bootstrap script: clean (only one pre-
  existing SC2034 for loop counter `i`, present before this PR).
- Mock test 3-case battery (matching/missing-file/mismatch-recovers):
  rc=0/2/0 with expected log tokens.
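
The size arithmetic above can be rechecked with a few lines of shell. This is only a worked restatement of the figures quoted in this message; the `headroom` helper and variable names are illustrative.

```shell
#!/bin/sh
# headroom GUARDRAIL RENDERED: bytes left under the guardrail
headroom() {
  echo $(( $1 - $2 ))
}

hard_cap=32768                                    # Hetzner hard cap
guardrail=$(( $hard_cap - 512 ))                  # 32256
fixture_post=31130                                # post-fix fixture render
prod_offset=633                                   # fixture-to-prod variance
prod_estimate=$(( $fixture_post + $prod_offset )) # 31763

echo "guardrail:        $guardrail"
echo "fixture headroom: $(headroom "$guardrail" "$fixture_post")"   # 1126
echo "prod estimate:    $prod_estimate"
echo "prod headroom:    $(headroom "$guardrail" "$prod_estimate")"  # 493
```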

## Hard rules

- `Closes #1981` because acceptance is code-level (size proof + tofu
  validate). The functional Refs #1941 closure still depends on fresh-
  prov walk demonstrating timer fires + log accumulates.
- READ-ONLY on cluster. No Secrets touched. No emrah.baysal email
  / Stalwart admin API touched.

Refs #1941, #1979, #1978, #1977, #1958, #966.

Co-authored-by: hatiyildiz <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:58:57 +04:00
e3mrah
b96d731fcd
fix(infra): idempotent ExternalIP reconciler (TBD-A50 layer 3, Refs #1941) (#1979)
Layer 3 of the three-layer Hetzner ExternalIP guard. Layers 1 (fail-fast on
empty metadata curl) + 2 (post-install ExternalIP assertion) shipped in
PR #1958; this PR adds the periodic reconciler so a node that somehow loses
its ExternalIP post-boot (operator-initiated k3s restart without the env var,
kubelet flag drift after an in-place upgrade, cloud-init partial-replay) can
recover WITHOUT a re-provision.

## What lands

A new runcmd item in cloudinit-control-plane.tftpl writes three files on
first boot via heredocs:

- `/usr/local/bin/openova-extip-reconcile.sh` — script that reads
  `/etc/openova/cp-public-ipv4` (persisted by Layer 1), compares against
  `kubectl get node $hostname -o jsonpath=...ExternalIP`, restarts k3s on
  mismatch, re-verifies, appends every run to `/var/log/openova-externalip.log`
- `/etc/systemd/system/openova-extip-reconcile.service` — `Type=oneshot`,
  `SuccessExitStatus=0 2 3 4` so the timer doesn't back off on diagnostic
  exit codes
- `/etc/systemd/system/openova-extip-reconcile.timer` — `OnBootSec=2min`,
  `OnUnitActiveSec=5min`, `AccuracySec=30s`

The runcmd ends with `systemctl daemon-reload && systemctl enable --now`.

Recovery path is INDEPENDENT of cloud-init: an operator can manually
`printf '%s' <ip> > /etc/openova/cp-public-ipv4` and the next timer fire
reconciles. No external dependency — pure systemd unit.

## Size guardrail

The 30720-byte rendered cloud-init guardrail (issue #966) on the primary
+ secondary CP `hcloud_server` resources is bumped to 31744 to absorb the
~2 KiB Layer 3 payload (still 1 KiB under the Hetzner hard 32768 cap).
Worker variants stay at 30720 — cloudinit-worker.tftpl is untouched.

## Validation

- `tofu validate infra/hetzner/` → Success (Principle #15)
- `shellcheck` on the rendered script body → 0 warnings
- Mock-test of all branches (matching IP no-op; empty IP recovers via
  restart; missing expected-file exit 2) → 3/3 pass

## Hard rule

Refs #1941 not Closes. Closure requires the fresh 3-region prov walk +
in-cluster verification of the timer firing (`systemctl status
openova-extip-reconcile.timer`) and the log file accumulating entries
(`tail /var/log/openova-externalip.log`).

Refs #1941

Co-authored-by: hatiyildiz <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:00:51 +04:00
e3mrah
6b428b1304
fix(infra): move Layer 1+2 bash to write_files to fit cloud-init under 30720 (Closes #1977, Refs #1958, #1941) (#1978)
PR #1958 (TBD-A50, merged 14:45Z 2026-05-19) inlined Layer 1 (fail-fast
on empty Hetzner public-ipv4) and Layer 2 (post-install ExternalIP
assertion) as runcmd: heredocs in cloudinit-control-plane.tftpl. The
combined ~2.6 KB of bash pushed the rendered control-plane cloud-init
PAST the 30 720 B Hetzner guardrail enforced by the precondition at
infra/hetzner/main.tf:1036:

  condition = length(local.control_plane_cloud_init) <= 30720

t35 fresh provision (2026-05-19 17:12Z, 3-region cpx52) FAILED at
tofu apply plan-validation with that precondition firing for the
primary CP AND both secondary regions (nbg1-2 + hel1-1). Every
fresh provision since #1958 merged is blocked by this regression —
Issue #1977, TBD-A52.

Fix: move the bash bodies into a write_files entry as
/usr/local/bin/openova-externalip-bootstrap.sh, exposed as two
subcommands `l1` and `l2`. The runcmd: items now just invoke the
script via single-line calls:

  - /usr/local/bin/openova-externalip-bootstrap.sh l1
  - <k3s install line - unchanged>
  - <wait /healthz - unchanged>
  - /usr/local/bin/openova-externalip-bootstrap.sh l2

Behavior is identical to PR #1958:
  - L1 still fail-fasts with exit 87 when Hetzner metadata returns
    empty body for public-ipv4. Validated IP persists to
    /etc/openova/cp-public-ipv4 so the next runcmd reads it from disk.
  - L2 still polls Node ExternalIP up to 60s, restarts k3s once if
    empty, polls another 60s post-restart, exits 88 if still empty.
  - Same DoD A2 invariant guard, same Issue #1941 / TBD-A50 coverage.
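
The subcommand layout can be sketched as a plain `case` dispatcher. The function names, placeholder bodies, and the usage exit code 64 are assumptions for illustration; the real script bodies are the Layer 1 and Layer 2 logic carried over from PR #1958.

```shell
#!/bin/sh
# Hypothetical layout of a one-file, multi-subcommand bootstrap script.
# The two bodies are placeholders, not the shipped logic.
l1_persist_public_ip()  { echo "l1: would fetch and persist public IPv4"; }
l2_assert_external_ip() { echo "l2: would assert Node ExternalIP"; }

bootstrap_main() {
  case "${1:-}" in
    l1) l1_persist_public_ip ;;
    l2) l2_assert_external_ip ;;
    *)  echo "usage: bootstrap {l1|l2}" >&2; return 64 ;;  # 64 is an assumed usage code
  esac
}

# A real script would end with: bootstrap_main "$@"
```

With this shape, a further subcommand is one more `case` arm in the same file, so it adds no new write_files envelope.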

Side effects:
  - Verbose diagnostic echo strings trimmed (saves ~600 B). Exit
    codes 87/88 + in-script identifier (l1-fatal/l2-fatal) + Issue
    #1941 ref are enough for the cloud-init.log root-cause lookup.
    Operator runbooks reference the exit codes — those are preserved.
  - Stripped template size: 25 443 B (#1958) → 24 315 B (this PR).
  - Rendered cloud-init (post-substitution, with t35-shape vars):
    ~33 600 B → ~29 800 B in t35-equivalent model — back under the
    30 720 B guardrail.
  - Layer 3 (idempotent reconciler) is being worked on in parallel
    by agent ac0b077a — this refactor leaves headroom (~2.7 KB) for
    a third subcommand `l3` on the same script (no new write_files
    envelope cost).

Validation:
  - `tofu validate infra/hetzner/` → "Success! The configuration is
    valid." (OpenTofu v1.8.5)
  - Mock templatefile() + strip-regex measurement: rendered size with
    realistic t35-shape placeholders = 29 816 B, 904 B headroom under
    the 30 720 B guardrail.
  - Heredoc body content preserved verbatim (kubectl invocations,
    polling loops, restart-once flow, exit codes). diff against PR
    #1958 shows pure repackaging — no semantic change to the runtime
    bash.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:57:00 +04:00
e3mrah
c0b61541c4
fix: default MARKETPLACE_ENABLED=true at source (TBD-V4) — Closes #1968, Refs #1966 (#1971)
* fix: default MARKETPLACE_ENABLED=true at source (provisioner + tofu + wizard) — Closes #1968, Refs #1966

PR #1967 changed only the bootstrap-kit slot fallback to
`${MARKETPLACE_ENABLED:-true}`, but provisioner.go:1213 was still
writing `MARKETPLACE_ENABLED: "false"` literal to tfvars
(req.MarketplaceEnabled bool zero=false), substituting through the
envsubst-replaced default and leaving franchised Sovereigns
marketplace-disabled despite the slot flip.
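
The failure mode can be illustrated with POSIX shell parameter expansion, assuming the substitution engine follows the same `:-` semantics as the shell. The `resolve_marketplace` helper is hypothetical.

```shell
#!/bin/sh
# Prints what a ${MARKETPLACE_ENABLED:-true} fallback yields for a given input.
resolve_marketplace() {
  MARKETPLACE_ENABLED=$1
  echo "${MARKETPLACE_ENABLED:-true}"
}

resolve_marketplace ""        # prints true: unset or empty, fallback applies
resolve_marketplace "false"   # prints false: explicit literal bypasses the fallback
```

The fallback only helps when nothing upstream writes a value, which is why the default also has to flip at the source.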

This commit pairs the source-side default flip across all three layers:

1. handler/deployments.go CreateDeployment — pre-initialise the
   provisioner.Request with `MarketplaceEnabled: true` BEFORE
   json.Decode. encoding/json only assigns fields present in the body,
   so a POST that OMITS marketplaceEnabled keeps the pre-init true
   while the wizard's explicit `marketplaceEnabled: false`
   (StepMarketplace opt-OUT) still wins. Canonical Go pattern for
   default-true bool fields without changing the struct shape.

2. infra/hetzner/variables.tf — flip the `marketplace_enabled` tofu
   var default from `"false"` to `"true"` so a `tofu plan` outside
   catalyst-api (CI mocks, manual replays) matches the new semantics.

3. UI store.test.ts — update the stale assertion that expected
   `marketplaceEnabled === false`; INITIAL_WIZARD_STATE.marketplaceEnabled
   has been true since the D27 zero-touch ruling on 2026-05-16, and
   the persist-rehydrate path already defaults missing values to true
   (store.ts:789). The test was the last remnant of the pre-D27
   default.

Bumps bp-catalyst-platform Chart.yaml 1.4.206 → 1.4.207 and the matching
bootstrap-kit pin so the chart-pin-versus-GHCR CI gate accepts the
new release.

Unit test TestCreateDeployment_MarketplaceEnabledDefaultsTrue covers all
three semantics:
  - omitted-defaults-true            → MarketplaceEnabled=true
  - explicit-true-passes-through     → MarketplaceEnabled=true
  - explicit-false-wizard-opt-out    → MarketplaceEnabled=false

Closes #1968
Refs #1966 #1741

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra/hetzner): escape $${MARKETPLACE_ENABLED:-true} in variable description

OpenTofu interpreted the unescaped `${MARKETPLACE_ENABLED:-true}` inside
the description string as a template interpolation and rejected the
module init with "Variables not allowed" + "Extra characters after
interpolation expression". The `${...}` shell-style envsubst syntax
must be doubled to `$${...}` for OpenTofu to treat it as a literal.
Caught by `infra/hetzner — OpenTofu validate + test` CI on PR #1971.

Refs #1968

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:21:55 +04:00
e3mrah
bf3fa91be3
fix(infra): fail-fast on missing Hetzner public IP + post-install ExternalIP assertion (Refs #1941, A2 invariant) (#1958)
* fix(infra): fail-fast on missing Hetzner public IP + post-install ExternalIP assertion (Refs #1941, A2 invariant)

PR #1715 added `--node-external-ip=$CP_PUBLIC_IPV4` to the k3s server
install line, but the metadata curl was chained with `&&` to the install
command. If Hetzner metadata returns HTTP 200 with EMPTY body (observed
on t34, 2026-05-19), `curl -fsSL` exits 0, `CP_PUBLIC_IPV4=""`, and the
chain proceeds to install k3s with `--node-external-ip=` (empty). k3s
happily enrolls the node with InternalIP=10.0.1.2 and NO ExternalIP →
Cilium tunnel endpoint stays on the locally-scoped private IP → every
cross-region VXLAN tunnel resolves to 10.0.1.2 on the peer side →
inter-region pod traffic blackholes. DoD A2 invariant ("inter-region
link = DMZ WireGuard over PUBLIC IPs ALWAYS") VIOLATED. Blocks D31
(CNPG hot-standby), G5 (Hubble inter-region), all multi-region
pod-to-pod. Issue #1941 / TBD-A50.

Layer 1 — fail-fast guard in cloud-init:
  - Split the metadata curl into its own runcmd item with `|| true`
    so we can inspect the result without failing the whole script.
  - Validate the returned value is non-empty; if empty, dump curl -v
    diagnostics and exit 87 — cloud-init.log surfaces the FATAL
    immediately instead of a silent ClusterMesh blackhole hours later.
  - Persist the validated IP to /etc/openova/cp-public-ipv4 so the
    next runcmd item (the k3s install) and downstream items can read
    it without re-curl'ing.
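
A minimal sketch of the Layer 1 guard, assuming a caller-supplied `fetch_public_ipv4` stand-in for the metadata curl. The function name and message text are illustrative; a runcmd script would `exit 87` where this sketch returns 87.

```shell
#!/bin/sh
# Hypothetical sketch of the fail-fast guard.
# fetch_public_ipv4 must be defined by the caller (the real step uses curl).
l1_persist_public_ip() {
  out_file=$1
  # "|| true" lets us inspect the result instead of aborting the chain.
  ip=$(fetch_public_ipv4 || true)
  if [ -z "$ip" ]; then
    echo "l1-fatal: empty public-ipv4 from metadata (Refs #1941)" >&2
    return 87
  fi
  mkdir -p "$(dirname "$out_file")"
  printf '%s' "$ip" > "$out_file"   # later steps read the IP from disk
}
```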

Layer 2 — post-install ExternalIP assertion:
  - After `until kubectl get --raw /healthz`, poll
    node.status.addresses[type=ExternalIP] for 60s.
  - If empty, restart k3s ONCE (the systemd unit on disk already
    carries --node-external-ip from the install) and recheck for
    another 60s.
  - If still empty after restart, exit 88 with the full node YAML in
    stderr — cloud-init.log surfaces the regression and the operator
    knows D11/D31/G5 will fail BEFORE any application workload tries
    to schedule.
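
The poll, restart-once, poll-again flow of Layer 2 can be sketched as follows. `get_external_ip` and `restart_k3s` are caller-supplied stand-ins, and `POLL_TRIES` / `POLL_SLEEP` are hypothetical knobs (60 tries at 1s approximates the 60s window); none of these names come from the shipped script.

```shell
#!/bin/sh
# Hypothetical sketch of the post-install assertion.
poll_external_ip() {
  tries=${POLL_TRIES:-60}
  while [ "$tries" -gt 0 ]; do
    ip=$(get_external_ip || true)
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    tries=$(( $tries - 1 ))
    sleep "${POLL_SLEEP:-1}"
  done
  return 1
}

l2_assert_external_ip() {
  poll_external_ip && return 0
  restart_k3s                       # restart exactly once, then recheck
  poll_external_ip && return 0
  echo "l2-fatal: Node ExternalIP still empty after restart (Refs #1941)" >&2
  return 88
}
```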

Layer 3 (idempotent periodic reconciler that re-asserts ExternalIP
post-boot) is filed as a separate follow-up issue — bigger scope, needs
a systemd timer + image roll. Not blocking #1941 closure.

Validation:
  - `tofu validate` against infra/hetzner/ → "Success! The configuration
    is valid."
  - Inline bash tests for both fail-fast paths:
    * mock curl returns empty body, exit 0 → script exits 87 ✓
    * mock curl returns "49.13.123.45", exit 0 → script persists IP
      and continues ✓
  - Rendered cloud-init size (after comment-strip in main.tf:997) =
    25 443 bytes, well under the 30 720 byte guardrail (line 1037).

DO NOT close #1941 with this PR — closure requires a fresh 3-region
provision walk + cross-region pod-to-pod ping. PR ships the cloud-init
guards; convergence walk validates end-to-end.

Refs #1941

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(infra): tofu fmt main.tf (pre-existing whitespace drift unblocking CI)

The infra-hetzner-tofu.yaml workflow runs `tofu fmt -check -recursive`
before validate. main.tf has accumulated whitespace alignment drift on
two locals blocks (lines ~867-880 and ~1417-1455 — secondary-region
templatefile() arg lists) that has caused that workflow to fail RED on
every push and PR for 2+ days. This PR cannot reach a green check
without unblocking it.

This commit is whitespace-only (`tofu fmt`) — no semantic change. Kept
in a separate commit from the load-bearing #1941 fix in the previous
commit so reviewers can audit the data-plane change independently.

Refs #1941

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 18:45:19 +04:00
e3mrah
20b502d790
fix(infra/hetzner): drop tuple-shape conditional in per_prov_listeners (TBD-A35, Closes #1886) (#1894)
PR #1892 (TBD-A32 fix for shared-zone collision) introduced an HCL
"Inconsistent conditional result types" error at infra/hetzner/main.tf
line 468. Every fresh prov failed at tofu plan in 23s, e.g. A127 t29
attempt (deployment 4afd9ebceea92547) at 2026-05-19 01:08:41Z.

Root cause: `local.per_prov_listeners` was defined as

    local.parent_domains_includes_sovereign_fqdn ? [] : [HTTPS_obj, HTTP_obj]

HCL/tofu cannot unify the conditional arms: the true arm is `tuple([])`
(length 0) and the false arm is `tuple([obj_with_tls, obj_without_tls])`
(length 2). Even moving the conditional to the consumer line in
`concat()` did not fix it — the same length-0 vs length-2 tuple
unification still fails.

Fix: emit `per_prov_listeners` unconditionally as the 2-element tuple,
then suppress it at the `concat()` consumer with a for-iteration filter

    [for l in local.per_prov_listeners : l if !<collides>]

which always produces a list (length 0 or 2 — same element type), so HCL
never needs to unify two tuple types.

Validated locally with OpenTofu v1.8.5 against a minimal tfvars fixture:
- `tofu validate` → "Success! The configuration is valid."
- `tofu console` with sovereign_fqdn="t29.omani.works", parent="omani.works":
  emits 4 listeners (parent https/http for *.omani.works + per-prov
  https-t29-omani-works/http-t29-omani-works for *.t29.omani.works) —
  matches PR #1892's intent.
- `tofu console` with sovereign_fqdn="omani.works" (collision):
  emits 2 listeners (only parent https/http) — collision guard preserved.

No chart bump; this is a tofu-only change. Re-closes #1886 after #1892
re-opened it via the type-mismatch regression.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 05:33:35 +04:00
e3mrah
1da216205a
fix(gateway): add per-prov 2-label wildcard listener for shared parent zones (Closes #1886, TBD-A32) (#1892)
The Cilium Gateway template emits `hostname: *.<parent-zone>` listeners
(e.g. `*.omani.works`). Per Gateway-API spec wildcard semantics that
matches EXACTLY one label depth, so `foo.omani.works` matches but
`console.t28.omani.works` does NOT. On every shared-parent-zone topology
(every per-prov Sovereign under omani.works) the operator-facing FQDN
is 2-label-deep — `curl -skI https://console.t28.omani.works/` reset at
TLS handshake even though `sovereign-wildcard-tls-t28-omani-works`
already contained all 13 per-prov SANs.

Fix: locals.per_prov_listeners in infra/hetzner/main.tf appends an extra
listener pair hostnamed `*.<sovereign_fqdn>` bound to the per-prov cert
`sovereign-wildcard-tls-<fqdn-dashed>` rendered by
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml. Skipped when
sovereign_fqdn equals one of the declared parent-zone names (legacy
single-zone-on-apex case) so no duplicate listener-name Conflict.

Verified by simulated jsonencode against three scenarios:

1. t28 multi-zone (sovereign_fqdn=t28.omani.works, parent_domains=
   [omani.works, omani.homes]) — emits 6 listeners:
     https-omani-works     hostname=*.omani.works     cert=sovereign-wildcard-tls-omani-works
     http-omani-works      hostname=*.omani.works
     https-omani-homes     hostname=*.omani.homes     cert=sovereign-wildcard-tls-omani-homes
     http-omani-homes      hostname=*.omani.homes
     https-t28-omani-works hostname=*.t28.omani.works cert=sovereign-wildcard-tls-t28-omani-works
     http-t28-omani-works  hostname=*.t28.omani.works

2. t28 single parent zone (sovereign_fqdn=t28.omani.works,
   parent_domains=[omani.works]) — emits 4 listeners (bare `https`/`http`
   for backward-compat with legacy sectionName HTTPRoutes + per-prov
   `https-t28-omani-works`/`http-t28-omani-works`).

3. Legacy apex (sovereign_fqdn=omani.works, parent_domains=
   [omani.works]) — collision guard active, emits only bare `https`/`http`.

All scenarios produce unique listener names.
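
The naming rules behind the three scenarios can be modelled in shell as a rough cross-check. This is a sketch of the rules as described in this message (bare names for a single parent zone, suffixed names otherwise, per-prov pair unless the FQDN collides with a parent zone), not the HCL in main.tf; `sanitise` and `listener_names` are hypothetical helpers.

```shell
#!/bin/sh
# sanitise: dots to dashes, matching the listener names quoted above
sanitise() { printf '%s' "$1" | tr '.' '-'; }

# listener_names SOVEREIGN_FQDN PARENT_ZONE...
listener_names() {
  fqdn=$1; shift
  collides=no
  for zone in "$@"; do
    if [ "$zone" = "$fqdn" ]; then collides=yes; fi
  done
  for zone in "$@"; do
    if [ "$#" -eq 1 ]; then
      echo "https"; echo "http"           # bare names, single parent zone
    else
      z=$(sanitise "$zone")
      echo "https-$z"; echo "http-$z"
    fi
  done
  if [ "$collides" = no ]; then
    f=$(sanitise "$fqdn")
    echo "https-$f"; echo "http-$f"       # per-prov listener pair
  fi
}

listener_names t28.omani.works omani.works omani.homes
```

Piping each scenario through `sort -u | wc -l` is a quick way to confirm the names stay unique.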

Safe because every catalyst-system HTTPRoute now omits sectionName
(PR #1888 closing #1884) — Cilium attaches via hostname match, so the
per-prov 2-label listener catches `console.<fqdn>` / `api.<fqdn>` /
`marketplace.<fqdn>` / etc.

Refs A110 t28 scorecard, A107 D29 walk.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 05:02:36 +04:00
e3mrah
ed91f40d57
fix(sovereign-tls): wire Cilium Gateway listener at per-prov cert; stop parent-zone wildcard render (TBD-A29, Closes #1883) (#1890)
The Sovereign's Cilium Gateway listener `https-<parent-zone>` referenced
the parent-zone wildcard Secret `sovereign-wildcard-tls-<sanitised(parent)>`
(e.g. `sovereign-wildcard-tls-omani-works` for `*.omani.works`). That cert
is minted by `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml`
and SHARES Let's Encrypt's "5 New Certificates per Exact Set of Identifiers
per 168h" bucket with every other Sovereign on the same parent zone. After
~5 wipe+reprov cycles on `omani.works` the listener pinned to a
`Ready=False` Certificate (cert-manager spun the order forever, LE returned
`urn:ietf:params:acme:error:rateLimited`). A107 t28 evidence: per-prov cert
`sovereign-wildcard-tls-t28-omani-works` IS `Ready=True` but unused.

Fix (two parts):

1. `infra/hetzner/main.tf` — `parent_domains_listeners_yaml` now points
   each listener's `tls.certificateRefs[0].name` at the PER-PROV cert
   `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}` (rendered by
   `clusters/_template/sovereign-tls/cilium-gateway-cert.yaml` with the
   explicit SAN list `[console.<sovereign-fqdn>, auth.<sovereign-fqdn>,
   ..., sandbox.<sovereign-fqdn>]`). Per-prov identifier sets get their
   own 5/168h bucket per Sovereign so reprovs never share LE budget.
   New `local.sovereign_fqdn_dashed = replace(var.sovereign_fqdn, ".",
   "-")` is the SAME suffix `cilium-gateway-cert.yaml` /
   `cilium-envoy-tls-restart-job.yaml` already use, so the listener +
   cert + restart-job RBAC stay in lockstep.

2. `products/catalyst/chart/templates/sovereign-wildcard-certs.yaml` --
   skip-render unconditionally (`{{- if false }}` wrap around the
   `wildcardCert.enabled` guard). The parent-zone wildcards it minted
   are no longer referenced by anything and burn LE budget on every
   install. Template body kept for `git blame` / future revival under
   issue #831 (multi-listener per-zone tenant TLS with non-wildcard SAN
   lists). Removes 2 Certificate resources per multi-zone Sovereign.

Verification (helm template):

  helm template products/catalyst/chart \
      --set parentZones[0].name=omani.works --set parentZones[0].role=primary \
      --set parentZones[1].name=omani.homes --set parentZones[1].role=sme-pool \
      --set global.sovereignFQDN=t28.omani.works \
      --set wildcardCert.enabled=true \
    | grep -c 'sovereign-wildcard-cert'
  # before: 2  (two parent-zone Certificates rendered)
  # after:  0  (zero -- template skip-renders)

Chart bumped 1.4.182 -> 1.4.183 so the next Blueprint Release republishes
the OCI artifact with the skip-render change.

Hostname semantics unchanged: listener `hostname: *.<parent-zone>` still
matches any FQDN under the parent; cilium-envoy SNI dispatch serves the
per-prov cert whose SAN list covers the requested hostname (operator's
console/auth/gitea/etc. subdomains under `<sovereign-fqdn>`). Tenant
URLs under non-primary parent zones (`wp-foo.omani.homes`) remain out
of scope for A29; those need explicit per-tenant cert wiring via #831.

Closes #1883

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:54:18 +04:00
e3mrah
139a620ea7
fix(sovereign-tls): cilium-gateway propagates Hetzner LB annotations via spec.infrastructure (#1889)
Closes #1885 (TBD-A31).

Problem (t28 evidence — A98 + A107 reports, 2026-05-19 00:30Z):
`console.t28.omani.works:443` accepts TCP but TLS resets. Inspection:
`kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows
type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned
`hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying
443→30443 service-port at the infra layer, the cluster-side hcloud-CCM
has no signal to materialise a parallel Service-level LB for the
auto-generated gateway Service — so operators inspecting kubectl see
a non-LoadBalancer Service and conclude the LB chain is broken.

Fix:
Add `spec.infrastructure.annotations` to the Gateway resource. The
Gateway-API spec mandates that controllers propagate these annotations
to any infrastructure resources they create — in Cilium 1.16+ this means
the auto-generated `cilium-gateway-cilium-gateway` Service in kube-system.
hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) then picks the
annotations up at Service reconcile time and provisions a Hetzner LB.

Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml):
  - load-balancer.hetzner.cloud/name = <slug>-<region>-gateway
  - load-balancer.hetzner.cloud/location = <Hetzner DC>
  - load-balancer.hetzner.cloud/type = lb11
  - load-balancer.hetzner.cloud/use-private-ip = "false"  (DoD A2 — public IPs always)
  - load-balancer.hetzner.cloud/disable-private-ingress = "true"
  - load-balancer.hetzner.cloud/health-check-protocol = tcp
  - load-balancer.hetzner.cloud/health-check-port = "30443"
  - load-balancer.hetzner.cloud/health-check-interval = 15s
  - load-balancer.hetzner.cloud/health-check-timeout = 10s
  - load-balancer.hetzner.cloud/health-check-retries = "3"

Per-region segmentation: SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in
the LB name so each multi-region peer's cilium-gateway gets its own
public LB (Hetzner LBs are unique-by-name; duplicate-name allocations
collapse to the first-created instance, hiding the LB for every
subsequent region).

Wiring: 3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY,
HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's
postBuild.substitute block. These mirror the same vars already passed
to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block
in 01-cilium.yaml apiserver.service.annotations, so the configuration
boundary is symmetric across the gateway LB and the clustermesh LB.

Memory rules respected:
  - A2 (PUBLIC IPs for inter-region) — use-private-ip=false
  - feedback_overlap_provs_dont_serialize_wait (no provisioning gate)
  - feedback_subagents_inherit_design_system (no new architectural seam,
    reuses existing Gateway-API + hcloud-CCM contracts)

Validation:
  $ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway'
  → renders all 10 Hetzner LB annotations under spec.infrastructure
  → ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION}
    substituted at Flux apply time

Acceptance criteria (per issue):
  - kubectl get svc -n kube-system cilium-gateway-cilium-gateway shows
    type=LoadBalancer with external IP (after fresh prov + handover)
  - curl -skI https://console.<fqdn>/ returns HTTP 200

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:50:35 +04:00
e3mrah
f07312c5ae
fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852)
Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179
+ bootstrap-kit pin bump + cloud-init substitute extension, because each
fix is small and they share the same fresh-prov verification cycle.

TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list
networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies
get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the
chroot in-cluster fallback's k8sCache.Factory reflector emitted
continuous `networkpolicies is forbidden` errors at the cluster scope
because only update/patch/delete were granted (existing mutation block)
— the read path was never wired. Mirrors the existing
cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s
NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7).

TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields
configuredRegions / controlPlaneIP / primaryRegion / replicaRegion /
selfDeploymentId / enableHotStandby / qaApplications empty on every
fresh prov. Pre-fix the envsubst placeholders resolved to empty because
nothing wrote them into the bootstrap-kit Kustomization postBuild
substitute map → the chart rendered empty strings → Dashboard
SovereignCard configured-regions chips, Settings page operator-identity,
/api/v1/sovereign/self, and the D31 active-hot-standby gating ALL
silently fell through to default behaviour. Wired via four coordinated
changes:
  - Chart values.yaml gains global.sovereignSelfDeploymentId default
  - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId,
    sovereign.configuredRegions, sovereign.qaApplications mappings
    (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`)
  - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP
    (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION /
    SOVEREIGN_REPLICA_REGION (canonical 4-segment labels),
    SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty),
    SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list),
    QA_APPLICATIONS_YAML (reserved, default `[]`)
  - main.tf: new template inputs sovereign_configured_regions_yaml +
    replica_region_canonical_label (derived from local.secondary_regions),
    threaded into both primary CP and per-secondary-region cloud-init
    templatefile calls

TBD-A10b (issue #1845) — GET
/api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409
kubeconfig-file-missing on fresh prov for every region. Pre-fix the
handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init
PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region
key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk
filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly
call with the bare `cloudRegion` (`?region=hel1`) because that's the
matrix-doc-friendly form. Fall-back resolution order added to
GetKubeconfig: exact-name first (legacy + manual operator PUT), then
`<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test
covers all three paths: exact match, slot-suffix glob, unknown-region
still 409. Closes the regression introduced when PR #1763
(mothership→chroot kubeconfig handover hook) started using the
cloud-init naming convention for fan-out exports.
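
The fall-back resolution order can be sketched in shell, assuming a flat directory of kubeconfig files. The real handler is Go (GetKubeconfig with a sorted glob); `resolve_kubeconfig` and its arguments are hypothetical and only mirror the order described above.

```shell
#!/bin/sh
# resolve_kubeconfig DIR DEPLOYMENT_ID REGION
resolve_kubeconfig() {
  dir=$1; id=$2; region=$3
  exact="$dir/$id-$region.yaml"
  if [ -f "$exact" ]; then          # 1. exact name wins
    echo "$exact"
    return 0
  fi
  # 2. slot-suffix glob; shell globs expand in sorted order,
  #    so the first existing match is deterministic.
  for candidate in "$dir/$id-$region"-*.yaml; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1                          # 3. unknown region: handler answers 409
}
```

For example, with `<id>-hel1-1.yaml` on disk, a lookup for region `hel1` resolves through the glob, while a lookup for `hel1-1` resolves through the exact-name path.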

Closes #1843, Closes #1844, Closes #1845

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:21:38 +04:00
e3mrah
0538f6ee68
fix(infra): advertise public IP as k3s node-external-ip so Cilium inter-region tunnel works (Refs TBD-A7) (#1715)
Add --node-external-ip=$${CP_PUBLIC_IPV4} to the k3s server install in
infra/hetzner/cloudinit-control-plane.tftpl so every CP publishes BOTH
node.status.addresses[InternalIP=10.0.1.2] AND ExternalIP=<public ipv4>.

Bug evidence (Wave 28-E, t22-omantel-biz 2026-05-18):
  hel/fsn/sin all advertise InternalIP=10.0.1.2 with NO ExternalIP.
  After the 2026-05-15 per-region-network refactor every region's CP
  sits in its OWN isolated hcloud_network, so 10.0.1.2 is locally
  scoped on each VPS and NOT routable cross-region. Cilium picks the
  InternalIP as its tunnel endpoint by default → cross-region VXLAN
  tunnels resolve to 10.0.1.2 on every peer → inter-region pod traffic
  blackholes (pod-to-pod 0/6 across regions).

docs/SOVEREIGN-MULTI-REGION-DOD.md A2 mandate:
  "inter-region link = DMZ WireGuard over PUBLIC IPs ALWAYS
   (never any provider's private network)".

Publishing the public IPv4 as ExternalIP lets Cilium promote it to the
tunnel endpoint when peer addresses include External + Internal, which
restores cross-region pod reachability without breaking intra-cluster
paths — InternalIP stays primary for kube-apiserver advertise + pod-to-
CP dial (the original reason --node-ip was pinned to private in
PR-#62-era; the comment at line 1370-1378 still holds and is preserved).

Effect:
  - Only takes effect on FRESH provisions (t23+). t22 already deployed
    cannot be remediated by a cloudinit change.
  - Both primary CP and secondary CPs go through this same template
    (main.tf templatefile() calls for primary at line 636 and per
    secondary at line 1187), so a single template edit covers all
    regions.
  - Approach A (smaller / immediate). Approach B (DMZ WireGuard overlay
    DaemonSet per platform/bp-dmz-vcluster/) follows as architectural
    follow-up if A alone doesn't fully resolve cross-region pod
    traffic on t23+.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:58:14 +04:00
e3mrah
cc13aec980
fix(sovereign-tls): bare https/http listener names when single parent zone (collision with chart HTTPRoutes sectionName) (#1682)
PR #1640 renamed Cilium Gateway listeners to `https-<sanitised-zone>` /
`http-<sanitised-zone>` to support multi-zone Sovereigns (primary +
SME pool). That broke single-zone Sovereigns because every platform
chart's HTTPRoute (harbor, keycloak, grafana, gitea, openbao, powerdns,
stalwart-tenant) hardcodes `parentRefs[0].sectionName: https`. Result:
every HTTPRoute reports `Accepted=False NoMatchingListener`, Sovereign
Console / Harbor / Keycloak etc. unreachable through the Gateway.

Fix: when `len(parent_domains_decoded) == 1` (the common case), render
listener names as the bare strings `https` / `http`. When > 1 (SME pool
present), keep the unique `https-<zone>` / `http-<zone>` naming so the
Gateway controller doesn't hit a duplicate-name Conflicting condition.
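
The naming rule can be sketched in Python as follows. This is an
illustration only (the real logic lives in the Terraform locals), and it
assumes "sanitised" means dots replaced with dashes.

```python
def sanitise_zone(zone):
    # Assumption: sanitising a zone replaces "." with "-".
    return zone.replace(".", "-")


def listener_names(parent_domains):
    """Return listener names for the given parent zones."""
    if len(parent_domains) == 1:
        # Single zone: bare names so chart HTTPRoutes with
        # sectionName "https" still attach.
        return ["https", "http"]
    names = []
    for zone in parent_domains:
        s = sanitise_zone(zone)
        # Multi-zone: unique per-zone names avoid duplicate-name conflicts.
        names += [f"https-{s}", f"http-{s}"]
    return names
```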

Multi-zone tenants whose HTTPRoutes must attach under a non-primary
zone override `sectionName` via values.yaml — out of scope here.

The per-zone certificateRefs.name (`sovereign-wildcard-tls-<sanitised-zone>`)
is unchanged — independent of the listener name.

Verified: kubectl kustomize clusters/_template/sovereign-tls/ clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:51:42 +04:00
e3mrah
422da46360
fix(sovereign-tls): cilium-gateway listeners per parentZone (#1640)
Issue #831 follow-on to #827. Previously the Cilium Gateway declared a
single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under
non-primary parent zones (e.g. wp-foo.omani.homes when the operator
brings omani.homes as the SME pool) hit cilium-envoy's default fallback
cert and TLS-handshake-mismatched. The per-zone wildcard Secret rendered
by products/catalyst/chart/templates/sovereign-wildcard-certs.yaml
(PR #827) existed but had no Gateway listener claiming its hostname.

Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent
zone. Materialised at Terraform plan time as a JSON-flow array
(infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode
of the listener objects iterating decoded parent_domains_yaml), threaded
through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and
consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}`
in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone
Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay
in lockstep.

Scalar placeholder (not multi-line block) because kustomize-build parses
the YAML before Flux runs envsubst — a placeholder on its own line at
column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then
swaps it for the JSON-flow array string, which the apiserver parses as
the real listener list.

Single-zone fallback preserved (var.parent_domains_yaml empty →
[{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone
provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions
(e.g. primary omani.works + sme-pool omani.homes) render 4 listeners.
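
A rough Python sketch of the per-zone listener rendering, for illustration
only. The ports, listener names and Secret names come from this message.
The `*.<zone>` hostname and the exact listener field layout are
assumptions modelled on the Gateway API listener shape.

```python
import json


def listeners_json(parent_domains):
    """Render one HTTPS + one HTTP listener per parent zone as a JSON string."""
    listeners = []
    for zone in parent_domains:
        s = zone.replace(".", "-")  # assumed sanitisation: dots to dashes
        listeners.append({
            "name": f"https-{s}",
            "hostname": f"*.{zone}",  # assumed wildcard hostname per zone
            "port": 30443,
            "protocol": "HTTPS",
            "tls": {"certificateRefs": [{"name": f"sovereign-wildcard-tls-{s}"}]},
        })
        listeners.append({
            "name": f"http-{s}",
            "hostname": f"*.{zone}",
            "port": 30080,
            "protocol": "HTTP",
        })
    # A single-line JSON array is also valid YAML flow syntax.
    return json.dumps(listeners)
```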

Verification:
  - kubectl kustomize clusters/_template/sovereign-tls/ → clean
  - End-to-end simulation (single-zone, two-zone) renders correct
    listener counts (2 / 4) with correct certificateRefs per zone.
  - Listener naming `https-<sanitised>` / `http-<sanitised>` is unique
    per listener so Gateway controller programs them all (duplicate
    names produce Conflicting status condition).

Files:
  - clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar
    listeners placeholder + comment block explaining the why)
  - infra/hetzner/main.tf (locals.parent_domains_decoded +
    locals.parent_domains_listeners_yaml; threaded into primary CP and
    secondary regions' templatefile() calls)
  - infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML
    substitute var in sovereign-tls Kustomization block)

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:26 +04:00
e3mrah
0242be5c49
fix(infra): PR O — cilium-gateway TLS references per-zone wildcard cert (#1595)
t143 hit LE PROD rate limit (50 certs/week on omani.works exhausted)
because TWO cert templates compete for the same parent-domain quota:
1. clusters/_template/sovereign-tls/cilium-gateway-cert.yaml — legacy
   SAN cert named `sovereign-wildcard-tls`
2. products/catalyst/chart/templates/sovereign-wildcard-certs.yaml —
   chart per-zone cert named `sovereign-wildcard-tls-<sanitised-zone>`

The Cilium Gateway listener hardcoded the legacy name, so when LE 429s
the legacy cert (as happened on t143), HTTPS to console.<fqdn> breaks
even though the per-zone cert is Ready.

Fix: gateway listener now references `sovereign-wildcard-tls-${SOVEREIGN_FQDN_DASHED}`.
Cloud-init substitutes SOVEREIGN_FQDN_DASHED = replace(fqdn, ".", "-")
in the sovereign-tls Kustomization postBuild.substitute. The per-zone
cert from the chart provides the Ready Secret with this exact name.

The legacy cilium-gateway-cert.yaml SAN cert still renders for
backward-compat (some consumers may still reference it), but the
gateway listener no longer depends on it for TLS termination.

Bumps no chart version — the change is at the Flux/Kustomize layer.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 13:19:10 +04:00
e3mrah
c148ec6a34
fix(cloudinit): escape $$\{ORG_EMAIL:-\}/$$\{ORG_NAME:-\} in comment (D22) (#1575)
PR #1571 added a comment mentioning the ${ORG_EMAIL:-}/${ORG_NAME:-}
slot-file placeholders WITHOUT the $$ escape. tofu's templatefile()
parses comments and tried to interpolate ${ORG_EMAIL:-} as a tofu
expression, failing with "Extra characters after interpolation
expression; Template interpolation doesn't expect a colon".

Caught live on t133 fad01d84f5655004 — tofu plan failed in 30s.

The escape pattern is documented at main.tf:1029 (the same warning
that caught t127 last week). The $$ prefix tells tofu's templatefile to
emit a literal ${...} to cloud-init for Flux envsubst.
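
A small Python sketch of a lint that could catch this class of bug before
tofu plan. It is hypothetical and not part of the repo. It flags
shell-style default placeholders (`:-` or `:=`) that lack the extra `$`.

```python
import re

# Matches "${...:-...}" or "${...:=...}" that is NOT preceded by another "$".
UNESCAPED = re.compile(r"(?<!\$)\$\{[^}]*:[-=][^}]*\}")


def unescaped_placeholders(template_text):
    """Return shell-style placeholders that templatefile() would try to parse."""
    return UNESCAPED.findall(template_text)
```

Plain template variables such as `${region}` are left alone, since those
are meant to be interpolated by tofu.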

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:31:26 +04:00
e3mrah
57939585c0
feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22) (#1571)
* feat(chart): wire OPERATOR_EMAIL/CONTROL_PLANE_IP/GITOPS_REPO_URL/ORG_NAME (D22)

Companion to PR #1567 + #1568 — wire the env vars chrootEnsureDeployment
reads to populate the deployment record so Sovereign Console Settings
page renders real values for ownerEmail, controlPlaneIP, gitopsRepoURL,
orgName (instead of `—` placeholders).

Adds 4 new keys to the sovereign-fqdn ConfigMap (orgEmail, orgName,
controlPlaneIP, gitopsRepoURL) sourced from .Values.sovereign.* with
empty defaults. Per-Sovereign overlays wire actual values from cloud-
init substitute placeholders (mirrors regionsJson pattern).

Catalyst-api Pod now reads them via valueFrom configMapKeyRef +
optional=true (Catalyst-Zero/contabo emits no sovereign-fqdn ConfigMap
so env stays empty there — correct, mothership is signer not validator).

Validated: t132 already serves region=hel1, consoleURL, loadBalancerIP
post-#1568. This PR fills the remaining 3 D22 fields when operator wires
the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(slot-13): add D22 sovereign-side identity placeholders

Add ${ORG_EMAIL:-} + ${ORG_NAME:-} + ${SOVEREIGN_CONTROL_PLANE_IP:-} +
${GITOPS_REPO_URL:-} envsubst placeholders so when cloud-init wires
them, the chart picks them up via sovereign-fqdn ConfigMap (PR #1569)
→ catalyst-api env → chrootEnsureDeployment populates the deployment
record → Settings page renders real values instead of `—`.

This PR alone is a no-op (placeholders default to empty, same as today).
The cloud-init substitute lines + provisioner.go tfvars need to land in
a companion PR to actually populate the values on next-prov.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cloudinit): wire ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL substitutes (D22)

Companion to #1567+#1568+#1569+#1570 — the cloud-init substitute block
now emits ORG_EMAIL/ORG_NAME/GITOPS_REPO_URL into the bootstrap-kit
Kustomization's postBuild.substitute env, which the slot-13 placeholders
(#1570) consume via ${ORG_EMAIL:-}/${ORG_NAME:-}/${GITOPS_REPO_URL:-}.

Chain: provisioner.go writeTfvars → tofu vars → cloudinit templatefile
substitute → Flux Kustomization postBuild → sovereign-fqdn ConfigMap
keys (#1569) → catalyst-api env (#1569) → chrootEnsureDeployment
populates the deployment record (#1567 + #1568 fallback).

SOVEREIGN_CONTROL_PLANE_IP omitted intentionally — main.tf:691 notes
the dependency cycle (hcloud_server.cp doesn't exist at cloudinit
render time). Separate PR will source it via metadata-service or
post-create ConfigMap patch.

Next-prov (t133+) Sovereign Console Settings page now renders real
ownerEmail/orgName/gitopsRepoURL instead of `—` placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:47:04 +04:00
e3mrah
1c988b9a4b
fix(firewall): open NodePort range 30000-32767 for clustermesh LB (D11) (#1538)
PR #1537's use-private-ip approach was not viable: the per-region
Hetzner LB has no private-network attachment by default (LB private_net
is empty) and our DoD A2 architecture pins one private /24 per region
that does NOT span across regions. The LB->backend hop has to transit
the public path.

The actual blocker is the Sovereign firewall: it permits 80/443/6443/53
and blocks the NodePort range. Hetzner LB TCP health-check probes
`<node-public-ip>:<NodePort>` and gets dropped → all targets marked
unhealthy → external clients see "unexpected eof while reading" at
TLS handshake → cilium clustermesh agent stays `0/N remote clusters
ready, Waiting for initial connection`.
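
The firewall behaviour before and after this change, as a minimal Python
sketch. The port sets come from this message; the function is
illustrative and not the actual Hetzner firewall resource.

```python
ALLOWED_BEFORE = {80, 443, 6443, 53}
NODEPORT_RANGE = range(30000, 32768)  # 30000-32767 inclusive


def port_allowed(port, nodeports_open):
    """Return True if an inbound connection to `port` passes the firewall."""
    if port in ALLOWED_BEFORE:
        return True
    # With the NodePort range open, LB health checks can reach the backend.
    return nodeports_open and port in NODEPORT_RANGE
```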

Security: clustermesh-apiserver requires mTLS. Peer agents must present
a client cert signed by the peer cluster's cilium-ca (PR #1530).
Anonymous connections rejected at handshake. mTLS is the security
boundary, NOT the firewall — opening NodePorts is safe here.

Caught on t129 (6cddff7ef4432bdc, 2026-05-16) — completes the D11
incident chain (#1525 → #1528 → #1530 → #1536 → this).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:44:02 +04:00
e3mrah
1f30a08ae3
fix(chroot): seed Request.Regions[] from SOVEREIGN_REGIONS_JSON env (D5) (#1534)
The Sovereign-side catalyst-api runs in "chroot" mode — it has no
parent prov record, so chrootEnsureDeployment synthesises a minimal
in-memory Deployment with only SovereignFQDN set. The
/infrastructure/topology loader then sees empty Request.Regions[]
and falls into the live-Nodes enumeration path (buildRegionFromLiveNodes)
which only sees THIS cluster's Node(s) → emits exactly 1 Region
even on a 3-region Sovereign. /cloud?view=graph renders as
"1 cluster 1 region" — DoD D5 failure.

Caught on t126 (84c0848406dd6fdd, 2026-05-16): operator reported
`console.t126.omani.works/cloud?view=graph` showed 1 region despite
mothership openova-flow snapshot holding all 3 regions correctly.

This PR threads the canonical multi-region RegionSpec[] from the
mothership prov body all the way to the Sovereign-side catalyst-api:

  tofu var.regions
    → jsonencode → sovereign_regions_json tftpl var
    → cloud-init postBuild.substitute SOVEREIGN_REGIONS_JSON
    → bp-catalyst-platform slot 13 sovereign.regionsJson value
    → sovereign-fqdn ConfigMap key `regionsJson`
    → catalyst-api Pod env SOVEREIGN_REGIONS_JSON (valueFrom)
    → chrootEnsureDeployment parses JSON, populates Request.Regions[]
    → topology loader emits one Region per spec entry

Single-region Sovereigns: var.regions has length 1; chart writes
the array literal; chroot synth still produces 1 Region — no
regression. Empty env: chroot falls back to live-Nodes path
(legacy behavior preserved).
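
A Python sketch of the parse-or-fallback step at the end of that chain
(the real code is Go in catalyst-api). The function name and the `key`
field are hypothetical, and treating malformed JSON as "fall back" is an
assumption rather than confirmed behavior.

```python
import json


def regions_from_env(raw):
    """Return region specs from SOVEREIGN_REGIONS_JSON, or None to fall back.

    None signals the caller to use the legacy live-Nodes enumeration path.
    """
    if not raw or not raw.strip():
        return None  # empty env: legacy path
    try:
        regions = json.loads(raw)
    except json.JSONDecodeError:
        return None  # assumption: malformed input also falls back
    if not isinstance(regions, list) or not regions:
        return None
    return regions
```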

Refs DoD D5.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:45:24 +04:00
e3mrah
357feb0843
fix(tofu): escape ${...} in comment that broke templatefile() (t127) (#1533)
Unescaped `${DMZ_VCLUSTER_ENABLED:=true}` Flux envsubst expression
inside a tftpl comment was being parsed by tofu's templatefile() as
a tftpl interpolation. `:=` is not a valid tftpl operator,
so tofu plan failed with:

  ./cloudinit-control-plane.tftpl:1021,71-72: Extra characters after
  interpolation expression; Template interpolation doesn't expect a
  colon at this location.

Every other `${...}` reference in tftpl comments in this file is
properly escaped as `$${...}` (e.g. lines 12, 850, 893, 971, 996,
1039, 1138). Mine slipped through PR #1531.

Fix: rewrite the comment to NOT include any `${...}` expression
(since the expression was just illustrative), avoiding the escape
gymnastics entirely.

Caught on t127 (b7942a70f7516e9e, 2026-05-16) — first prov after
PR #1531 landed FAILED in tofu plan stage within 60s.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:39:43 +04:00
e3mrah
904686ff0d
fix(vcluster): canonical region label substitute + per-role enable flags (#1531)
Caught on t126 (84c0848406dd6fdd, 2026-05-16): bp-{dmz,mgmt,rtz}-vcluster
charts installed but DMZ Pods Pending on every region with
FailedScheduling. Pod nodeSelector was `openova.io/region=hel1`
(from `${SOVEREIGN_REGION_KEY}` substitute = Hetzner region key
"hel1"/"nbg1-1"/"sin-2"), but the k3s node-label is
`openova.io/region=hz-hel-rtz-prod` (canonical 4-segment label written
by cloud-init from `region_canonical_label` per PR #1512). Mismatch
meant every vCluster Pod across every region sat Pending.

MGMT + RTZ slot 58/59 charts also default-OFF with no substitute
flipping them on per the DoD A4 topology (primary=MGMT+DMZ;
secondary=DMZ+RTZ).

This PR:
1. Adds `SOVEREIGN_REGION_CANONICAL_LABEL` substitute to tofu cloud-init
   `bootstrap-kit` postBuild block, sourced from per-region
   `region_canonical_label` tftpl var.
2. Adds `MGMT_VCLUSTER_ENABLED` + `RTZ_VCLUSTER_ENABLED` substitutes —
   primary CP renders true/false, secondary CP renders false/true.
3. Updates bootstrap-kit slots 54/58/59 to use the canonical label
   substitute. Slots 58/59 also read the per-role enable flag.

Expected post-deploy state on a fresh 3-region prov:
  primary:    DMZ + MGMT vCluster Pods Running (RTZ rendered zero)
  secondary:  DMZ + RTZ vCluster Pods Running (MGMT rendered zero)
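
The per-role flag mapping, sketched in Python for illustration (the real
values are rendered by tofu). The function name is hypothetical.

```python
def vcluster_flags(role):
    """Return the MGMT/RTZ enable substitutes for a CP role."""
    if role not in ("primary", "secondary"):
        raise ValueError(f"unknown role: {role}")
    primary = role == "primary"
    return {
        # Primary hosts MGMT; secondaries host RTZ. DMZ runs on both.
        "MGMT_VCLUSTER_ENABLED": "true" if primary else "false",
        "RTZ_VCLUSTER_ENABLED": "false" if primary else "true",
    }
```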

Refs DoD A4 (vCluster topology).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:28:06 +04:00
e3mrah
ed19bb3f8d
fix(k3s): --disable-cloud-controller so providerID stays empty for our patch (#1524)
Caught on t123 (a3bfa56adbcfb049, 2026-05-16): Gap A v3.1's patch loop
hit k8s validation error:

  The Node "catalyst-t123-omani-works-cp1" is invalid:
  spec.providerID: Forbidden: node updates may not change providerID
  except from "" to valid

k8s allows setting providerID from empty → valid, but NOT changing it.
k3s's embedded cloud controller sets providerID=k3s://<hostname>
BEFORE our cloud-init runcmd patch fires (race window). Once set,
the patch is rejected.

Fix: --disable-cloud-controller (alone, NOT with the cloud-provider=
external kubelet arg that caused the chicken-and-egg taint in
reverted PR #1513). This disables the k3s embedded cloud controller
so it never sets providerID; the kubelet leaves providerID empty;
our runcmd patch successfully sets hcloud://<id>.

hcloud-ccm (installed later via Flux) sees the correct providerID
and allocates per-region LBs.
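
The validation rule quoted above, as a minimal Python sketch. It is a
simplification of the apiserver check, and the IDs in the usage note are
made up.

```python
def provider_id_update_allowed(current, new):
    """Mirror the rule: providerID may only go from "" to a value."""
    if current == new:
        return True  # no change
    return current == ""  # only an empty providerID may be set
```

With the embedded cloud controller disabled, `current` is still empty when
the runcmd patch fires, so `provider_id_update_allowed("", "hcloud://123")`
holds. Once `k3s://...` has been written, the same patch is rejected.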

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:25:54 +04:00
e3mrah
0ebd137547
fix(cloud-init): retry providerID patch up to 30× when Node not yet registered (#1523)
Caught on t122 (7e519eb997af236c, 2026-05-16): primary + sin patched
fine, but nbg1's kubectl patch failed because the Node object hadn't
yet appeared in the apiserver between healthz OK and Node registration.
Result: nbg1 stuck at providerID=k3s://... → CCM rejected its LB
allocation → clustermesh-apiserver external_ip stayed <pending> on
nbg1 → AutoEstablishClusterMesh couldn't fully mesh.

Add a 30-iteration loop (150s budget): get the Node first; if found,
patch; else sleep 5s. On Hetzner, the apiserver registers Nodes within
~10-30s of k3s install on healthy clusters.
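
The loop shape, sketched in Python for illustration (the real
implementation is bash in cloud-init runcmd). `node_exists` and `patch`
are hypothetical callables standing in for the kubectl get / patch calls.

```python
import time


def patch_with_retry(node_exists, patch, attempts=30, delay=5, sleep=time.sleep):
    """Patch once the Node is registered; give up after attempts * delay seconds."""
    for _ in range(attempts):
        if node_exists():
            patch()
            return True
        sleep(delay)  # Node not registered yet, wait and retry
    return False  # budget exhausted (30 * 5s = 150s by default)
```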

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:58:59 +04:00
e3mrah
ef93a2cdbe
feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520)
Architecturally-clean replacement for the reverted PRs #1513 (k3s flag)
and #1516 (pre-install hcloud-ccm). Both prior approaches broke
cold-start (chicken-and-egg with the uninitialized taint).

This patch instead lets k3s boot normally with its default embedded
cloud controller (which sets `providerID=k3s://<hostname>` — the
problem), then immediately patches the local Node's `spec.providerID`
to `hcloud://<id>` using the Hetzner instance metadata endpoint
(169.254.169.254). The patch runs ONCE per CP node, right after k3s
apiserver healthz becomes reachable, BEFORE flux-bootstrap.yaml applies
the bootstrap-kit Kustomization.

Once providerID has the canonical `hcloud://` prefix, bp-hcloud-ccm
(installed by Flux later in the bootstrap-kit chain) accepts the node
as a Hetzner-managed instance and allocates LBs for Service
type=LoadBalancer normally. That unblocks:

- D12: clustermesh-apiserver Service gets a real external IP
        instead of <pending>
- D10: AutoEstablishClusterMesh (PR #1508) can read each region's
        LB IP and write peer entries into cilium-clustermesh Secret
- D11: inter-region pod-to-pod traffic flows via Cilium WG over the
        per-region LB IPs
- D5: child catalyst-api can reach secondary regions via mesh, so
       /cloud view aggregates all 3 regions instead of 1/1

Failure is non-fatal: if metadata lookup or patch fails, we log and
continue (bp-hcloud-ccm has a chance to set providerID later via its
own node-list-and-match logic). Cold-start is never blocked.

Canonical topology (1 cpx52 per region, workerCount=0) means every
node is a CP — covered by this patch. Operator-added workers
(workerCount>0) would also need providerID patched; a follow-up Job
in bp-providerid-patcher can iterate all nodes post-Flux.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:12:26 +04:00
e3mrah
766890510b
Revert PR #1516 + #1517 — Gap A hcloud-ccm pre-install hangs cloud-init (#1518)
* Revert "fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)"

This reverts commit 05c6edb4fe.

* Revert "fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)"

This reverts commit b7140b9069.

---------

Co-authored-by: claude <claude@anthropic.com>
2026-05-16 13:32:18 +04:00
e3mrah
05c6edb4fe
fix(cloudinit): bump size guardrail 30720 → 32000 bytes (#1517)
PR #1516 added ~3KB of hcloud-ccm bootstrap manifests inline (Secret +
ServiceAccount + ClusterRoleBinding + Deployment with full toleration
list + container args). Rendered cloud-init now exceeds the 30720
precondition on every primary + secondary CP:

  Error: Resource precondition failed
  on main.tf line 716: length(local.control_plane_cloud_init) <= 30720

Caught on t118 prov (0619287065fb58c8, 2026-05-16): apply failed at
both primary AND nbg1-1 + sin-2 simultaneously.

Hetzner hard cap is 32768 bytes. Bump guardrail to 32000 (97.7% of
hard cap), which leaves a 768-byte safety margin while admitting the
bytes the hcloud-ccm pre-install legitimately needs.
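
The guardrail arithmetic as a small Python sketch. The real check is a
tofu precondition on the rendered string. Counting UTF-8 bytes here is an
assumption about how the limit is measured.

```python
HETZNER_USER_DATA_CAP = 32768  # hard cap stated above


def check_cloud_init_size(rendered, guardrail):
    """Return (passes, headroom) for a rendered cloud-init string."""
    size = len(rendered.encode("utf-8"))
    return size <= guardrail, guardrail - size


def safety_margin(guardrail):
    """Bytes left between the guardrail and the provider hard cap."""
    return HETZNER_USER_DATA_CAP - guardrail
```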

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 13:15:21 +04:00
e3mrah
b7140b9069
fix(cloud-init): pre-install hcloud-ccm before Flux (unblocks per-region LB allocation) (#1516)
DoD multi-region gates D5/D10/D11/D12-LB-pending all trace to one root
cause: k3s sets node.spec.providerID=k3s://<hostname>. hcloud-ccm
rejects every LoadBalancer-Service allocation because the prefix isn't
hcloud://, so clustermesh-apiserver Service stays <pending> →
AutoEstablishClusterMesh (PR #1508) hard-fails → no peer entries → no
inter-region pod traffic → openova-flow-emitter on secondaries can't
reach openova-flow-server on primary → /cloud view sees only 1 region.

PR #1513 attempted the kubelet-flag-only fix (--cloud-provider=external
+ --disable-cloud-controller) banking on Flux's bp-hcloud-ccm slot 55 to
install the CCM. Reverted in PR #1514 because Flux pods themselves
cannot land on a node tainted node.cloudprovider.kubernetes.io/
uninitialized=NoSchedule — chicken-and-egg, 0 HRs after 30 min.

Architecturally correct fix: pre-install hcloud-ccm via raw manifests in
cloud-init, BEFORE flux-bootstrap.yaml apply. Once the Deployment runs
(with uninitialized-taint toleration), CCM matches the node to its
Hetzner server, writes providerID=hcloud://<id>, kubelet lifts the
taint, Flux proceeds normally. Flux later "adopts" this Deployment via
bp-hcloud-ccm HelmRelease (release name collides cleanly with `helm
upgrade --install`).

Changes:
- cloudinit-control-plane.tftpl:
  - Re-add k3s install flags --disable-cloud-controller +
    --kubelet-arg=cloud-provider=external (same flags as reverted #1513).
  - New write_files entry /var/lib/catalyst/hcloud-ccm-bootstrap.yaml
    containing Secret kube-system/hcloud (token + network keys),
    ServiceAccount, ClusterRoleBinding, and Deployment with full
    toleration set (uninitialized + CriticalAddonsOnly + control-plane
    + master + not-ready). Image pulled via harbor.openova.io proxy-
    cache of hetznercloud/hcloud-cloud-controller-manager:v1.20.0
    (mirrors platform/hcloud-ccm/chart/Chart.yaml appVersion pin, per
    MIRROR-EVERYTHING rule).
  - New runcmd steps inserted AFTER the local-path StorageClass setup
    and BEFORE the kubeconfig postback: kubectl apply the manifest, then
    poll node.spec.providerID for up to 300s waiting for hcloud:// prefix.
    On timeout, dump CCM pod + logs and exit 1.

- cloudinit-worker.tftpl:
  - Add --kubelet-arg=cloud-provider=external to agent install.
    Workers join the cluster after the primary CP's CCM is up; worker
    kubelet will wait for the same external CCM to set its providerID.

Secondary regions (local.secondary_region_cloud_init in main.tf) call
the SAME cloudinit-control-plane.tftpl, so the fix inherits to every
secondary CP automatically. No main.tf changes needed — hcloud_token
and hcloud_network_name were already threaded into both primary and
secondary templatefile() calls.

DoD impact: unblocks D5 (/cloud 3-regions), D10 (Cilium peer entries),
D11 (inter-region pod-to-pod via WG), D12 (LB external IPs no longer
<pending>). After this lands plus a fresh prov, those four DoD gates
flip green; expected 13-14/14 on next t118 cycle.

Refs: docs/SOVEREIGN-MULTI-REGION-DOD.md, session_2026_05_16_t117_dod_partial.md
Reverts: tail of PR #1513 left the worker tftpl untouched, but #1514's
revert restored it to no-flag state. This PR re-applies the flag intent
correctly because the CCM is now present at the moment kubelet starts.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 13:06:49 +04:00
e3mrah
f30a49fba5
Revert "fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513)" (#1514)
This reverts commit 7f0de7fa82.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-16 12:12:38 +04:00
e3mrah
7f0de7fa82
fix(k3s): set cloud-provider=external + disable embedded CCM for hcloud-ccm (#1513)
DoD gate D12-LB-allocation root cause: k3s registers nodes with
providerID=k3s://<hostname> instead of hcloud://<server-id>. hcloud-ccm
rejects every LB allocation:

  hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have
  one of the expected prefixes (hcloud://, hrobot://, hcloud://bm-):
  k3s://catalyst-t115-omani-works-nbg1-1-cp1

This blocked clustermesh-apiserver Service from getting an external
IP on every secondary region → AutoEstablishClusterMesh (PR #1508)
couldn't write peer entries → D10/D11 fail.
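
The prefix check implied by that error, as a Python sketch. The prefix
list is copied from the error message; the function itself is
illustrative and not hcloud-ccm's code.

```python
EXPECTED_PREFIXES = ("hcloud://", "hrobot://", "hcloud://bm-")


def has_expected_prefix(provider_id):
    """Return True if the providerID starts with a prefix the CCM accepts."""
    return provider_id.startswith(EXPECTED_PREFIXES)
```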

Caught on t115.omani.works (577be15281be2587, 2026-05-16) after PR
#1509 flipped clustermesh-apiserver Service to LoadBalancer. The
NodePort default in the old chart masked this k3s-vs-hcloud-ccm
incompatibility until the LoadBalancer flip exposed it.

Fix (k3s server install line in cloudinit-control-plane.tftpl):
  + --disable-cloud-controller
  + --kubelet-arg=cloud-provider=external

Fix (k3s agent install line in cloudinit-worker.tftpl):
  + --kubelet-arg=cloud-provider=external

The k3s server flag tells the embedded cloud controller to stay out.
The kubelet flag tells kubelet to wait for an external CCM to set
providerID. hcloud-ccm (bootstrap-kit slot 36) then matches each
node to its Hetzner server by name and sets providerID=hcloud://<id>,
unblocking LB allocation, Volume CSI, and node-external-ip.

The node is briefly tainted node.cloudprovider.kubernetes.io/
uninitialized=NoSchedule until the CCM removes it — Flux's
bootstrap-kit Kustomization tolerates this taint via SOPs.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 11:34:25 +04:00
e3mrah
dc590855a1
fix(tofu): per-region cloud-init renders with secondary's own values, not primary's (#1512)
* fix(tofu): per-region cloud-init renders with secondary's own values, not primary's

Root cause: cloudinit-control-plane.tftpl hardcoded the literal
`openova.io/region=hz-fsn-rtz-prod` on the k3s install line.
Every CP node — primary AND every secondary — labeled itself with
that fixed string regardless of the cluster's real region. The
template variables `region` and `sovereign_region_key` were already
wired per-region in main.tf, but this one node-label flag was
written as a constant.

Concrete impact on prov t114.omani.works (a1448e0b9e471f5d, 2026-05-16):
  - Primary cluster (hel1) k3s nodes carried `hz-fsn-rtz-prod`
    even though Sovereign primary = hel1. qa-fixtures Pods
    targeted `openova.io/region in [hz-fsn-rtz-prod]` and silently
    landed on the wrong-named nodes — the scheduler accepted but
    the cluster name didn't match the label, breaking the
    OpenovaFlow canvas's per-region grouping and any downstream
    selector reading the label.
  - Secondary clusters (nbg1, sin) carried the same hardcoded label
    so their k3s nodes never reported their own region, again
    breaking the canvas (D13) and the Continuum DR region awareness.
  - clusters/_template/bootstrap-kit/01-cilium.yaml further masked
    the bug with a `${HCLOUD_LB_LOCATION:=hel1}` default fallback
    on the clustermesh-apiserver Service annotation — for a
    Sovereign with primary=hel1 the fallback APPEARED correct but
    silently masked any rendering failure path where the substitute
    might be missing.

Fix shape:
  1. Introduce locals.region_canonical_label in main.tf, keyed by
     region key ("primary" + every secondary key). Each value is
     computed as `hz-<region-prefix-no-digits>-rtz-prod` per
     NAMING-CONVENTION §2.1.
  2. Thread `region_canonical_label` into BOTH the primary CP
     templatefile() call (from locals.region_canonical_label["primary"])
     and the secondary CP templatefile() call (from
     locals.region_canonical_label[k]).
  3. Replace the hardcoded literal in cloudinit-control-plane.tftpl
     line 1364 with `${region_canonical_label}` — each CP now
     labels its k3s node with ITS OWN canonical region tag.
  4. Thread `QA_PRIMARY_REGION` substitute into the bootstrap-kit
     Kustomization's postBuild.substitute block so the chart's
     qaFixtures.primaryRegion seam (`${QA_PRIMARY_REGION:-hz-fsn-rtz-prod}`)
     is set to the Sovereign-wide primary region label, never the
     hardcoded `hz-fsn-rtz-prod` chart default. Identical value on
     every cluster's bootstrap-kit because qaFixtures.primaryRegion
     is Sovereign-wide singular.
  5. Remove the `${HCLOUD_LB_LOCATION:=hel1}` fallback default in
     01-cilium.yaml — the cloud-init substitute ALWAYS provides a
     value, so a missing substitute is a tofu rendering bug that
     should surface at chart admission, not silently render hel1.
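
The label derivation from step 1, as a Python sketch. It assumes
"region-prefix-no-digits" means the part of the region key before the
first dash with digits stripped. The real computation is an HCL local.

```python
def region_canonical_label(region_key, provider_prefix="hz"):
    """Derive the 4-segment canonical label from a region key."""
    prefix = region_key.split("-")[0]  # "nbg1-1" -> "nbg1"
    letters = "".join(ch for ch in prefix if not ch.isdigit())  # -> "nbg"
    return f"{provider_prefix}-{letters}-rtz-prod"
```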

Provider-agnostic per DoD A6: the `hz` prefix is correct only
because this file lives under infra/hetzner/; future infra/aws/
and infra/huawei/ modules will derive `aw` / `hw` in their own
per-module locals using the same pattern.

DoD impact unblocked:
  - D10 (cilium clustermesh peer entries): clustermesh-apiserver
    Service now annotates the correct region for hcloud-ccm LB
    allocation on every peer, not just primary=hel1.
  - D12 (clustermesh LB external IP allocated): no longer pending
    on non-hel1 primary or any secondary because the location
    annotation now reflects each peer's real region.
  - D13 (canvas per-region bubble grouping): k3s nodes report
    their actual region label so FlowNode.region values
    differentiate across clusters.

Tests added (infra/hetzner/tests/multi_region.tftest.hcl,
run "per_region_cloud_init_carries_secondarys_own_region"):
  - SOVEREIGN_REGION_KEY / HCLOUD_LB_LOCATION render per-region
    (regression test for the templatefile contract).
  - openova.io/region= node-label is the per-region canonical
    label (`hz-nbg-rtz-prod` on nbg1-1, `hz-sin-rtz-prod` on sin-2,
    `hz-hel-rtz-prod` on primary hel1).
  - QA_PRIMARY_REGION substitute carries the Sovereign's primary
    region label on every cluster's bootstrap-kit substitute.
  - Negative assertions catch any regression that re-introduces
    `hz-fsn-rtz-prod` on a non-fsn1 Sovereign.

Test result: 7 passed, 2 pre-existing failures (qa_mode SKU
override tests — unrelated, present on origin/main, separate
contract from Fix #183 body-first coalesce).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(tofu): align qa_mode SKU tests with Fix #183 body-first coalesce contract

Pre-existing test failures on origin/main since Fix #183 (PR #1386,
2026-05-11) inverted the coalesce direction in
`local.effective_cp_size = local.qa_mode ?
coalesce(var.control_plane_size, var.qa_control_plane_size) :
var.control_plane_size`. The pre-Fix-#183 tests asserted that
qa_control_plane_size wins when qa_fixtures_enabled='true', but the
new contract is the OPPOSITE: body wins (variables.tf default
`cpx22` for control_plane_size is non-empty so coalesce always picks
it first; qa-default only activates when the body is empty, which
provisioner.go achieves by CONDITIONALLY omitting the var in
writeTfvars when the operator's body has no override — see
provisioner.go:1280-1289).
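
The body-first contract, sketched in Python. `coalesce` mimics the tofu
function (first argument that is neither null nor an empty string). The
`cpx52` value in the usage note is a made-up QA size for illustration.

```python
def coalesce(*values):
    """Return the first value that is neither None nor an empty string."""
    for v in values:
        if v is not None and v != "":
            return v
    raise ValueError("no non-empty argument")


def effective_cp_size(qa_mode, control_plane_size, qa_control_plane_size):
    """Body wins; the QA default applies only when the body is empty."""
    if qa_mode:
        return coalesce(control_plane_size, qa_control_plane_size)
    return control_plane_size
```

`effective_cp_size(True, "cpx22", "cpx52")` yields `cpx22` because the
non-empty body value comes first. Only an empty body, as in
`effective_cp_size(True, "", "cpx52")`, lets the QA default through.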

Inside tofu test we can't conditionally omit a variable, so the
variables.tf default ALWAYS wins. Updated assertions:

  - qa_mode_on_flips_to_bigger_skus → asserts variables.tf default
    `cpx22` wins (the auto-flip is exercised at the provisioner-side
    boundary, not tofu-side).
  - qa_mode_on_respects_explicit_overrides → asserts the body-first
    behavior when only qa_control_plane_size is set (no
    control_plane_size override).
  - NEW qa_mode_on_body_overrides_win → asserts the operator's
    explicit control_plane_size/worker_size wins verbatim — the
    canonical "body wins" lane Fix #183 codified.
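
The body-first contract described above can be sketched in Go. This is an illustrative sketch, not code from the repository: `coalesce` only approximates OpenTofu's `coalesce()` (null is modelled as the empty string), `effectiveCPSize` is a hypothetical stand-in for `local.effective_cp_size`, and the `cpx52` QA size is an assumed example value.

```go
package main

import "fmt"

// coalesce approximates OpenTofu's coalesce(): it returns the first
// argument that is not an empty string. Null is modelled as "" here.
// The real function raises an error when every argument is empty;
// this sketch returns "" instead.
func coalesce(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// effectiveCPSize is a hypothetical stand-in for local.effective_cp_size.
func effectiveCPSize(qaMode bool, controlPlaneSize, qaControlPlaneSize string) string {
	if qaMode {
		// Body first: the QA default is only a fallback.
		return coalesce(controlPlaneSize, qaControlPlaneSize)
	}
	return controlPlaneSize
}

func main() {
	// Inside tofu test the variables.tf default is always present,
	// so the body value wins even in QA mode.
	fmt.Println(effectiveCPSize(true, "cpx22", "cpx52"))
	// Only when the variable is omitted (empty) does the QA default apply.
	fmt.Println(effectiveCPSize(true, "", "cpx52"))
}
```

The second call models what the provisioner achieves by omitting the variable from tfvars, which is the one case the tofu tests cannot reproduce.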

Test result: 10 passed, 0 failed (was 7 passed, 2 failed on
origin/main since Fix #183).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 10:57:48 +04:00
e3mrah
0c9e391d59
fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile (#1511)
* fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3)

DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh
apiserver Service MUST be LoadBalancer (NEVER NodePort).

Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted
${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign
landed with clustermesh-apiserver as NodePort, in direct violation of
A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go,
PR #1508) which hard-fails on Service.type != LoadBalancer.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15):
- 3-region cpx52 cluster (hel1+nbg1+sin) converged with HRs Ready=True
- clustermesh-apiserver Service = NodePort on all 3 regions
- cilium-clustermesh peer Secret empty (0 peers) — orchestrator
  never wrote them because of the type-check
- D10 + D12 both failed silently

Fix flips the chart default to LoadBalancer and threads Hetzner CCM
LB annotations (location, type, name) from the bootstrap-kit
substitute env. The provisioner now emits CLUSTERMESH_SERVICE_TYPE +
HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init
postBuild substitute block alongside the existing CLUSTER_MESH_NAME
+ CLUSTER_MESH_ID.

Operator escape hatch preserved: bare-metal / non-cloud Sovereigns
override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign
bootstrap-kit overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tofu): pass sovereign_fqdn_slug into secondary regions templatefile

PR #1509 added ${sovereign_fqdn_slug} reference to cloudinit-control-plane.tftpl
(for the Hetzner CCM LB name annotation on clustermesh-apiserver) and wired
it into the PRIMARY templatefile() invocation in main.tf, but missed the
SECONDARY-regions templatefile() at line ~990. Every multi-region prov
now fails at `tofu plan`:

  Invalid value for "vars" parameter: vars map does not contain key
  "sovereign_fqdn_slug", referenced at ./cloudinit-control-plane.tftpl:991,37-56.

Caught on prov t113.omani.works (82c3587b97156a08, 2026-05-15) — first
multi-region prov against #1509's chart fix. Phase-0 failed at plan
before any servers spun up.

Fix is trivial: thread the same replace(var.sovereign_fqdn, ".", "-")
through the for_each secondary block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 00:00:19 +04:00
e3mrah
5f8ba85dc5
fix(clustermesh): default clustermesh-apiserver to LoadBalancer (DoD A3) (#1509)
DoD A3 from docs/SOVEREIGN-MULTI-REGION-DOD.md: Cilium ClusterMesh
apiserver Service MUST be LoadBalancer (NEVER NodePort).

Pre-this-change: bootstrap-kit/01-cilium.yaml defaulted
${CLUSTERMESH_SERVICE_TYPE:=NodePort}. Every multi-region Sovereign
landed with clustermesh-apiserver as NodePort, in direct violation of
A3 and breaking AutoEstablishClusterMesh (handler/clustermesh.go,
PR #1508) which hard-fails on Service.type != LoadBalancer.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15):
- 3-region cpx52 cluster (hel1+nbg1+sin) converged with HRs Ready=True
- clustermesh-apiserver Service = NodePort on all 3 regions
- cilium-clustermesh peer Secret empty (0 peers) — orchestrator
  never wrote them because of the type-check
- D10 + D12 both failed silently

Fix flips the chart default to LoadBalancer and threads Hetzner CCM
LB annotations (location, type, name) from the bootstrap-kit
substitute env. The provisioner now emits CLUSTERMESH_SERVICE_TYPE +
HCLOUD_LB_LOCATION + SOVEREIGN_FQDN_SLUG into the cloud-init
postBuild substitute block alongside the existing CLUSTER_MESH_NAME
+ CLUSTER_MESH_ID.

Operator escape hatch preserved: bare-metal / non-cloud Sovereigns
override CLUSTERMESH_SERVICE_TYPE=NodePort in their per-Sovereign
bootstrap-kit overlay.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:40:04 +04:00
e3mrah
93f699326a
infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net (#1507)
* docs(sovereign): pin multi-region DoD contract — never divert from D1-D14

Founder ruling 2026-05-15: every silent compromise from the multi-region
target-state architecture is a quality violation. This file locks the
convergence contract so future Claude sessions cannot drift.

Architecture invariants A1-A6:
- 3 regions minimum (never drop to 2 to dodge provider capacity)
- Inter-region link = DMZ WireGuard over PUBLIC IPs, ALWAYS
  (no hcloud_network cross-region, no VPC peering, no Huawei VPC)
- Cilium ClusterMesh apiserver = LoadBalancer (NEVER NodePort)
- vCluster topology: primary = MGMT+DMZ, secondary = DMZ+RTZ
- Zero public exposure of K8s control-plane endpoints
- Provider-mix is canonical (assume 1 Hetzner + 1 AWS + 1 Huawei)

DoD gates D1-D14 enforced via Playwright MCP + kubectl + cilium CLI on
every fresh prov. No partial credit, no "deferred", no "matrix-drift".

Mirrored to auto-memory at
~/.claude/projects/-home-openova-repos-openova-private/memory/sovereign_multiregion_dod.md
so it loads at every session start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* infra(hetzner): per-region hcloud_network — DMZ-WG, no shared private net

Implements A1+A2+A6 from docs/SOVEREIGN-MULTI-REGION-DOD.md. Each region
gets its own hcloud_network (10.0.0.0/16 INSIDE each, not shared across).
Inter-region link is exclusively Cilium WireGuard over PUBLIC IPs through
the DMZ — no provider's internal network ever spans regions.

- Replaces hcloud_network.main + hcloud_network_subnet.{main,secondary}
  with hcloud_network.region[*] + hcloud_network_subnet.region[*]
  (for_each over toset(local.all_region_keys); primary key = "primary",
  secondary keys = slice-G1 "{cloudRegion}-{index}" shape).
- Per-region cluster-cidr (10.42+i.0/16) + service-cidr (10.96+i.0/16)
  threaded through cloud-init so ClusterMesh peers don't collide on
  pod/service CIDRs (DoD gate D11).
- Firewall: open UDP 51871 from 0.0.0.0/0 (Cilium WG inter-region
  encryption) — without this the WG mesh between regions cannot form.
- Each CP's local private IP is now uniformly 10.0.1.2 per region
  (every region has its own /24 inside its own /16 — no cross-region
  IP collision class possible by construction).
- Hetzner resource names threaded to cluster-autoscaler now use
  hcloud_network.region["primary"|<k>].name so autoscaler-spawned
  workers land in the same isolated /16 as their region's CP.
- Pre-2026-05-15 state will plan a network-recreate on next apply;
  per DoD cycle protocol this is consciously accepted (no tofu state
  mv runbook, every wipe-and-create is a fresh provision).
- tofu tests cover: per-region network count + uniform 10.0.0.0/16 +
  uniform 10.0.1.0/24 subnet + per-region cluster/service CIDRs +
  Cilium WG firewall rule existence.
- README "Network" section adds the 3-region DMZ-WG ASCII topology.
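
The per-region CIDR scheme in the list above can be sketched as follows. This is a minimal sketch under two assumptions that the commit message does not spell out: that the shorthand "10.42+i.0/16" means `10.<42+i>.0.0/16`, and that the primary region has index 0. `regionCIDRs` is a hypothetical helper name.

```go
package main

import "fmt"

// regionCIDRs returns the pod (cluster) and service CIDRs for region
// index i. Reading "10.42+i.0/16" as 10.<42+i>.0.0/16 and treating
// the primary as index 0 are assumptions made for this sketch.
func regionCIDRs(i int) (clusterCIDR, serviceCIDR string) {
	return fmt.Sprintf("10.%d.0.0/16", 42+i), fmt.Sprintf("10.%d.0.0/16", 96+i)
}

func main() {
	// Region keys are examples; the index order is assumed.
	for i, key := range []string{"primary", "nbg1-1", "sin-2"} {
		c, s := regionCIDRs(i)
		fmt.Printf("%-8s cluster-cidr=%s service-cidr=%s\n", key, c, s)
	}
}
```

Because every region gets a distinct second octet, no two ClusterMesh peers share a pod or service range, which is the property gate D11 checks.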

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tofu): apply tofu fmt — fixes CI fmt-check on PR #1507

Apply OpenTofu's canonical formatting to main.tf. No semantic
changes; only whitespace alignment under template substitute blocks
where my refactor added two fields (`cluster_cidr` and
`service_cidr`) that perturbed the prior column alignment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude <claude@anthropic.com>
2026-05-15 22:04:32 +04:00
e3mrah
3a19bb161f
fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml (#1503)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
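
The canonicalisation step described above can be sketched as follows. This is an illustrative sketch, not the code in helmwatch_bridge.go: `canonicaliseDeps` is a hypothetical name, and only the "prepend if missing" rule is taken from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// jobNamePrefix mirrors the "install-" JobNamePrefix named above.
const jobNamePrefix = "install-"

// canonicaliseDeps returns a copy of deps in which every entry starts
// with "install-". Entries that already carry the prefix are left
// unchanged, so applying the function twice gives the same result.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-hel1-2:seaweedfs"}
	fmt.Println(canonicaliseDeps(deps))
	// prints: [install-gitea install-hel1-2:seaweedfs install-hel1-2:seaweedfs]
}
```

Idempotence is what lets both the event path and the bridge apply the same rule without double-prefixing.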

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180
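
The derivation above can be sketched in Go. This is a sketch under stated assumptions, not the provisioner's helper: the commit message does not say whether the deployment ID is hashed as its string bytes or hex-decoded first, nor which byte order converts the first four digest bytes to a uint32. The sketch picks string bytes and big-endian, so its results are not guaranteed to match the IDs listed above. `deriveMeshName` and `deriveMeshID` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// ASSUMPTIONS: the ID is hashed as raw string bytes and the first four
// digest bytes are read big-endian; the real helper may differ.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveMeshName("t105.omani.works"))
	id := deriveMeshID("98395b3d9bd9c1aa")
	// With at most three regions the highest ID is id+2, which stays
	// at or below 254 and so never reaches the avoided value 255.
	fmt.Println(id >= 1 && id <= 252, id+2 <= 254)
}
```

The `mod 252 + 1` step is what rules out the reserved ID 0 while leaving headroom for the per-secondary increments applied in main.tf.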

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads the ConfigMap only once, at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 19:48:58 +04:00
e3mrah
1dc21bfd51
fix(cloud-init): accept Hetzner DHCP routes on private NIC (use-routes: true) (#1489)
The netplan stanza for the hot-attached private NIC had
`dhcp4-overrides.use-routes: false`, which discards Hetzner DHCP's
classless static routes. Result: the interface gets `10.0.1.2/32` (host
route only) with NO route for the 10.0.0.0/8 private network. The
kernel routes all return traffic (including SYN-ACK to the Hetzner LB
at 10.0.1.254) via eth0's default route — the public NIC.

The Hetzner LB's health-check SYN reaches the host over the private
network, but the SYN-ACK leaves via the public NIC, so Hetzner drops it
as asymmetric. The target stays `unhealthy` forever on every service port.
Caught live on prov 6dfade27 (omani.works, 2026-05-14): all 3 region
LBs marked unhealthy on 53/80/443 — public surface blackholed despite
3-region × 45/45 HRs Ready + valid PROD cert + envoy listening on
0.0.0.0:30443.

Confirmed via tcpdump on the host:
  enp7s0 In  10.0.1.254.X > 10.0.1.2:30443 [S]   ← SYN arrives on private
  eth0   Out 10.0.1.2:30443 > 10.0.1.254.X [S.] ← SYN-ACK on wrong NIC

Fix: change to `use-routes: true`. Hetzner DHCP-provided routes have
a higher metric than eth0's default (metric 100), so the public default
stays intact; we only gain the per-subnet 10.0.0.0/N route needed for
symmetric routing on the private NIC.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:52:01 +04:00
e3mrah
cebc9542d7
fix(cloudinit): escape ${WILDCARD_CERT_ISSUER} reference in comment so templatefile() doesn't try to interpolate it (#1485)
OpenTofu's `templatefile()` parses `${...}` expressions everywhere in the
template body — including comments. A comment on line 1072 of
cloudinit-control-plane.tftpl referenced the Kustomization-time variable
`${WILDCARD_CERT_ISSUER}` as documentation, but tofu reads it as a
template var lookup → fails with `vars map does not contain key
"WILDCARD_CERT_ISSUER"` → `tofu plan` exit 1.

Fix: escape the documentation reference with `$${WILDCARD_CERT_ISSUER}`
so it survives as literal text in the rendered file. The actual variable
binding `WILDCARD_CERT_ISSUER: "${wildcard_cert_issuer}"` two lines below
is unchanged (it correctly maps the lowercase tofu local to the
uppercase Kustomization postBuild key).
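
The failure mode above (an unescaped `${...}` anywhere in the template, comments included, being treated as a variable lookup) can be caught before `tofu plan` with a small pre-flight check. The following is a hypothetical sketch, not a tool from the repository: `missingVars` only understands bare `${name}` references and the `$${` escape, and it ignores the richer expressions and `%{ ... }` directives that the real template language supports.

```go
package main

import (
	"fmt"
	"regexp"
)

// refRe matches either an escaped "$${" (capture group empty) or a
// bare "${name}" interpolation whose name lands in capture group 1.
var refRe = regexp.MustCompile(`\$\$\{|\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}`)

// missingVars lists each bare ${name} reference in tmpl that has no
// entry in vars. Escaped $${name} sequences are skipped.
func missingVars(tmpl string, vars map[string]string) []string {
	var missing []string
	seen := map[string]bool{}
	for _, m := range refRe.FindAllStringSubmatch(tmpl, -1) {
		name := m[1]
		if name == "" || seen[name] {
			continue // escaped sequence, or already reported
		}
		if _, ok := vars[name]; !ok {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	vars := map[string]string{"wildcard_cert_issuer": "letsencrypt-dns01-prod-powerdns"}
	broken := "# see ${WILDCARD_CERT_ISSUER}\nWILDCARD_CERT_ISSUER: \"${wildcard_cert_issuer}\"\n"
	fixed := "# see $${WILDCARD_CERT_ISSUER}\nWILDCARD_CERT_ISSUER: \"${wildcard_cert_issuer}\"\n"
	fmt.Println(missingVars(broken, vars)) // the comment reference is reported
	fmt.Println(missingVars(fixed, vars))  // nothing is reported
}
```

The same check would also flag the "vars map does not contain key" class of error when a template gains a reference that one of several `templatefile()` call sites does not pass.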

Caught live on prov #81 (omani.works), the first provision after #1481
landed the WILDCARD_CERT_ISSUER threading. omantel.biz had been
provisioned BEFORE #1481 merged so it never exercised the new tftpl
path.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:20:51 +04:00
e3mrah
a88e132be9
fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu (#1481)
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:25:45 +04:00
e3mrah
a75463f76a
fix(cloud-init): wait for private NIC before k3s install (prov #71) (#1464)
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)

Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-init): wait for private NIC before k3s install (prov #71)

Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.

Effect on secondary CPs: k3s server starts with
  --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
  "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed secondary region as a permanent black hole. Diagnosed via
Hetzner rescue mode SSH 2026-05-14. The primary CP works only by luck:
the private NIC attaches faster in the fsn1 zone.

Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true
and `netplan apply`. Bail loudly if the IP/route never appears — failures
surface in cloud-init.log instead of disguising as a slow boot.

Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:39:25 +04:00
e3mrah
32e0b408bf
fix(k3s): add public IP --tls-san + openova.io/region node label (#1459)
Two related fixes for multi-region + qa-fixtures DoD on prov #64:

1. **k3s TLS cert needs the public IPv4 in SAN.**
   Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP
   (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s
   auto-generates the server cert with SANs from --tls-san flags. We
   only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2
   + cluster-ip + 127.0.0.1 only. Bridge connection from contabo
   rejected with:
     "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1,
      ::1, not 204.168.212.113"
   → silent watcher failure → 0 secondary HRs observed → canvas missing
   region sub-groups.
   Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before
   k3s install, add it as --tls-san=$CP_PUBLIC_IPV4.

2. **openova.io/region=hz-fsn-rtz-prod node label.**
   qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs,
   qa-wp Application) carry hard nodeAffinity for
   `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion
   default in products/catalyst/chart/templates/qa-fixtures/*.yaml).
   Without the label every fixture pod FailedScheduling → bp-catalyst-
   platform post-install hook waits forever → bootstrap-kit chain hangs
   at 44/45 with bp-catalyst-platform Running.
   Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP
   (qa-fixtures pin to primary by design).

Both shipped in same commit since both are inside the same k3s server
install line.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 19:38:25 +04:00
e3mrah
44913d8a6a
fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458)
prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook
timed out because the catalyst-api Helm-released pod stayed Pending
with "Too many pods. 0/1 nodes are available".

k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed
deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns
Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/
flux/cnpg sidecars fill all 110 pod slots. With workers NotReady on
prov #63 the CP carried everything alone and dropped scheduling at 110.

Bump to 220 on both CP and worker so the saturation point doesn't gate
the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU
+ 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit
weight.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 18:37:42 +04:00
e3mrah
5f4f9f2cb5
fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457)
prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop
with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects
its node IP from the primary interface, which on Hetzner cpx52 binds
to the public IPv4 (49.x.x.x) instead of the private network IP
(10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there;
nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the
private IP from cilium-config k8sServiceHost — times out, CrashLoop.

Worked by luck on cpx42 (earlier kernel + Hetzner network attach
timing). cpx52 reproduces 100%.

Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip}
in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP
AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443
(cilium-config substitute) find the API server every time.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:30 +04:00
e3mrah
68372d700b
fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448)
* fix(infra): pass cp_private_ip to primary CP templatefile too

PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.

The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed — every
multi-region provision since PR #1446 lands here at plan-time.

Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:11:23 +04:00
e3mrah
be47815ddf
fix(infra): pass cp_private_ip to primary CP templatefile too (#1447)
PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:01:43 +04:00
e3mrah
cdcc50a213
fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446)
Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3
"no stretched fault domain". Cilium on each region MUST talk to its
OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites
hardcoded the primary's IP:

1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665):
   `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region
   by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2).

2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}`
   so each region's k3s API cert validates against the LOCAL CP's IP.

3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml):
   add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR
   values so Flux postBuild.substitute can override per region. The
   cloud-init Kustomization renders the substitute var to `${cp_private_ip}`.
   Single-region (primary-only) provisions fall back to the
   default `10.0.1.2` and stay byte-identical to today.

Live evidence of the bug — prov #52 (3-region) on 2026-05-12:

  cilium-operator on nbg1 secondary:
  "Establishing connection to apiserver" host="https://10.0.1.2:6443"
  "failed to start: ... tls: failed to verify certificate:
   x509: certificate signed by unknown authority"

Each region's k3s has its OWN self-signed CA (cluster-init per CP). The
primary's API cert isn't signed by the secondary's CA → cilium crash-
loops → no CNI → flux controllers Pending → no HRs → canvas shows only
primary's HRs. This fix points each region's cilium at the LOCAL CP,
whose API server presents the matching CA from this cluster.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 19:56:18 +04:00
e3mrah
19a847e514
fix(infra): restore \n escape in secondary CP templatefile regex (#1445)
The conflict-resolution Python script in PR #1444 wrote a literal
newline where the regex string needed the two-char "\n" escape. tofu
init rejected with "Invalid multi-line string / Unterminated template
string" on main.tf:925.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:27:10 +04:00
e3mrah
4923938c2b
feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444)
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.

End-to-end change across infra + handler:

1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
   appends `?region=<kubeconfig_postback_region>` when the var is set.
   main.tf templatefile call passes empty for primary CP, `each.key`
   (e.g. "nbg1-1", "hel1-2") for each secondary region.

2) PutKubeconfig handler: reads ?region= query param. Empty → primary
   path (unchanged: stores at <dir>/<id>.yaml, sets
   Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
   → secondary path: stores at <dir>/<id>-<region>.yaml, populates
   Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
   per-region (the same bearer secures every CP's PUT — secondaries
   reuse it for their own slot). NO Phase-1 watch re-launch from a
   secondary PUT.

3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
   primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
   spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
   the Watcher on Deployment.secondaryWatchers[region]. Per-region
   watchers emit ordinary helmwatch events with region-prefixed
   Component names so the wizard's per-component view doesn't collide
   primary vs secondary bp-cilium events. They do NOT contribute to
   markPhase1Done — outcome remains the primary's classification.

4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
   bubbles + install-* nodes from each secondary watcher's
   SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
   FlowNode.region set so the canvas can colour-group. Intra-region
   finish-to-start deps emitted from cs.DependsOn — same-region only,
   never cross-region (per NAMING-CONVENTION §1.3 independent fault
   domains, no stretched cluster).

5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
   kubeconfig file on Sovereign wipe.
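
The storage layout and node-id scheme in items 2 to 4 can be sketched
as below. This is a minimal illustration, not the handler's actual code:
kubeconfigPath, regionFromPath and flowNodeID are hypothetical helper
names, and the directory, deployment id and chart name used in main are
placeholders.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// kubeconfigPath returns the on-disk slot for a deployment's kubeconfig.
// An empty region selects the primary slot (<dir>/<id>.yaml); a non-empty
// region selects that secondary's slot (<dir>/<id>-<region>.yaml).
func kubeconfigPath(dir, id, region string) string {
	if region == "" {
		return filepath.Join(dir, id+".yaml")
	}
	return filepath.Join(dir, id+"-"+region+".yaml")
}

// regionFromPath recovers the region key from a secondary kubeconfig path,
// as a scanner over <dir>/<id>-*.yaml would need to. ok is false for the
// primary file and for files that do not carry the <id>- prefix.
func regionFromPath(path, id string) (region string, ok bool) {
	base := filepath.Base(path)
	prefix := id + "-"
	if !strings.HasPrefix(base, prefix) || !strings.HasSuffix(base, ".yaml") {
		return "", false
	}
	region = strings.TrimSuffix(strings.TrimPrefix(base, prefix), ".yaml")
	return region, region != ""
}

// flowNodeID builds the per-region node id <depID>:<region>:install-<chart>.
func flowNodeID(depID, region, chart string) string {
	return fmt.Sprintf("%s:%s:install-%s", depID, region, chart)
}

func main() {
	// Placeholder values for illustration only.
	p := kubeconfigPath("/kubeconfigs", "dep1", "hel1-2")
	region, _ := regionFromPath(p, "dep1")
	fmt.Println(p, region, flowNodeID("dep1", region, "cilium"))
}
```

Keeping the region in the filename means the scanner and wipe.go can
both work from a single glob, with no extra index to keep in sync.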

Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.

Caught after the operator pointed out that 3-region prov #50 showed
only 52 install-* nodes (all from fsn1) on the canvas, which exposed
the architectural gap.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:12:38 +04:00
e3mrah
c5d891ad0b
fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443)
The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name /
hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster
autoscaler could attach scale-up VMs to the private network. The
primary CP's templatefile call at main.tf:483-485 was updated, but the
matching call for secondary regions at main.tf:899 was missed.

Result: any provision with regions[] of length > 1 fails at tofu plan
with "vars map does not contain key hcloud_network_name" referenced in
cloudinit-control-plane.tftpl:478.

Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash)
at T+0:47. Forward the same three resource refs to every secondary
region's templatefile call.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:23:53 +04:00
e3mrah
b743b646ac
fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427)
Root cause (autoscaler pod log, prov #43 chroot):
  W orchestrator.go:626 Node group workers is not ready for scaleup -
  backoff with status: Scale-up timed out for node group workers after
  15m2.273255226s

Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:
  workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[]
  workers-a6410e81b24cced  public_net.ipv4=178.105.73.210  private_net=[]

The worker cloud-init (identical to Phase-0 user_data) issues
  curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -
against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment
that URL is unreachable → k3s agent install fails silently → node never
registers with apiserver → autoscaler 15m timeout → backoff →
bp-catalyst-platform Pending Pods never schedulable → chroot canvas
tests blocked.

Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on
the cluster-autoscaler deployment so the Hetzner provider attaches every
scale-up VM to the SAME private network + firewall + ssh-key the Phase-0
Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net /
-fw / catalyst-<sov-fqdn-with-dashes>). Names flow:

  Tofu (hcloud_network.main.name + hcloud_firewall.main.name +
        hcloud_ssh_key.main.name)
   → cloudinit-control-plane.tftpl (3 new template vars)
   → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
   → flux-system/cloud-credentials Secret
   → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries
     with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
   → upstream chart's deployment env
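
The naming pattern above can be sketched in Go as follows. This is an
illustration only: in the real flow the names come from the Tofu
resources, not from Go code. hcloudNames, namesForSovereign and extraEnv
are hypothetical names, and the sketch assumes "with dashes" means every
"." in the Sovereign FQDN is replaced by "-".

```go
package main

import (
	"fmt"
	"strings"
)

// hcloudNames holds the three resource names the autoscaler must receive.
type hcloudNames struct {
	Network  string
	Firewall string
	SSHKey   string
}

// namesForSovereign derives the names from the pattern in the commit message:
// catalyst-<fqdn-with-dashes>-net, -fw, and the bare base for the ssh key.
// Assumption: dots are the only characters that need replacing.
func namesForSovereign(fqdn string) hcloudNames {
	base := "catalyst-" + strings.ReplaceAll(fqdn, ".", "-")
	return hcloudNames{
		Network:  base + "-net",
		Firewall: base + "-fw",
		SSHKey:   base,
	}
}

// extraEnv maps the names onto the env vars wired on the autoscaler deployment.
func (n hcloudNames) extraEnv() map[string]string {
	return map[string]string{
		"HCLOUD_NETWORK":  n.Network,
		"HCLOUD_FIREWALL": n.Firewall,
		"HCLOUD_SSH_KEY":  n.SSHKey,
	}
}

func main() {
	n := namesForSovereign("omantel.biz")
	fmt.Println(n.Network, n.Firewall, n.SSHKey)
}
```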

Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent
regression of the three env-var slots in chart values.yaml.

Reaffirms canonical seam: values flow through Tofu → cloud-init →
flux-system Secret → Flux valuesFrom → chart values → upstream env.
Never via kubectl patch, never via bespoke Go API calls.

Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 06:11:30 +04:00
e3mrah
22855e62d8
feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396)
Final integration piece for OpenovaFlow infrastructure path —
catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID
+ SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits
distinct region tags on every FlowNode and the snapshot returns 2× per
HR on a multi-region Sovereign.

Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go
server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-
ui temporary revert until npm workspaces land), PR #1395 (chart no-op).

## Scope vs original Agent #3 brief

The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire +
runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred:
PR #1394 reverted Agent #1's UI wiring because the Docker UI build has
no node_modules for the cross-workspace canvas source. Founder note on
#1394: "Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root."

This PR ships the infrastructure half (proxy + cloud-init + runbook).
The canvas-side rewire is a separate follow-up PR that needs npm
workspaces, not surgical edits to FlowPage.

## What ships

### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}

products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot — JSON pass-through, headers + status forwarded
- GET /stream — unbuffered SSE pass-through using http.Flusher (NOT
  httputil.ReverseProxy; that buffers and breaks text/event-stream)
- POST /events — body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign
  in-cluster Service DNS)

Routes registered in cmd/api/main.go inside the auth-gated chi.Group.

11 table-driven tests cover snapshot/events/stream pass-through, upstream
404/400/unreachable propagation, empty-deploymentId guard, SSE frames
arrive AS EMITTED, and env-default fallback.
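
A minimal sketch of the unbuffered SSE pass-through described above,
not the code in openova_flow_proxy.go. streamSSE and roundTrip are
hypothetical names, and the header values are the conventional SSE ones
rather than values confirmed by this PR. The point it illustrates is
flushing via http.Flusher after every chunk so frames are not held back.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// streamSSE copies an upstream event stream to the client and flushes
// after every chunk, so each frame is delivered as soon as it is read.
func streamSSE(w http.ResponseWriter, r *http.Request, upstreamURL string) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	// Tie the upstream request to the client's context so a disconnect
	// cancels the upstream read as well.
	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, upstreamURL, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, "upstream unreachable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("X-Accel-Buffering", "no")
	w.WriteHeader(resp.StatusCode)
	flusher.Flush()

	buf := make([]byte, 4096)
	for {
		n, readErr := resp.Body.Read(buf)
		if n > 0 {
			if _, err := w.Write(buf[:n]); err != nil {
				return // client went away
			}
			flusher.Flush()
		}
		if readErr != nil {
			return // io.EOF or cancellation
		}
	}
}

// roundTrip wires a fake upstream and the proxy together with httptest
// and returns everything the client received.
func roundTrip(frames []string) (string, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		for _, f := range frames {
			io.WriteString(w, f)
			w.(http.Flusher).Flush()
		}
	}))
	defer upstream.Close()
	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		streamSSE(w, r, upstream.URL)
	}))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := roundTrip([]string{"data: a\n\n", "data: b\n\n"})
	fmt.Printf("%q %v\n", got, err)
}
```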

### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY

- infra/hetzner/cloudinit-control-plane.tftpl: two new
  postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf — primary CP renders var.region as region key;
  secondary CP renders each.key (e.g. "hel1-1") from for_each over
  local.secondary_regions
- infra/hetzner/variables.tf — new sovereign_deployment_id var (string,
  default "" for tofu mocks)
- provisioner.go writeTfvars — writes vars["sovereign_deployment_id"]
  = req.DeploymentID
- bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal
  "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY}
  envsubst keys

### 3. Deployment record flag

handler/deployments.go State() — emits `openovaFlowEnabled: true` on
every deployment. The catalyst-ui rewire (follow-up PR) will read this
to enable the openova-flow-server adapter; legacy provisions without
the flag will keep the bridge once the rewire lands.

### 4. Verification runbook

docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body
(multi-region cpx42 fsn1+hel1, qaTestEnabled=true,
sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual
canvas checks (gated on the follow-up UI rewire), and a failure-class
triage table.

## Canonical-seam citations

1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/
   deployments.go:1244-1287 (StreamLogs): identical Content-Type +
   Cache-Control + X-Accel-Buffering header set; identical
   http.Flusher.Flush() after each write; identical r.Context().Done()
   cancel path.

2. postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893
   (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var}
   form, dual emission at primary + secondary CP for_each in main.tf.

## Verification

```
$ go build ./...
(clean)

$ go vet ./...
(clean)

$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler   1.410s

$ go test ./internal/provisioner/... -count=1
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner  0.025s
```

3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields,
TestHandleWhoami_PinSessionRBACClaims,
TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on
main HEAD without this PR — unrelated baseline state.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:01:09 +04:00
e3mrah
4e6bec7022
fix(infra): body-supplied SKUs win over QA defaults (Fix #183) (#1386)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

Fix #180 (PR #1383) merged with a sed -i error that produced
`import type  from 'react'` (an empty import binding), which is a syntax
error and broke the main build. This PR removes the malformed line entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)

Refs: openova-private prov-loop session 2026-05-11 Wave 26

* fix(infra): body-supplied SKUs win over QA defaults (Fix #183)

Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size,
var.control_plane_size)` when qa_fixtures_enabled='true'. Because
qa_control_plane_size has a non-empty default (cpx32), coalesce always
returned the QA default and silently overrode whatever the body supplied
in `controlPlaneSize`.

Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"`
explicitly (cheapest viable for the founder's collapsed-CP+worker
single-node-per-region topology with workerCount=0). The QA-default
override downgraded that to cpx32 at plan time — the explicit choice
never made it onto the hardware.

Fix #183 — invert the coalesce so body wins:

  effective_cp_size = local.qa_mode
    ? coalesce(var.control_plane_size, var.qa_control_plane_size)
    : var.control_plane_size

`provisioner.go` writeTfvars already emits control_plane_size / worker_size
only when the body's field is non-empty (so `var.control_plane_size`
inherits variables.tf's cost-optimised default when the body left it
blank). That means `coalesce(var.control_plane_size, var.qa_*)` always
has a non-empty first arg in normal flow; the QA-default fallback only
fires on a zero-override QA call that intentionally leaves the SKU empty.

No change to customer-Sovereign behaviour (qa_fixtures_enabled='false'
branch already used `var.control_plane_size` verbatim).
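
The ordering change can be modelled in Go as below. This is a sketch of
the selection logic only, not code from the repository: coalesce here
mimics Terraform's coalesce for strings (first non-empty argument), and
effectiveCPSizeOld / effectiveCPSize are hypothetical names for the
Fix #157 and Fix #183 expressions respectively.

```go
package main

import "fmt"

// coalesce returns the first non-empty argument. Terraform's coalesce
// raises an error when every argument is empty; this sketch returns "".
func coalesce(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// effectiveCPSizeOld reproduces the Fix #157 ordering: the QA default is
// listed first, so a non-empty default always wins in QA mode.
func effectiveCPSizeOld(qaMode bool, body, qaDefault string) string {
	if qaMode {
		return coalesce(qaDefault, body)
	}
	return body
}

// effectiveCPSize reproduces the Fix #183 ordering: the body-supplied size
// is listed first, and the QA default is only a fallback.
func effectiveCPSize(qaMode bool, body, qaDefault string) string {
	if qaMode {
		return coalesce(body, qaDefault)
	}
	return body
}

func main() {
	fmt.Println(effectiveCPSizeOld(true, "cpx42", "cpx32")) // QA default overrides the body
	fmt.Println(effectiveCPSize(true, "cpx42", "cpx32"))    // body wins
	fmt.Println(effectiveCPSize(true, "", "cpx32"))         // fallback to QA default
}
```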

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:04:41 +04:00
e3mrah
515c3cf38d
fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) (#1385)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

Fix #180 (PR #1383) merged with a sed -i error that produced
`import type  from 'react'` (an empty import binding), which is a syntax
error and broke the main build. This PR removes the malformed line entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)
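
The address plan above can be checked with a small Go sketch. It is an
illustration under stated assumptions, not part of the Tofu module:
hostIP, cpIP, lbAnchorIP and layout are hypothetical names, every subnet
is assumed to be an IPv4 /24 (only 10.0.1.0/24 is stated explicitly), and
capping CPs below .10 is an assumption drawn from the ".10+ for workers"
note.

```go
package main

import (
	"fmt"
	"net/netip"
)

// hostIP returns the address with the given last octet inside an IPv4 /24.
func hostIP(subnet netip.Prefix, lastOctet byte) (netip.Addr, error) {
	if !subnet.Addr().Is4() || subnet.Bits() != 24 {
		return netip.Addr{}, fmt.Errorf("expected an IPv4 /24, got %s", subnet)
	}
	b := subnet.Masked().Addr().As4()
	b[3] = lastOctet
	return netip.AddrFrom4(b), nil
}

// cpIP mirrors ip = "10.0.1.${count.index + 2}" for control-plane nodes.
// Assumption: CPs must stay below .10, where the worker range starts.
func cpIP(subnet netip.Prefix, index int) (netip.Addr, error) {
	if index < 0 || index+2 >= 10 {
		return netip.Addr{}, fmt.Errorf("cp index %d falls outside .2-.9", index)
	}
	return hostIP(subnet, byte(index+2))
}

// lbAnchorIP pins the load balancer to the top of the subnet (.254),
// outside both the CP and worker ranges.
func lbAnchorIP(subnet netip.Prefix) (netip.Addr, error) {
	return hostIP(subnet, 254)
}

// layout returns the CP and LB addresses for a subnet given as a CIDR string.
func layout(cidr string, cpIndex int) (cp, lb string, err error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return "", "", err
	}
	cpAddr, err := cpIP(p, cpIndex)
	if err != nil {
		return "", "", err
	}
	lbAddr, err := lbAnchorIP(p)
	if err != nil {
		return "", "", err
	}
	return cpAddr.String(), lbAddr.String(), nil
}

func main() {
	for _, cidr := range []string{"10.0.1.0/24", "10.0.10.0/24"} {
		cp, lb, err := layout(cidr, 0)
		fmt.Println(cidr, cp, lb, err)
	}
}
```

Because both addresses are computed rather than auto-allocated, the
order in which the LB and CP attachments complete no longer matters.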

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:00:50 +04:00