openova/infra
e3mrah 2050e72c69
fix(infra): refactor L3 ExternalIP reconciler to write_files + bump CP guardrail to 32256 (Closes #1981, Refs #1979 #1941) (#1985)
PR #1979 (TBD-A50 layer 3, merged 18:00Z 2026-05-19) added the
idempotent ExternalIP reconciler as inline runcmd heredocs and bumped
the rendered cloud-init guardrail from 30720 to 31744. The ~3 KiB of
inline bash + systemd unit heredocs overshot the new headroom: the t36
fresh-prov `tofu plan` failed with the rendered control-plane cloud-init
at ~32498 B vs the 31744 B guardrail (754 B over), tracked as issue #1981.

This PR repackages PR #1979 using the PR #1978 pattern that fixed the
analogous #1977 / TBD-A52 incident:

- Adds an `l3` subcommand to /usr/local/bin/openova-externalip-bootstrap.sh
  (the same write_files script that hosts `l1` + `l2`). Same reconciler
  logic — read /etc/openova/cp-public-ipv4, compare to Node ExternalIP,
  restart k3s on mismatch, log to /var/log/openova-externalip.log.
- Adds two new write_files entries for the systemd .service + .timer
  unit files (replaces the 3× cat-heredoc runcmd block).
- The runcmd L3 step collapses from 77 lines of inline heredocs to
  a single command line: `systemctl daemon-reload && systemctl enable --now
  openova-extip-reconcile.timer`.
- Bumps the CP cloud-init guardrail from 31744 to 32256 (Hetzner hard
  cap 32768 minus 512 B safety buffer), applied to both primary +
  secondary CP preconditions in main.tf. The +512 B headroom buys
  room for the next legitimate addition without re-tripping the gate.
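
The guardrail arithmetic above can be checked with a small sketch. The byte values come from this commit message; the function names and the `<=` comparison are assumptions for illustration and are not taken from the precondition in main.tf.

```python
# Byte values quoted in this commit message.
HETZNER_USER_DATA_CAP = 32768   # Hetzner hard cap
SAFETY_BUFFER = 512             # safety buffer kept below the cap
GUARDRAIL = HETZNER_USER_DATA_CAP - SAFETY_BUFFER  # 32256


def headroom(rendered_bytes: int, guardrail: int = GUARDRAIL) -> int:
    """Bytes left under the guardrail; a negative value means over."""
    return guardrail - rendered_bytes


def passes(rendered_bytes: int, guardrail: int = GUARDRAIL) -> bool:
    """Hypothetical gate: assumes the precondition allows size <= guardrail."""
    return headroom(rendered_bytes, guardrail) >= 0


if __name__ == "__main__":
    # The failed plan from issue #1981 against the old 31744 B guardrail.
    print(headroom(32498, 31744))
    # The same rendered size against the new 32256 B guardrail.
    print(headroom(32498))
```

Under these numbers the bump alone would not have been enough: ~32498 B is still above 32256 B, so the size reduction from the write_files refactor is what brings the render back under the gate.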

## Behavior

Behavior identical to PR #1979 — same reconciler script, same exit
codes (0=ok, 2=no-file, 3=apiserver-unreachable, 4=unrecovered), same
systemd .service `SuccessExitStatus=0 2 3 4`, same .timer `OnBootSec=2min
/ OnUnitActiveSec=5min`. Diagnostic strings trimmed (~150 B saved) but
key tokens preserved (`OK`, `MISMATCH`, `RECOVERED`, `FATAL nofile`,
`FATAL apiserver`, `FATAL unrec`, `#1941` reference).
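
As a rough illustration of the control flow described above, here is a minimal Python model of the reconciler's decision logic. It is not the bootstrap script (which is bash): the function name, the callback parameters, the order of checks, and the pairing of log tokens with exit codes are assumptions inferred from the token names and the exit-code list.

```python
from typing import Callable, List, Optional, Tuple

# Exit codes as listed in this commit message.
EXIT_OK, EXIT_NO_FILE, EXIT_APISERVER, EXIT_UNRECOVERED = 0, 2, 3, 4


def reconcile(
    expected_ip: Optional[str],                 # None models a missing cp-public-ipv4 file
    get_node_ip: Callable[[], Optional[str]],   # None models an unreachable apiserver
    restart_k3s: Callable[[], None],            # hypothetical restart hook
) -> Tuple[int, List[str]]:
    """Return (exit_code, log_tokens) for one reconcile pass."""
    log: List[str] = []
    if expected_ip is None:
        log.append("FATAL nofile")
        return EXIT_NO_FILE, log
    current = get_node_ip()
    if current is None:
        log.append("FATAL apiserver")
        return EXIT_APISERVER, log
    if current == expected_ip:
        log.append("OK")
        return EXIT_OK, log
    # Mismatch: restart k3s, then re-read the Node ExternalIP once.
    log.append("MISMATCH")
    restart_k3s()
    if get_node_ip() == expected_ip:
        log.append("RECOVERED")
        return EXIT_OK, log
    log.append("FATAL unrec")
    return EXIT_UNRECOVERED, log
```

Passing the IP lookup and the restart as callables keeps the model testable without a cluster, which mirrors how a mock battery can exercise the matching, missing-file, and mismatch-recovers paths.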

## Validation (Principle #15)

- `tofu validate infra/hetzner/` → Success
- `templatefile()` measurement harness (`/tmp/measure-cloudinit/`,
  same fixture PR #1978 used):
    - pre-fix rendered: 31865 B (over fixture 30720 by 1145 B)
    - post-fix rendered: 31130 B (under new 32256 guardrail with
      1126 B headroom)
    - savings: ~735 B vs PR #1979 baseline
- Production headroom (after +633 B fixture↔prod variance offset):
  estimated 31763 B in prod, 493 B headroom under new 32256 guardrail.
- `shellcheck` on rendered bootstrap script: clean (only one pre-
  existing SC2034 for loop counter `i`, present before this PR).
- Mock test 3-case battery (matching/missing-file/mismatch-recovers):
  rc=0/2/0 with expected log tokens.
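
The fixture-to-production projection in the list above reduces to a few additions. The sketch below only replays the numbers quoted in this message; the helper name `project_prod` is hypothetical, and treating the +633 B variance as a constant offset is the same simplifying assumption the estimate itself makes.

```python
FIXTURE_TO_PROD_OFFSET = 633  # fixture-to-prod variance quoted above, in bytes
NEW_GUARDRAIL = 32256


def project_prod(fixture_bytes: int, offset: int = FIXTURE_TO_PROD_OFFSET) -> int:
    """Estimate the production render size from a fixture measurement."""
    return fixture_bytes + offset


pre_fix_fixture = 31865
post_fix_fixture = 31130

savings = pre_fix_fixture - post_fix_fixture          # fixture-level savings
prod_estimate = project_prod(post_fix_fixture)        # estimated prod size
prod_headroom = NEW_GUARDRAIL - prod_estimate         # headroom under new gate

if __name__ == "__main__":
    print(savings, prod_estimate, prod_headroom)
```

Applying the same offset to the pre-fix fixture size gives 32498 B, which is consistent with the ~32498 B render reported for the failed plan.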

## Hard rules

- `Closes #1981` because acceptance is code-level (size proof + tofu
  validate). Functional closure of #1941 (hence only `Refs`) still
  depends on a fresh-prov walk demonstrating that the timer fires and
  the log accumulates.
- READ-ONLY on cluster. No Secrets touched. No emrah.baysal email
  / Stalwart admin API touched.

Refs #1941, #1979, #1978, #1977, #1958, #966.

Co-authored-by: hatiyildiz <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:58:57 +04:00
cloudflare-worker-leases feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159) 2026-05-09 08:01:44 +04:00
hetzner fix(infra): refactor L3 ExternalIP reconciler to write_files + bump CP guardrail to 32256 (Closes #1981, Refs #1979 #1941) (#1985) 2026-05-19 22:58:57 +04:00