13 Commits

Author SHA1 Message Date
e3mrah
31784d7ed5
fix(bp-external-dns): apiserver Endpoints sync timeout — Cilium kube-apiserver entity required (closes #770) (#771)
* fix(bp-external-dns): grant apiserver egress via CiliumNetworkPolicy (closes #770)

Root cause: ExternalDNS crashloops on every fresh Sovereign provision
with `failed to sync *v1.Endpoints: context deadline exceeded`. The
companion vanilla NetworkPolicy egress rule
`to: ipBlock: 0.0.0.0/0 ports: 443,6443` does NOT match traffic to the
kube-apiserver under Cilium with the default `policy-cidr-match-mode: ""`.
Cilium models the apiserver as a reserved identity, not a CIDR range,
so the ipBlock rule is bypassed and the apiserver call is dropped at
the egress hook of the external-dns endpoint.

Fix: render a companion CiliumNetworkPolicy with
`toEntities: [kube-apiserver]` scoped to the external-dns Pod selector.
This is the canonical Cilium pattern for controllers that watch the
apiserver. The existing vanilla NetworkPolicy is preserved verbatim so
the Blueprint remains CNI-agnostic per BLUEPRINT-AUTHORING.md.
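
A minimal sketch of what such a companion policy might look like. The
metadata name, namespace, and Pod label below are assumptions for
illustration; only `toEntities: [kube-apiserver]` and the scoping to the
external-dns Pods come from the description above.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: external-dns-apiserver-egress      # hypothetical name
  namespace: external-dns                  # assumed namespace
spec:
  # Scope the rule to the external-dns Pods only (label is an assumption).
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: external-dns
  egress:
    # Match the apiserver by its reserved Cilium identity instead of a CIDR.
    - toEntities:
        - kube-apiserver
```

Because the rule selects a reserved entity rather than an IP range, it
does not depend on the cluster-wide `policy-cidr-match-mode` setting.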

Live proof on otech93 (2026-05-04): manually applied the rendered CNP
to the running cluster, external-dns transitioned from CrashLoopBackOff
(8 restarts in 20m) to 1/1 Running within 30s, informer cache sync
completed cleanly.

Bumps bp-external-dns 1.1.6 → 1.1.7.

Why not `policy-cidr-match-mode: nodes` cluster-wide on bp-cilium? It
silently relaxes EVERY other NetworkPolicy that uses 0.0.0.0/0 in the
cluster — too broad. Per INVIOLABLE-PRINCIPLES the fix MUST be scoped
to the workload that needs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(_template): bump bp-external-dns 1.1.6 → 1.1.7 to pick up CNP fix

Pairs with the chart bump in the same PR. Every fresh otech provision
hydrates clusters/_template/, so this pin is what determines the
version installed. Without bumping here, otech94+ would still use
1.1.6 and continue to crashloop with the apiserver-egress symptom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:27:17 +04:00
e3mrah
c5ffaa2fd7
fix(bp-external-dns): livenessProbe.initialDelaySeconds=180 for cold-cluster cache-sync (closes #700) (#707)
PR #679 added --request-timeout=120s but external-dns has TWO timeouts:
RequestTimeout (per-API-call, controlled by --request-timeout) and
WaitForCacheSync (initial informer sync, hardcoded 60s in upstream binary,
NOT exposed as a flag). On a fresh Sovereign with k3s apiserver
CPU-saturated, the cache sync misses 60s -> fatal: failed to sync
*v1.Node: context deadline exceeded -> CrashLoopBackOff 5-10 times.
Caught live on otech49+ (2026-05-03), 5 restarts before stable.

Bump livenessProbe.initialDelaySeconds from upstream 10s default to 180s
so kubelet does NOT restart the Pod while the initial cache sync runs
against a CPU-saturated freshly-provisioned k3s apiserver. The Sovereign
apiserver reaches steady-state within ~2 min so 3 min comfortably covers
cold starts. Also bumps periodSeconds=30 + failureThreshold=3 so a
genuinely-hung pod is still killed within ~90s once steady-state.
readinessProbe gets a corresponding initialDelaySeconds=30 so endpoint
flapping during sync doesn't churn services.

Helm overrides REPLACE whole maps (not merge), so the override preserves
the upstream httpGet.path: /healthz + port: http shape verbatim.
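
A sketch of the values override described above. The top-level
`external-dns:` subchart key and the readinessProbe `httpGet` block are
assumptions; the timing values and the liveness `httpGet` shape are the
ones named in this message.

```yaml
external-dns:                  # assumed subchart key in the umbrella values
  livenessProbe:
    httpGet:                   # upstream shape restated verbatim
      path: /healthz
      port: http
    initialDelaySeconds: 180   # covers the cold-start initial cache sync
    periodSeconds: 30
    failureThreshold: 3        # 3 x 30s = ~90s to kill a hung pod
  readinessProbe:
    httpGet:                   # assumed to mirror the liveness probe
      path: /healthz
      port: http
    initialDelaySeconds: 30
```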

Bumps:
- platform/external-dns/chart/Chart.yaml: 1.1.5 -> 1.1.6
- clusters/_template/bootstrap-kit/12-external-dns.yaml: HelmRelease pin 1.1.5 -> 1.1.6

Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 23:39:36 +04:00
e3mrah
a50ef0ece0
fix(bp-external-dns): --request-timeout=120s for cold-cluster initial sync (1.1.5) (#679)
Caught live on otech43–46: external-dns crashloops 10+ times on a fresh
Sovereign before the initial *v1.Pod sync completes. The default 30s
timeout is insufficient when the k3s apiserver is CPU-saturated.
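
A sketch of how the flag might be passed through the chart values. The
`external-dns:` subchart key is an assumption; `--request-timeout=120s`
is the flag named in this commit.

```yaml
external-dns:                      # assumed subchart key in the umbrella values
  extraArgs:
    - --request-timeout=120s       # per-API-call timeout, raised from the 30s default
```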

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:50:37 +04:00
e3mrah
4b2ae76cfd
fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) (#589)
* fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587)

The native pdns provider in external-dns v0.15.1 does not accept
--pdns-api-version; the binary fatals at startup with:
  'unknown long flag --pdns-api-version'
causing CrashLoopBackOff (53+ restarts on otech22).

The provider auto-negotiates the PowerDNS API version — the flag is
superfluous and broken. Remove it from extraArgs.

Bump bp-external-dns 1.1.3 → 1.1.4. Bootstrap-kit slots updated for
_template, otech.omani.works, omantel.omani.works.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

* fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml

The dependency-graph-audit check was failing because:
1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but
   bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml
2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector]
   in the HelmRelease but expected-bootstrap-deps.yaml only declared
   [bp-cert-manager, bp-powerdns]

Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update
bp-external-dns depends_on to include bp-reflector in the expected DAG.
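
A sketch of the resulting entries in scripts/expected-bootstrap-deps.yaml.
The file's actual schema is not shown here, so the layout and the `slot`
key are hypothetical; only the blueprint names, slot 5a, and the
`depends_on` lists come from the description above.

```yaml
# Hypothetical layout; the real file structure may differ.
bp-reflector:
  slot: 5a
  depends_on: [bp-cert-manager]
bp-external-dns:
  depends_on: [bp-cert-manager, bp-powerdns, bp-reflector]
```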

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 15:20:00 +04:00
e3mrah
06844d3a70
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but
bp-external-dns still had `powerdnsNamespace: openova-system` in its
NetworkPolicy egress rule and `--pdns-server=...openova-system...` in
extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.

Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...

Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:58:24 +04:00
e3mrah
bcd2e7980a
fix: hide CRD-emitting resources behind Capabilities gates (closes #190) (#200)
* fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates
in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh
Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have
not yet reconciled — no longer fails with `no matches for kind X in
version Y`. The values toggles (`externalDns.serviceMonitor.enabled`,
`externalDns.externalSecret.enabled`) remain — Capabilities is defense
in depth so an operator flipping the toggle on a Sovereign that hasn't
reached Phase 2 doesn't break the bp-external-dns reconcile.
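
A sketch of the gating pattern for the ServiceMonitor template. The
resource body (name, selector label, endpoint port) is an assumption for
illustration; the values toggle and the `.Capabilities.APIVersions.Has`
check are the ones described above.

```yaml
{{- /* Render only when the toggle is on AND the CRD is already served. */}}
{{- if and .Values.externalDns.serviceMonitor.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Release.Name }}-external-dns        # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: external-dns      # assumed label
  endpoints:
    - port: http                                # assumed port name
{{- end }}
```

With this shape, flipping the toggle before the CRD exists renders
nothing instead of failing the install.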

Verified locally: `helm template` with toggles off renders 0 of these
resources; with toggles ON and `--api-versions monitoring.coreos.com/v1
--api-versions external-secrets.io/v1beta1` both render exactly once.

Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix
wave from issue #190.

* fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Three Catalyst overlay templates emit resources whose CRDs ship in OTHER
charts and were unconditionally rendered, causing a cold install of
bp-powerdns to fail with `no matches for kind X` on a Sovereign that
hasn't yet reconciled the upstream chart:

  - cnpg-cluster.yaml          → postgresql.cnpg.io/v1 Cluster
                                 (CRD ships in bp-cnpg)
  - api-ingress.yaml           → traefik.io/v1alpha1 Middleware
                                 (CRD ships with the Traefik controller;
                                  k3s ships it by default but a Sovereign
                                  overlay MAY disable Traefik in favour
                                  of cilium-only ingress)
  - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP
                                 (CRD ships when the Catalyst Crossplane
                                  composition family lands — see GAP
                                  DISCLOSURE in that template)

Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`.
The Traefik router-middleware annotation on the Ingress is similarly
gated so the auth posture cleanly moves to the Sovereign's chosen
ingress controller when Traefik is absent.

Verified locally: `helm template` with default values renders 0 of
these resources; with `--api-versions postgresql.cnpg.io/v1
--api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1`
plus `--set crossplane.floatingIP.enabled=true`, all three render
exactly once. Existing tests/observability-toggle.sh still passes.

Bump version 1.1.1 → 1.1.2.

* fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 20:10:14 +02:00
hatiyildiz
4265884d58
feat(bp-external-dns): umbrella chart + add to bootstrap-kit Kustomization
Convert platform/external-dns/chart/ from a metadata-only wrapper to a
proper Helm umbrella that pulls kubernetes-sigs/external-dns 1.15.2
(appVersion 0.15.1, k8s 1.31-validated) as a Helm subchart, mirroring
the bp-cilium / bp-cert-manager / bp-powerdns shape. Native PowerDNS
provider speaks the bp-powerdns REST API directly via the
EXTERNAL_DNS_PDNS_API_KEY env var sourced from the
powerdns-api-credentials Secret bp-powerdns renders.

Catalyst overlay templates added (default-off where applicable per the
observability-toggle rule for the bp-* family):
  - templates/networkpolicy.yaml      (default ON; egress to powerdns +
                                       cluster DNS + apiserver only)
  - templates/servicemonitor.yaml     (default OFF)
  - templates/externalsecret.yaml     (default OFF; Phase-2 OpenBao path)
  - templates/_helpers.tpl

Bootstrap-kit Kustomization gets a new 12-external-dns.yaml HelmRelease
referencing bp-external-dns:1.1.0 with dependsOn bp-cert-manager +
bp-powerdns, and the legacy 11-bp-catalyst-platform.yaml is renumbered
13- so the install ordering reads in canonical Phase-0 sequence. Mirrored
to clusters/omantel.omani.works/bootstrap-kit/ with the SOVEREIGN_FQDN
substitution applied.
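
A sketch of the shape such a HelmRelease might take. The namespace,
interval, and sourceRef name are assumptions; the chart name, version
pin, and `dependsOn` entries follow the description above.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-external-dns
  namespace: flux-system           # assumed namespace
spec:
  interval: 15m                    # assumed reconcile interval
  # Flux waits for these releases to be ready before installing this one.
  dependsOn:
    - name: bp-cert-manager
    - name: bp-powerdns
  chart:
    spec:
      chart: bp-external-dns
      version: 1.1.0
      sourceRef:
        kind: HelmRepository
        name: bp-external-dns      # hypothetical source name
```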

bp-catalyst-platform Chart.yaml drops bp-external-dns from its
dependency block — install ordering for ExternalDNS is now owned by Flux
dependsOn at the Kustomization layer rather than this umbrella's Helm
dependency graph. Bumped 1.1.0 → 1.1.1 to reflect the dep removal, and
the bootstrap-kit HelmRelease references in both clusters bumped in
lockstep.

Wrapper chart version bumped 1.0.0 → 1.1.0 (umbrella shape).

Local gates pass:
  - helm dependency build (pulls external-dns-1.15.2.tgz)
  - helm lint (0 failures)
  - helm template smoke render (245 lines, 6 kinds rendered)
  - helm package + tar-tzf verifies external-dns subchart inside the
    packaged tgz (subchart-guard simulation passes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:29:27 +02:00
hatiyildiz
02b5b6c4c8
fix(bootstrap-kit): override cilium + cert-manager values to disable observability toggles
Live verified on omantel: bp-cilium and bp-cert-manager v1.1.0 fail Helm
install with 'no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1'. Manual kubectl-patch of the live HelmRelease
worked but Flux's 15-min reconcile rolls back the patch because the
HelmRelease CR is owned by the kustomize-controller from git.

Override the values inline in the HelmRelease manifests so the change is
durable across Flux reconciles. The in-flight observability-toggle agent
will apply the same pattern to all 12 charts in the next chart bump (v1.1.1).
This is the manifest-level workaround that unblocks the running omantel
cluster TODAY without waiting for v1.1.1 publish.

Mirrors the patches into both clusters/_template/bootstrap-kit/ AND
clusters/omantel.omani.works/bootstrap-kit/ so future Sovereigns inherit.
2026-04-29 19:17:08 +02:00
hatiyildiz
f5daac52af
refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
c07e0ad1ee
feat(external-dns): #109 — author bp-external-dns leaf chart for OCI publish
The bp-catalyst-platform umbrella (issue #104) declares a dependency on
bp-external-dns:1.0.0 — but the chart didn't exist; only README + Dynadot
multi-domain policy lived under platform/external-dns/. Without this leaf
the umbrella's `helm dependency build` fails (verified in run 25068433765).

This commit authors the minimal target-state leaf:
- Chart.yaml: name=bp-external-dns, version=1.0.0
- values.yaml: catalystBlueprint.upstream metadata (external-dns 1.15.0
  from kubernetes-sigs/external-dns Helm repo) + Catalyst-curated values
  overlay (sources, txtOwnerId, ServiceMonitor, RBAC, resources)

Per BLUEPRINT-AUTHORING.md §3, leaf charts are pure values-overlay wrappers:
no templates dir, just Chart.yaml + values.yaml with the catalystBlueprint
metadata block read by the bootstrap-kit installer at helm-install time.

Per-Sovereign provider/zone/credential overrides are overlaid by the
Crossplane Composition that materializes the HelmRelease — keeping this
chart provider-agnostic (no hardcoded Cloudflare/Dynadot/Hetzner choice
per INVIOLABLE-PRINCIPLES.md §4).

After this lands, blueprint-release.yaml will publish
ghcr.io/openova-io/bp-external-dns:1.0.0 and the next umbrella push will
resolve all 11 leaf deps successfully.
2026-04-28 19:42:23 +02:00
hatiyildiz
f0fe3006ba
feat(external-dns): #109 — Catalyst-curated dynadot-multi-domain policy
Adds platform/external-dns/policies/dynadot-multi-domain.yaml — the
canonical external-dns + dynadot webhook deployment that ships in every
Sovereign on an OpenOva pool domain.

Why a webhook: external-dns has no upstream Dynadot provider; the
canonical pattern is the webhook RPC contract, with a sidecar that
implements the provider in our preferred language. We reuse the same
internal/dynadot/ package the catalyst-api uses, so the never-wipe rule,
record encoding, and managed-domain allowlist are identical on both
write paths (per docs/INVIOLABLE-PRINCIPLES.md #2 — no duplicate
implementations of the same concern).

Multi-domain:
- One --domain-filter per zone in the external-dns args; adding a third
  pool domain (e.g. acme.io) is a one-line edit here PLUS a one-key edit
  on dynadot-api-credentials' `domains` field. No webhook rebuild.
- Webhook reads DYNADOT_MANAGED_DOMAINS from the same secret with
  optional=true, preserving backward compatibility with the legacy
  single-`domain` secret shape (pre-#108).

TXT registry:
- --txt-owner-id=$(SOVEREIGN_FQDN), --txt-prefix=_externaldns.<sub>.
- Cluster overlays substitute SOVEREIGN_FQDN via the bp-catalyst-platform
  umbrella so two clusters sharing a parent zone (alpha.omani.works,
  beta.omani.works) cannot collide.

Closes #109.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:45:53 +02:00
hatiyildiz
5834daec14
docs(pass-10): banners on 7 more components + opentofu active-active drift fix
7 more component READMEs got role-in-Catalyst banners:

- vpa, keda, reloader → per-host-cluster scaling/ops layer (§3.4).
  Reloader specifically calls out its role in Catalyst's secret-
  rotation flow (rolling deploy on K8s Secret hash change).
- external-dns → per-host-cluster DNS-sync (§3.1); pairs with k8gb
  for the GSLB zone separation.
- coraza → DMZ-block WAF on every host cluster (§3.1).
- crossplane → per-Sovereign on the management cluster (§3.2);
  banner explicitly emphasizes the agreed "never a user-facing
  surface" rule (Users don't write Compositions in Application
  configs; Blueprint authors and advanced contributors do). Cross-
  references the no-fourth-surface clause in ARCHITECTURE §4/§7
  and the Crossplane Composition section in BLUEPRINT-AUTHORING §8.
- opentofu → repositioned as Phase-0-only, runs on `catalyst-
  provisioner` only, NOT installed on host clusters at runtime.

opentofu drift fixes (uncovered by line-by-line read):
- Section 5 line 182: "Bootstrap Wizard prompts for cloud credentials"
  → "Catalyst Bootstrap (Phase 0) prompts for cloud credentials"
  (banned term).
- Same section line 186: "ESO PushSecrets sync to both regional
  OpenBao instances" — the active-active drift Pass 7 corrected
  elsewhere, still here. Replaced with "writes go to the primary
  OpenBao region only; replicas pick up via async perf replication".

VALIDATION-LOG: Pass 10 entry added.

Refs #37
2026-04-27 21:43:45 +02:00
talent-mesh
c9d04a53b4
refactor: flatten platform/ structure (41 components)
Remove hierarchical grouping (networking/, security/, etc.) and use flat
structure for all 41 platform components.

Changes:
- All components now directly under platform/ (no subfolders)
- AI Hub components moved from meta-platforms/ai-hub/components/ to platform/
- Open Banking components (lago, openmeter) moved to platform/
- meta-platforms/ now only contains README files that reference platform/
- Open Banking custom services remain in meta-platforms/open-banking/services/

Structure:
- platform/ (41 components, flat)
- meta-platforms/ai-hub/ (README only, references platform/)
- meta-platforms/open-banking/ (README + 6 custom services)

All documentation links updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:19:48 +00:00