e3mrah 71e8101363	2026-05-20 00:38:25 +04:00

fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989) (#1991)

* fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989)

t37 canonical walk on nbg1-2 / hel1-1 secondary CPs surfaced a second stuck-HR failure mode: helm-controller completes the install (the HR's own `.status.history[0].status` flips to "deployed"), but apiserver flap on the slow secondary CP loses the write that flips `.status.conditions[type=Ready]` from Unknown to True. The existing suspend-toggle recovery (issue #925) does NOT fix this because helm-controller's "release in storage" short-circuit returns yes on every subsequent reconcile, so it never re-evaluates Ready.

This PR extends the stuckHelmReleaseRecovery CronJob with a second detection branch:

for hr where
    .status.conditions[type=Ready].status == "Unknown"
    AND age(Unknown) > stuckThreshold (default 5m)
    AND .status.history[0].status == "deployed"
    AND metadata.annotations["stuck-hr-recovery.openova.io/auto-corrected-at"] == ""
→ kubectl annotate hr stuck-hr-recovery.openova.io/auto-corrected-at=<RFC3339>
→ kubectl patch hr --subresource=status --type=merge status.conditions=[{type:Ready, status:True, reason:ReconciliationSucceeded, message:"auto-corrected from deployed-but-unknown-Ready by stuck-hr-recovery (TBD-A66)", lastTransitionTime:<RFC3339>}]

Safety / idempotency:
- Annotation acts as both audit trail AND idempotency guard. Re-runs on an already-corrected HR skip immediately.
- If the status patch fails, the annotation is rolled back so the next CronJob run re-attempts.
- Guardrail unchanged: >10 acted-on HRs in a single run → exit 1 + operator alert.
- The 10-HR guardrail spans BOTH branches combined.

RBAC additions:
- helmreleases/status with verbs [patch, update]. The status subresource is a separate RBAC target in Kubernetes. Without this rule `kubectl patch --subresource=status` returns 403.

Validation:
- tests/leader-election-and-recovery.sh: 6 → 7 cases (existing 6 issue #925 cases still PASS; new Case 7 covers TBD-A66: script contains history[0].status check, status-subresource patch verb, audit annotation key, helmreleases/status ClusterRole verb, and operator-greppable "auto-corrected from deployed-but-unknown-Ready" audit string).
- Mock JSONPath replay against 4 synthetic HRs: branch B routes deployed-but-unknown to status patch, branch A still handles pending-install via the secret check, idempotency annotation correctly skips re-run, healthy Ready=True HR is no-op.

Chart bump:
- platform/flux/chart/Chart.yaml: 1.2.2 → 1.2.3
- clusters/_template/bootstrap-kit/03-flux.yaml: bp-flux HR pin 1.2.2 → 1.2.3 (the existing pin for omantel/otech live clusters sits at 1.1.3, unchanged; those clusters are pre-#925 baseline).

Closure note:
- Refs #1989 (not Closes; closure happens when the t37 canonical walk reaches handover successfully on a fresh prov).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-flux): bump blueprint.yaml spec.version 1.2.2 → 1.2.3 (lockstep with Chart.yaml)

Companion to TBD-A66 / #1989 bump. CI gate `TestBootstrapKit_BlueprintVersionLockstepSweep` (TBD-A20, #1856) asserts blueprint.yaml spec.version == chart/Chart.yaml version per platform/*. Missed this in the parent commit because the older bp-flux bumps (1.2.1 → 1.2.2 etc.) did not require this companion bump back when the lockstep gate didn't exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: claude-bot <claude-bot@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
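
The RBAC addition named in the commit message above could be expressed roughly as follows. This is a minimal sketch, not the chart's actual template: the ClusterRole name is hypothetical, and the API group is assumed to be Flux's standard `helm.toolkit.fluxcd.io`. Only the resource and the two verbs come from the commit message.

```yaml
# Hypothetical ClusterRole fragment for the recovery CronJob's service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: stuck-hr-recovery            # assumed name, not taken from the chart
rules:
  # The status subresource is authorized separately from the parent resource,
  # so `kubectl patch --subresource=status` needs its own rule.
  - apiGroups: ["helm.toolkit.fluxcd.io"]
    resources: ["helmreleases/status"]
    verbs: ["patch", "update"]
```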
chart	fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989 ) (#1991 )	2026-05-20 00:38:25 +04:00
blueprint.yaml	fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989 ) (#1991 )	2026-05-20 00:38:25 +04:00
README.md	refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171 )	2026-04-29 08:51:09 +02:00

README.md

Flux

GitOps delivery engine. Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3.2). Catalyst runs one Flux instance per vcluster (lightweight: source + kustomize + helm controllers) plus a host-level Flux on each host cluster for the Catalyst control plane itself. Each per-vcluster Flux pulls from the single per-Sovereign Gitea instance (see docs/ARCHITECTURE.md §4).

Status: Accepted | Updated: 2026-04-27

Overview

Flux provides GitOps-based continuous delivery with Gitea as the internal Git provider and External Secrets Operator for secrets management.

Architecture

flowchart TB
    subgraph Gitea["Gitea (Internal Git)"]
        Components[Component Repos]
        Organization[Organization Repos]
    end

    subgraph K8s["Kubernetes Cluster"]
        subgraph Flux["Flux Controllers"]
            Source[source-controller]
            Kustomize[kustomize-controller]
            Helm[helm-controller]
            Notify[notification-controller]
        end

        subgraph ESO["Secrets"]
            ESOp[External Secrets Operator]
            PS[PushSecrets]
        end

        Resources[K8s Resources]
    end

    subgraph External["External"]
        OpenBao[OpenBao]
    end

    Components --> Source
    Organization --> Source
    Source --> Kustomize
    Source --> Helm
    Kustomize --> Resources
    Helm --> Resources
    Notify --> Gitea
    ESOp --> OpenBao
    ESOp --> Resources

Why Flux

Factor	Flux	ArgoCD
Memory overhead	~200MB	~500-800MB
Architecture	Kubernetes-native CRDs	Separate UI/API
Secrets	Via ESO (PushSecrets)	Via ESO (PushSecrets)
CLI workflow	Excellent	UI-focused

Key Decision Factors:

Lower resource overhead
CLI-focused workflow suits a single-developer team
Kubernetes-native CRDs
Works well with External Secrets Operator
Integrates seamlessly with Gitea

Components

Controller	Memory	Purpose
source-controller	64MB	Git/Helm repo sync
kustomize-controller	64MB	Kustomization apply
helm-controller	64MB	HelmRelease management
notification-controller	32MB	Alerts
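
The architecture diagram shows notification-controller reporting back to Gitea. One way this might be wired is a Provider of type `gitea` plus an Alert, sketched below. The resource names and the `gitea-status-token` Secret are hypothetical; the Secret is assumed to hold a Gitea access token under the `token` key.

```yaml
# Sketch only: posts reconciliation results to Gitea as commit statuses.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: gitea-status                 # hypothetical name
  namespace: flux-system
spec:
  type: gitea
  address: https://gitea.<location-code>.<sovereign-domain>/<org>/<component>
  secretRef:
    name: gitea-status-token         # hypothetical Secret with a `token` key
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: <component>-status
  namespace: flux-system
spec:
  providerRef:
    name: gitea-status
  eventSources:
    # Commit statuses are derived from Kustomization events.
    - kind: Kustomization
      name: <component>
```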

Repository Structure

flux/
├── clusters/
│   └── <region>/
│       ├── flux-system/       # Flux controllers
│       ├── network/           # cilium, stunner, powerdns
│       ├── security/          # kyverno, external-secrets, cert-manager
│       ├── database/          # cnpg, ferretdb, valkey
│       ├── middleware/        # strimzi
│       ├── storage/           # seaweedfs, velero
│       ├── observability/     # grafana (LGTM stack)
│       ├── autoscaling/       # keda
│       ├── workplace/         # stalwart
│       └── organizations/     # per-Organization Application deployments

Category	Components	Purpose
network	cilium, stunner, powerdns	CNI + Service Mesh, TURN, authoritative DNS + lua-records GSLB
security	kyverno, external-secrets, cert-manager	Policy, secrets, TLS
database	cnpg, ferretdb, valkey	Database operators
middleware	strimzi	Apache Kafka streaming
storage	seaweedfs, velero	Object storage, backup
observability	grafana	LGTM stack
autoscaling	keda	Event-driven scaling
workplace	stalwart	Email server
organizations	`<org>`	Per-Organization Application deployments

Configuration

GitRepository for Gitea

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: <component>
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitea.<location-code>.<sovereign-domain>/<org>/<component>.git
  ref:
    branch: main
  secretRef:
    name: gitea-token

Kustomization

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: <component>
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: <namespace>
  sourceRef:
    kind: GitRepository
    name: <component>
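  path: ./deploy/prod
  prune: true
  # wait: true health-checks every resource this Kustomization applies.
  wait: true
  # Flux ignores spec.healthChecks while wait is true; this list only takes
  # effect if wait is removed or set to false.
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: <component>
      namespace: <namespace>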

Multi-Region GitOps

Catalyst runs one Gitea per Sovereign on the management cluster (see docs/PLATFORM-TECH-STACK.md §2.3). Each workload region's Flux pulls from that single Gitea over the cross-region network. High availability comes from Gitea's intra-cluster replicas and CNPG-backed metadata, not from a cross-region bidirectional mirror.

flowchart TB
    subgraph Mgt["Management cluster (per Sovereign)"]
        Gitea[Gitea — single instance, K8s-replicated for HA]
    end

    subgraph Region1["Workload region 1 (rtz)"]
        Flux1[Per-vcluster Flux]
        K8s1[Org vcluster workloads]
    end

    subgraph Region2["Workload region 2 (rtz)"]
        Flux2[Per-vcluster Flux]
        K8s2[Org vcluster workloads]
    end

    Gitea --> Flux1
    Gitea --> Flux2
    Flux1 --> K8s1
    Flux2 --> K8s2

One Gitea per Sovereign (on the mgt cluster). HA via intra-cluster replicas, not cross-region mirror.
Each vcluster runs its own lightweight Flux that pulls from this single Gitea.
Per-region configuration is encoded in the Environment's manifests via Kustomize overlays / Placement metadata.
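
A per-region Kustomize overlay of the kind mentioned above might look like the sketch below. The directory layout (`../../base`) and the patched field are illustrative assumptions, not the repository's actual structure.

```yaml
# overlays/<region>/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                       # shared manifests, assumed location
patches:
  # Region-specific override applied on top of the base Deployment.
  - target:
      kind: Deployment
      name: <component>
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```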

Secrets Management

Flux relies on External Secrets Operator (ESO) with the PushSecrets pattern:

No SOPS: SOPS has been removed from the architecture
PushSecrets: all multi-region secret distribution uses the PushSecrets pattern
K8s Secrets as source of truth: Apps read from K8s Secrets only
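
As an illustration of the PushSecrets pattern, the sketch below pushes one key of a Kubernetes Secret to an external store. The store name `openbao`, the Secret name, and the remote key are all hypothetical; OpenBao is assumed to be reachable through an existing ClusterSecretStore.

```yaml
# Sketch: the K8s Secret stays the source of truth and ESO pushes it outward.
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: <component>-push
  namespace: <namespace>
spec:
  refreshInterval: 1h
  secretStoreRefs:
    - name: openbao                  # hypothetical store name
      kind: ClusterSecretStore
  selector:
    secret:
      name: <component>-credentials  # hypothetical source Secret
  data:
    - match:
        secretKey: password          # key inside the K8s Secret
        remoteRef:
          remoteKey: <org>/<component>
          property: password
```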

Release Management

Release Flow

Feature Branch → PR → main → Flux Sync → Staging → Promote → Production
                                            ↓
                                      Canary Analysis (Flagger)
                                            ↓
                                      Auto-Rollback (on failure)
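
The canary analysis and auto-rollback steps in the flow above are handled by Flagger. A minimal Canary resource could look like the following sketch; the port, step weights, and thresholds are illustrative values, not settings taken from this repository.

```yaml
# Sketch: Flagger shifts traffic in steps and rolls back when checks fail.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: <component>
  namespace: <namespace>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <component>
  service:
    port: 80                         # assumed service port
  analysis:
    interval: 1m                     # how often metrics are evaluated
    threshold: 5                     # failed checks before rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate   # Flagger built-in metric
        thresholdRange:
          min: 99
        interval: 1m
```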

Environment Strategy

Environment	Sync Interval
Staging	1m
Production	10m

Rollback Strategy

Trigger	Action
Flagger metric failure	Auto-rollback
Manual intervention	Git revert + sync
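Emergency	`kubectl rollout undo` (suspend the owning Kustomization first, otherwise Flux re-applies the Git state on the next reconcile)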

Bootstrap

# Initial cluster bootstrap (Gitea)
flux bootstrap git \
  --url=https://gitea.<location-code>.<sovereign-domain>/openova/flux \
  --branch=main \
  --path=clusters/<region> \
  --token-auth

Key Commands

# Status overview
flux get all

# Force reconciliation
flux reconcile kustomization organizations

# View logs
flux logs --all-namespaces

# Suspend (manual gate)
flux suspend kustomization <org>-prod

Sync Intervals

Resource	Interval
GitRepository	1 minute
Kustomization	10 minutes
HelmRelease	10 minutes
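
The Configuration section shows a GitRepository and a Kustomization but no HelmRelease. A sketch using the 10-minute interval from the table is given below. It assumes the chart lives at `./chart` inside the component's Git repository and that the cluster serves the `helm.toolkit.fluxcd.io/v2` API; older Flux releases use a beta version instead. The retry counts are illustrative.

```yaml
# Sketch: HelmRelease sourcing its chart from the component's GitRepository.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: <component>
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: <namespace>
  chart:
    spec:
      chart: ./chart                 # assumed chart path within the repo
      sourceRef:
        kind: GitRepository
        name: <component>
  install:
    remediation:
      retries: 3                     # illustrative value
  upgrade:
    remediation:
      retries: 3                     # illustrative value
```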

Gitea Actions Integration

Gitea Actions can trigger Flux reconciliation:

# .gitea/workflows/notify-flux.yaml
name: Notify Flux
on:
  push:
    branches: [main]

jobs:
  notify:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Flux reconciliation
        run: |
          # --fail makes the step fail on HTTP 4xx/5xx responses
          curl --fail --silent --show-error -X POST \
            -H "Authorization: Bearer ${{ secrets.FLUX_WEBHOOK_TOKEN }}" \
            https://flux-webhook.<location-code>.<sovereign-domain>/hook/...
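
On the cluster side, a webhook endpoint like this is typically backed by a notification-controller Receiver. The sketch below assumes a `generic` receiver and a hypothetical Secret named `flux-webhook-token`; the README does not state which receiver type or Secret the deployment actually uses.

```yaml
# Sketch: an incoming POST triggers reconciliation of the listed resources.
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: <component>
  namespace: flux-system
spec:
  type: generic
  secretRef:
    name: flux-webhook-token         # hypothetical Secret with a `token` key
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: <component>
```

Once the Receiver is ready, the controller publishes the generated hook path in `.status.webhookPath`.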

Part of OpenOva