openova/platform/grafana
e3mrah 0a45a790e7
fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909)
PR #1888 (TBD-A30) fixed catalyst-system HTTPRoutes for multi-zone
Sovereigns whose Cilium Gateway renames HTTPS listeners from `https` to
`https-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)
when more than one parent zone is enabled. Every public HTTPRoute pinned
to `sectionName: https` got `Accepted=False NoMatchingListener` and the
hosted service 404'd / connection-refused.

That fix only touched products/catalyst/chart. Per-blueprint HTTPRoutes
shipped the same `sectionName: https` default in values.yaml, so on a
multi-zone Sovereign every blueprint route — gitea, grafana, harbor,
keycloak, newapi, openbao, powerdns, stalwart-tenant — silently failed
to attach. TBD-A40 / issue #1902.

Sweep verbatim:

  $ git grep -nE 'sectionName:[[:space:]]+(https|"https")[[:space:]]*$' \
      platform/*/chart/ products/ clusters/ core/ 2>/dev/null \
      | grep -v 'platform/gateway-api/chart/templates'
  platform/gitea/chart/values.yaml:168:    sectionName: https
  platform/grafana/chart/values.yaml:124:    sectionName: https
  platform/harbor/chart/values.yaml:437:    sectionName: https
  platform/keycloak/chart/values.yaml:482:    sectionName: https
  platform/newapi/chart/values.yaml:721:      sectionName: https
  platform/openbao/chart/values.yaml:72:    sectionName: https
  platform/powerdns/chart/values.yaml:407:      sectionName: https
  platform/stalwart-tenant/chart/values.yaml:297:      sectionName: https
  products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go:802:        sectionName: https
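
As a rough illustration of what the sweep pattern does and does not catch, here is a Python translation of the regex. It is a sketch, not part of the commit: `\s` stands in for POSIX `[[:space:]]`, and the function name is hypothetical.

```python
import re

# Python translation of the sweep pattern; \s approximates [[:space:]].
PINNED_HTTPS = re.compile(r'sectionName:\s+(https|"https")\s*$')


def is_pinned_to_https(line: str) -> bool:
    """Return True when a single line pins sectionName to the bare `https` listener."""
    return PINNED_HTTPS.search(line) is not None


if __name__ == "__main__":
    for line in (
        "    sectionName: https",
        '    sectionName: "https"',
        "    sectionName: https-omani-works",
        '    sectionName: ""',
    ):
        print(f"{line!r:45} pinned={is_pinned_to_https(line)}")
```

Only the first two sample lines match: the pattern is anchored at end of line, so zone-specific listener names and the empty default are not flagged.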

Fix (Option C — omit sectionName, same as PR #1888):

  - 8 blueprint values.yaml defaults flipped from `sectionName: https` to
    `sectionName: ""`. The chart templates already guard with `{{- with
    .Values.gateway.parentRef.sectionName }}`, so a blank value drops the
    field entirely and Cilium Gateway matches by hostname filter.

  - platform/newapi/chart/templates/httproute.yaml was the outlier: it
    used `default "https" $parent.sectionName` which fell back to `https`
    even when values.yaml said empty. Rewritten to `{{- with
    $parent.sectionName }}` so empty drops the field — same pattern as
    the other 7 blueprints.

  - products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
    renders a per-tenant bp-keycloak HelmRelease and injected
    `sectionName: https` into spec.values. Flipped to `sectionName: ""`
    so the bp-keycloak chart's `{{- with }}` guard drops the field.
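
The difference between the two template idioms described above can be modelled in a few lines of Python. This is only a behavioural sketch of `with` versus `default`, not the chart code; the gateway name `cilium-gateway` and both function names are hypothetical.

```python
def render_with_guard(section_name: str) -> dict:
    """Model of `{{- with $parent.sectionName }}`: emit the field only when non-empty."""
    parent_ref = {"name": "cilium-gateway"}  # hypothetical gateway name
    if section_name:
        parent_ref["sectionName"] = section_name
    return parent_ref


def render_with_default(section_name: str) -> dict:
    """Model of `default "https" $parent.sectionName`: empty falls back to "https"."""
    return {"name": "cilium-gateway", "sectionName": section_name or "https"}


if __name__ == "__main__":
    print(render_with_guard(""))                   # field dropped
    print(render_with_default(""))                 # field pinned to "https"
    print(render_with_guard("https-omani-works"))  # explicit override kept
```

With an empty value, the guard form leaves `sectionName` out of the parentRef, while the `default` form still pins the route to `https`, which is the behaviour the newapi template had before this change.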

Validation (real `helm template`, default values, gateway enabled, no
sectionName override) — Principle #15:

  gitea            : sectionName lines in rendered output = 0
  grafana          : sectionName lines in rendered output = 0
  harbor           : sectionName lines in rendered output = 0
  keycloak         : sectionName lines in rendered output = 0
  openbao          : sectionName lines in rendered output = 0
  powerdns         : sectionName lines in rendered output = 0
  newapi           : sectionName lines in rendered output = 0
  stalwart-tenant  : sectionName lines in rendered output = 0

Override path preserved — `--set ...parentRef.sectionName=https-omani-works`
on each chart renders `sectionName: "https-omani-works"` correctly,
so operators on single-zone clusters or non-Cilium gateways can still
pin explicitly via bootstrap-kit overlay.

helm lint clean on all 8 blueprint charts (newapi cnpg-cluster.yaml lint
error is pre-existing on origin/main, unrelated to this fix).

Chart bumps (each blueprint also bumps blueprint.yaml spec.version per
#817 lockstep):
  bp-gitea            1.2.7  -> 1.2.8
  bp-grafana          1.0.1  -> 1.0.2
  bp-harbor           1.2.17 -> 1.2.18
  bp-keycloak         1.4.5  -> 1.4.6
  bp-newapi           1.4.22 -> 1.4.23
  bp-openbao          1.2.16 -> 1.2.17
  bp-powerdns         1.2.3  -> 1.2.4
  bp-stalwart-tenant  0.1.2  -> 0.1.3

Refs TBD-A40.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:57:12 +04:00

Grafana Stack

LGTM observability stack (Loki, Grafana, Tempo, Mimir + Alloy collector). Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3 / observability layer in §2.3) — runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster; Application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.

Status: Accepted | Updated: 2026-04-27


Overview

The Grafana Stack provides unified observability with:

  • Loki - Log aggregation
  • Grafana - Visualization
  • Tempo - Distributed tracing
  • Mimir - Metrics storage
  • Alloy - Telemetry collection

Architecture

flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end

    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end

    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end

    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local PV]
        Warm[Warm: SeaweedFS]
        Cold[Cold: Cloudflare R2]
    end

    subgraph UI["Visualization"]
        Grafana[Grafana]
    end

    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir

Components

| Component | Purpose | Memory |
|-----------|---------|--------|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
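
For capacity planning, the per-component figures can be summed into a rough baseline. The sketch below simply adds the values from the table; whether they are requests or limits is not stated here, so treat the total as indicative. The names are illustrative.

```python
# Memory figures (MB) copied from the Components table.
COMPONENT_MEMORY_MB = {
    "Grafana Alloy": 256,
    "Loki": 512,
    "Tempo": 256,
    "Mimir": 512,
    "Grafana": 256,
}


def total_memory_mb(components: dict = COMPONENT_MEMORY_MB) -> int:
    """Sum the per-component memory figures."""
    return sum(components.values())


if __name__ == "__main__":
    print(f"Baseline memory per host cluster: {total_memory_mb()} MB")  # 1792 MB
```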

Tiered Storage

flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end

    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end

    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end

    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2

| Tier | Duration | Storage |
|------|----------|---------|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30d-1 year | Cloudflare R2 |
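
The tier boundaries can be expressed as a small lookup. This is a sketch under assumptions: data is taken to move at exactly 7 and 30 days, and anything older than 365 days is labelled `expired`, which the table implies but does not state. The function name is hypothetical.

```python
def tier_for_age(age_days: float) -> str:
    """Map the age of telemetry data (in days) to a storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return "hot"      # Local PV
    if age_days < 30:
        return "warm"     # SeaweedFS
    if age_days <= 365:
        return "cold"     # Cloudflare R2
    return "expired"      # assumption: past the 1-year retention window


if __name__ == "__main__":
    for age in (0, 6.9, 7, 29, 30, 365, 400):
        print(f"{age:>5} days -> {tier_for_age(age)}")
```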

Configuration

Alloy Collector

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
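    otelcol.receiver.otlp "default" {
      grpc { endpoint = "0.0.0.0:4317" }
      http { endpoint = "0.0.0.0:4318" }

      // Route received logs and traces to the exporters below.
      output {
        logs   = [otelcol.exporter.loki.default.input]
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }

    otelcol.exporter.loki "default" {
      forward_to = [loki.write.default.receiver]
    }

    // Assumed Loki push URL; adjust to the deployed Service.
    loki.write "default" {
      endpoint { url = "http://loki.monitoring.svc:3100/loki/api/v1/push" }
    }

    otelcol.exporter.otlp "tempo" {
      client { endpoint = "tempo.monitoring.svc:4317" }
    }

    // Discover pods so prometheus.scrape has targets.
    discovery.kubernetes "pods" {
      role = "pod"
    }

    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.mimir.receiver]
    }

    // Assumed Mimir push URL; adjust to the deployed Service.
    prometheus.remote_write "mimir" {
      endpoint { url = "http://mimir.monitoring.svc:8080/api/v1/push" }
    }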

Loki with S3 Backend

loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13

  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}

OpenTelemetry Integration

Applications send telemetry via OTLP:

# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
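
For applications that are not auto-instrumented, a span can also be posted to the collector by hand over OTLP/HTTP. The sketch below only builds the request, using the standard library. It assumes the `alloy.monitoring.svc` Service also exposes the HTTP receiver port 4318 from the Alloy config; the function name, scope name, and sample values are hypothetical.

```python
import json
import secrets
import time
import urllib.request

# Assumption: the Alloy Service exposes the OTLP/HTTP receiver on 4318.
ALLOY_OTLP_HTTP = "http://alloy.monitoring.svc:4318"


def build_trace_request(service_name: str, span_name: str,
                        duration_ms: int = 5) -> urllib.request.Request:
    """Build an OTLP/HTTP JSON request carrying a single span."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,                         # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }
    return urllib.request.Request(
        f"{ALLOY_OTLP_HTTP}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_trace_request("checkout", "GET /cart")
    print(req.get_method(), req.full_url)
```

From a pod inside the cluster, `urllib.request.urlopen(req, timeout=5)` would send the span; in practice an OpenTelemetry SDK is the more usual route.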

Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |

Alerting

Alerts flow through Alertmanager to Gitea Actions:

flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
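
On the receiving side of the webhook, the first step is usually to pick the firing alerts out of the Alertmanager payload. The sketch below shows only that parsing step; how the result is handed to a Gitea Actions workflow is not described here, and the function name and sample alert are hypothetical.

```python
import json


def firing_alerts(webhook_body: str) -> list[dict]:
    """Extract the firing alerts from an Alertmanager webhook payload."""
    payload = json.loads(webhook_body)
    return [
        {
            "alertname": alert.get("labels", {}).get("alertname", "unknown"),
            "namespace": alert.get("labels", {}).get("namespace", ""),
            "summary": alert.get("annotations", {}).get("summary", ""),
        }
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]


# Hypothetical payload in the shape Alertmanager sends to webhook receivers.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "PodCrashLooping", "namespace": "demo"},
         "annotations": {"summary": "Pod is restarting repeatedly"}},
        {"status": "resolved",
         "labels": {"alertname": "HighLatency", "namespace": "demo"},
         "annotations": {}},
    ],
})

if __name__ == "__main__":
    for alert in firing_alerts(SAMPLE_BODY):
        print(alert)
```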

Part of OpenOva