Grafana Stack
LGTM observability stack (Loki, Grafana, Tempo, Mimir + Alloy collector). Per-host-cluster infrastructure (see docs/PLATFORM-TECH-STACK.md §3 / observability layer in §2.3) — runs on every host cluster a Sovereign owns. Catalyst's own self-monitoring uses this stack on the management cluster; Application telemetry from per-Org vclusters also flows here unless an Org installs its own observability stack.
Status: Accepted | Updated: 2026-04-27
Overview
The Grafana Stack provides unified observability with:
- Loki - Log aggregation
- Grafana - Visualization
- Tempo - Distributed tracing
- Mimir - Metrics storage
- Alloy - Telemetry collection
Architecture
```mermaid
flowchart TB
    subgraph Apps["Applications"]
        App1[App 1]
        App2[App 2]
        OTel[OTel SDK]
    end

    subgraph Alloy["Grafana Alloy"]
        Collector[Telemetry Collector]
    end

    subgraph Storage["Storage Layer"]
        Loki[Loki<br/>Logs]
        Tempo[Tempo<br/>Traces]
        Mimir[Mimir<br/>Metrics]
    end

    subgraph Tier["Tiered Storage"]
        Hot[Hot: Local PV]
        Warm[Warm: SeaweedFS]
        Cold[Cold: Cloudflare R2]
    end

    subgraph UI["Visualization"]
        Grafana[Grafana]
    end

    App1 --> Collector
    App2 --> Collector
    OTel --> Collector
    Collector --> Loki
    Collector --> Tempo
    Collector --> Mimir
    Loki --> Hot
    Hot --> Warm
    Warm --> Cold
    Grafana --> Loki
    Grafana --> Tempo
    Grafana --> Mimir
```
Components
| Component | Purpose | Memory |
|---|---|---|
| Grafana Alloy | Telemetry collection (OTLP, Prometheus) | 256MB |
| Loki | Log aggregation | 512MB |
| Tempo | Distributed tracing | 256MB |
| Mimir | Metrics storage | 512MB |
| Grafana | Visualization | 256MB |
Tiered Storage
```mermaid
flowchart LR
    subgraph Hot["Hot (7 days)"]
        Local[Local PV]
    end

    subgraph Warm["Warm (30 days)"]
        SeaweedFS[SeaweedFS]
    end

    subgraph Cold["Cold (1 year)"]
        R2[Cloudflare R2]
    end

    Local -->|"After 7d"| SeaweedFS
    SeaweedFS -->|"After 30d"| R2
```
| Tier | Duration | Storage |
|---|---|---|
| Hot | 0-7 days | Local PV |
| Warm | 7-30 days | SeaweedFS |
| Cold | 30d-1 year | Cloudflare R2 |
Configuration
Alloy Collector
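```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-config
  namespace: monitoring
data:
  config.alloy: |
    otelcol.receiver.otlp "default" {
      grpc { endpoint = "0.0.0.0:4317" }
      http { endpoint = "0.0.0.0:4318" }
      // Route received logs and traces to the exporters below.
      output {
        logs   = [otelcol.exporter.loki.default.input]
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }
    otelcol.exporter.loki "default" {
      forward_to = [loki.write.default.receiver]
    }
    // Assumed Loki push URL; adjust to the actual Service.
    loki.write "default" {
      endpoint { url = "http://loki.monitoring.svc:3100/loki/api/v1/push" }
    }
    otelcol.exporter.otlp "tempo" {
      client { endpoint = "tempo.monitoring.svc:4317" }
    }
    // Discover pods so prometheus.scrape has targets to reference.
    discovery.kubernetes "pods" {
      role = "pod"
    }
    prometheus.scrape "pods" {
      targets    = discovery.kubernetes.pods.targets
      forward_to = [prometheus.remote_write.mimir.receiver]
    }
    // Assumed Mimir push URL; adjust to the actual Service.
    prometheus.remote_write "mimir" {
      endpoint { url = "http://mimir.monitoring.svc:8080/api/v1/push" }
    }
```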
Loki with S3 Backend
```yaml
loki:
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
  storage:
    type: s3
    s3:
      endpoint: seaweedfs.storage.svc:8333
      bucketnames: loki-data
      access_key_id: ${SEAWEEDFS_ACCESS_KEY}
      secret_access_key: ${SEAWEEDFS_SECRET_KEY}
```
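
The `${SEAWEEDFS_ACCESS_KEY}` and `${SEAWEEDFS_SECRET_KEY}` placeholders are only substituted when Loki runs with `-config.expand-env=true` and both variables are present in the container environment. A minimal sketch of a Secret that could supply them is shown here. The Secret name `loki-seaweedfs-credentials`, the `monitoring` namespace, and the placeholder values are assumptions for illustration, and how the Secret is attached to the Loki pods (for example via `envFrom`) depends on the chart in use.

```yaml
# Hypothetical Secret: the name, namespace, and values are placeholders,
# not taken from this blueprint's chart.
apiVersion: v1
kind: Secret
metadata:
  name: loki-seaweedfs-credentials
  namespace: monitoring
type: Opaque
stringData:
  # Keys match the variable names referenced in the Loki storage config.
  SEAWEEDFS_ACCESS_KEY: <access-key>
  SEAWEEDFS_SECRET_KEY: <secret-key>
```

Keeping the credentials in a Secret rather than inline in the values file means the Loki configuration itself can be committed to Git without exposing the keys.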
OpenTelemetry Integration
Applications send telemetry via OTLP:
```yaml
# OTel auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: <org>
spec:
  exporter:
    endpoint: http://alloy.monitoring.svc:4317
  propagators:
    - tracecontext
    - baggage
```
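
An `Instrumentation` resource only takes effect on workloads that opt in through a pod annotation handled by the OpenTelemetry Operator. The sketch below assumes a Python workload and the operator being installed on the cluster. The Deployment name `example-app`, its labels, and the image placeholder are hypothetical.

```yaml
# Hypothetical workload: name, labels, and image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: <org>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # "true" selects the Instrumentation resource in the pod's own namespace.
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: <your-image>
```

The annotation goes on the pod template, not on the Deployment metadata. Other runtimes use the matching annotation suffix, such as `inject-java` or `inject-nodejs`.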
Dashboards
| Dashboard | Purpose |
|---|---|
| Platform Overview | Request rates, latencies, errors |
| Cilium Network | Traffic flows, policy drops |
| Flux GitOps | Reconciliation status |
| CNPG Postgres | Database performance |
| AI Hub Overview | LLM inference metrics |
| GPU Metrics | Utilization, memory, temperature |
Alerting
Alerts flow through Alertmanager to Gitea Actions:
```mermaid
flowchart LR
    Mimir[Mimir] -->|"Alert Rules"| AM[Alertmanager]
    AM -->|"Webhook"| GA[Gitea Actions]
    GA -->|"Auto-Remediation"| K8s[Kubernetes]
```
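
The webhook hop in the flow above could be expressed in Alertmanager configuration roughly as follows. This is a sketch under stated assumptions: the receiver name `gitea-actions` and the URL are hypothetical, and the source does not describe which endpoint turns the Alertmanager payload into a Gitea Actions run.

```yaml
# Sketch of an Alertmanager configuration; the receiver name and URL are hypothetical.
route:
  receiver: gitea-actions
  # Group related alerts so one notification covers one incident.
  group_by: [alertname, namespace]
receivers:
  - name: gitea-actions
    webhook_configs:
      - url: http://<webhook-endpoint>/alerts
        # Also notify when the alert clears, so remediation can be closed out.
        send_resolved: true
```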
Part of OpenOva