Compare commits

...

2 Commits

Author SHA1 Message Date
e3mrah
8d8ce40045
fix(build-organization-controller): add missing auto-bump pipeline + pkg/** path filter + wire-level test (Refs #1997) (#2005)
Some checks are pending
Build organization-controller / build (push) Waiting to run
Build sandbox-controller / build (push) Waiting to run
Build sandbox-mcp-server / build (push) Waiting to run
Build catalyst-catalog / test (push) Waiting to run
Build catalyst-catalog / build (push) Blocked by required conditions
Build catalyst-catalog / notify (push) Blocked by required conditions
Vendor-coupling guardrail / Vendor-coupling guardrail (push) Waiting to run
Cluster bootstrap-kit drift guardrail / Detect bootstrap-kit drift (push) Waiting to run
Playwright UI smoke (Group L — / Playwright UI smoke (push) Waiting to run
Test — Bootstrap Kit (kind cluster + Flux) / dependency-graph-audit (push) Waiting to run
Test — Bootstrap Kit (kind cluster + Flux) / pin-sync-audit (push) Waiting to run
Test — Bootstrap Kit (kind cluster + Flux) / manifest-validation (push) Blocked by required conditions
Test — Bootstrap Kit (kind cluster + Flux) / kind-reconciliation (push) Blocked by required conditions
Followup hardening for #1997 (PR #2004 catch-up bumped the
organization-controller chart pin to c9b58ea). PR #2004 unblocks t38
right now, but the underlying cause — `build-organization-controller.yaml`
has no auto-bump step and its path filter misses `core/controllers/pkg/**`
— is still live and will re-strand the next gitea-client fix the
moment it lands. This PR closes both gaps so the bug cannot recur.

Two surgical additions:

  1. `.github/workflows/build-organization-controller.yaml`
     a. Promote `permissions.contents: read` → `write` (+ `actions:
        write`), mirroring `build-application-controller.yaml`.
     b. Add `Bump controllers.organization.image.tag in values.yaml`
        step (awk-scoped to the `organization:` block only — cannot
        accidentally bump a sibling controller's tag).
     c. Add `Commit and push values.yaml bump` step (rebase-safe,
        skip-if-no-change).
     d. Add `Dispatch blueprint-release for chart re-publish` step
        — anti-recursion bypass for the GH-Actions rule that bot
        pushes don't fire downstream workflows. Without this the
        rebuilt image is NEVER baked into a new chart version.
     e. Add `core/controllers/pkg/**` to push + pull_request path
        filters. The shared HTTP-client tree (gitea, keycloak,
        kc-mappers, …) is COPYed into every Group C controller's
        image via the Containerfile, so a change to it MUST rebuild.
        PR #1910 only triggered a rebuild because it happened to
        also touch `organization_controller_test.go`; a pure pkg/
        fix would silently skip the workflow.

  2. `core/controllers/pkg/gitea/client_test.go`
     New `TestCreateOrg_HitsOrgsEndpointWithAuth` — wire-level
     regression guard that:
     - Fails hard if the client EVER hits `/api/v1/admin/orgs` (would
       catch a refactor accident that re-introduces the Gitea 1.22+
       405 bug regardless of which chart pin is deployed).
     - Asserts the request is `POST /api/v1/orgs` exactly once.
     - Asserts the request carries `Authorization: token <hex>` with
       the exact expected value (defense-in-depth: even if the URL
       is right, Gitea 1.22+ still returns 405 without the token).

Sibling controllers (environment, blueprint, useraccess, …) likely
have the same missing-auto-bump + missing-pkg/** path filter. NOT
fixing them in this PR — blast-radius discipline. Follow-up
recommended: audit every `build-*-controller.yaml` for both gaps.

Validation:
  • go vet ./pkg/gitea/... — clean
  • go test -race ./pkg/gitea/... — ok, all pre-existing + new tests pass
  • go test -run TestCreateOrg_HitsOrgsEndpointWithAuth -v — PASS

Refs #1997 (PR #2004 closed the immediate symptom; this PR closes
the deploy gap so #1997 cannot recur)
Refs #1910 (the original /admin/orgs → /orgs code fix)
Refs #1829 (D29 customer journey hardening)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:29:59 +04:00
e3mrah
1ca37ea7f8
fix(bp-valkey): default auth.enabled=false to match bp-newapi passwordless REDIS_CONN_STRING (Closes #2003) (#2007)
Pre-1.0.2 bp-valkey shipped `valkey.auth.enabled: true` (bitnami default)
while bp-newapi's REDIS_CONN_STRING default was the passwordless URL
`redis://valkey-primary.valkey.svc.cluster.local:6379`. On every
freshly-franchised Sovereign the newapi Pod CrashLoopBackOff'd 45x on
the Redis ping probe with `NOAUTH Authentication required` — caught
on t38 sandbox walk 2026-05-20. This is the Pillar-4 verifier-killing
bug for the Sandbox + qwen-code + MCP end-user DoD (#1986).

Approach A (simpler, this PR): flip bp-valkey's default to
`auth.enabled: false` so the upstream bitnami chart exports
`ALLOW_EMPTY_PASSWORD=yes` to the Valkey container. Verified via
`helm template` — the render now contains:

    - name: ALLOW_EMPTY_PASSWORD
      value: "yes"

Other in-cluster consumers tolerate the change:
  - products/catalyst sme-services (auth.yaml + gateway.yaml) read
    VALKEY_PASSWORD via `secretKeyRef ... optional: true` and fall
    back to the no-auth connect path in
    core/services/shared/db/valkey.go when the value is empty.
  - products/catalyst projector wraps the password Secret mount in
    `{{- with .Values.services.projector.valkey.passwordSecret }}`
    so an absent Secret simply skips the password env var.

Approach B (deferred): make bp-newapi mirror the bp-valkey
auto-generated password Secret into the newapi namespace and template
it into REDIS_CONN_STRING. Larger scope, tracked under #2003 follow-up.

Changes:
  - platform/valkey/chart/values.yaml — auth.enabled: true → false
  - platform/valkey/chart/Chart.yaml — version 1.0.1 → 1.0.2
  - platform/valkey/blueprint.yaml — spec.version + configSchema default
  - clusters/_template/bootstrap-kit/17-valkey.yaml — chart pin 1.0.1 → 1.0.2

Verified:
  - `helm dependency build` succeeds (bitnami/valkey 5.5.1 unchanged)
  - `helm template` renders `ALLOW_EMPTY_PASSWORD=yes` on the Pod
  - tests/observability-toggle.sh — all 4 cases PASS

Closes #2003
Refs #1986

Co-authored-by: hatiyildiz <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:26:56 +04:00
6 changed files with 234 additions and 11 deletions

View File

@ -20,6 +20,15 @@ on:
paths:
- 'core/controllers/organization/**'
- 'core/controllers/internal/**'
# core/controllers/pkg/** is the shared HTTP-client tree (gitea,
# keycloak, kc-mappers, …) consumed by every Group C controller's
# Containerfile via `COPY core/controllers/pkg`. Without this path
# entry a change like PR #1910 (gitea-client /admin/orgs → /orgs)
# rebuilds the image only if the same PR also happens to touch
# files under organization/ — which silently held the t38 #1997
# gitea-405 fix in main for ~12h. Mirror in every sibling
# build-*-controller.yaml.
- 'core/controllers/pkg/**'
- 'core/controllers/go.mod'
- 'core/controllers/go.sum'
- '.github/workflows/build-organization-controller.yaml'
@ -28,6 +37,7 @@ on:
paths:
- 'core/controllers/organization/**'
- 'core/controllers/internal/**'
- 'core/controllers/pkg/**'
- 'core/controllers/go.mod'
- 'core/controllers/go.sum'
- '.github/workflows/build-organization-controller.yaml'
@ -41,9 +51,23 @@ jobs:
build:
runs-on: ubuntu-latest
permissions:
contents: read
# contents: write — the deploy job below pushes a values.yaml SHA
# bump back to main so the bp-catalyst-platform chart picks up the
# newly-built image without an operator manually editing the file
# (per `feedback_no_mvp_no_workarounds.md` rule 1: target-state,
# never "manual follow-up bump"). Pre-#1997 this workflow shipped
# WITHOUT this auto-bump, so PR #1910's gitea-client /admin/orgs
# → /orgs fix sat in main for ~12h while the chart pin stayed
# frozen at 72e3f08, leaving t38's organization-controller
# looping HTTP 405 and blocking D29 end-to-end. Mirrors the
# shape of build-application-controller.yaml.
contents: write
packages: write
# id-token write is required by cosign keyless signing (Sigstore).
id-token: write
# actions: write — required for `gh workflow run` to dispatch the
# downstream blueprint-release chart re-publish workflow.
actions: write
outputs:
sha_short: ${{ steps.vars.outputs.sha_short }}
digest: ${{ steps.build.outputs.digest }}
@ -124,3 +148,67 @@ jobs:
--predicate <(echo '{"sbom":"in-toto-spdx attached at build time"}') \
--type spdx \
"${IMAGE}@${DIGEST}"
# Auto-bump the chart values.yaml tag so the next Sovereign chart
# rollout picks up this image without a manual edit. Per
# `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, no
# operator-action gates) and `feedback_inviolable_principles.md`
# (event-driven, never cron). Mirrors the pattern in
# build-application-controller.yaml. Added as part of #1997 —
# without this step, PR #1910's gitea-client /admin/orgs → /orgs
# fix sat frozen in main while t38 organization-controller looped
# HTTP 405 on every Organization reconcile.
- name: Bump controllers.organization.image.tag in values.yaml
if: github.event_name != 'pull_request' && github.ref == 'refs/heads/main'
env:
SHA_SHORT: ${{ steps.vars.outputs.sha_short }}
run: |
VALUES="products/catalyst/chart/values.yaml"
# awk: find ` organization:` under `controllers:`, then update
# the next `tag: "..."` line. Stops at the next top-level key
# so we don't accidentally bump a sibling controller's tag.
awk -v sha="${SHA_SHORT}" '
/^controllers:/ { in_ctrls=1 }
in_ctrls && /^ organization:/ { print; in_org=1; next }
in_ctrls && /^ [a-z]/ && !/^ organization:/ { in_org=0 }
in_org && /^ tag:/ { sub(/"[^"]*"/, "\"" sha "\""); in_org=0 }
{ print }
' "${VALUES}" > "${VALUES}.tmp" && mv "${VALUES}.tmp" "${VALUES}"
echo "values.yaml after bump:"
grep -A4 "^ organization:" "${VALUES}" | head -10
- name: Commit and push values.yaml bump
id: deploy_commit
if: github.event_name != 'pull_request' && github.ref == 'refs/heads/main'
env:
SHA_SHORT: ${{ steps.vars.outputs.sha_short }}
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
if git diff --quiet products/catalyst/chart/values.yaml; then
echo "no values.yaml change — already pinned to ${SHA_SHORT}"
echo "pushed=false" >> "$GITHUB_OUTPUT"
exit 0
fi
git add products/catalyst/chart/values.yaml
git commit -m "deploy: bump organization-controller image to ${SHA_SHORT}"
# Pull-rebase to avoid races with parallel build commits.
git pull --rebase --autostash origin main || true
git push origin HEAD:main
echo "pushed=true" >> "$GITHUB_OUTPUT"
# GitHub Actions does NOT trigger workflows from bot pushes by
# default (anti-recursion safeguard). Without this dispatch the
# rebuilt image is NEVER baked into a new chart version, so
# Sovereigns keep installing the previous chart with the previous
# image tag (`feedback_no_mvp_no_workarounds.md` rule 1 violation).
- name: Dispatch blueprint-release for chart re-publish
if: github.event_name != 'pull_request' && github.ref == 'refs/heads/main' && steps.deploy_commit.outputs.pushed == 'true'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
gh workflow run blueprint-release.yaml \
--repo "${GITHUB_REPOSITORY}" \
--ref main \
-f blueprint=catalyst \
-f tree=products

View File

@ -48,7 +48,15 @@ spec:
chart:
spec:
chart: bp-valkey
version: 1.0.1
# 1.0.2 (TBD-V12 #2003, 2026-05-20): default
# `valkey.auth.enabled` flips to `false` so bp-newapi's
# passwordless REDIS_CONN_STRING default stops triggering
# `NOAUTH Authentication required` on every freshly
# franchised Sovereign (45× CrashLoopBackOff on t38 sandbox
# newapi, blocking Pillar 4 / qwen-code / MCP). See
# platform/valkey/chart/values.yaml `auth` block for the
# consumer-tolerance + follow-up plan.
version: 1.0.2
sourceRef:
kind: HelmRepository
name: bp-valkey

View File

@ -1197,3 +1197,80 @@ func TestCreatePullRequest_409ReFetchesExisting(t *testing.T) {
t.Errorf("re-fetched PR head/base wrong: %+v", pr)
}
}
// TestCreateOrg_HitsOrgsEndpointWithAuth — explicit regression test for
// issue #1997 (TBD-A68 followup of PR #1910 / issue #1906). On t38 the
// organization-controller looped on
//
// gitea.EnsureOrg: create: gitea: POST http://gitea.../api/v1/admin/orgs: HTTP 405
//
// even after PR #1910 fixed the gitea client source — because the
// chart's controllers.organization.image.tag was frozen at 72e3f08
// (no auto-bump step in build-organization-controller.yaml) so the
// running Pod predated the fix. This test ASSERTS the canonical wire-
// level invariants so the bug cannot silently regress regardless of
// the deploy pipeline state:
//
// 1. CreateOrg POSTs `/api/v1/orgs` exactly once (never the legacy
// `/api/v1/admin/orgs` which returns 405 on Gitea 1.22+).
// 2. The request carries `Authorization: token <hex>` — Gitea's
// canonical admin-token auth scheme. Without this header, even the
// correct endpoint returns 405 (Gitea's router treats the
// unauthenticated POST as "method not allowed for anonymous
// visitors").
//
// Coverage rationale: the existing TestEnsureOrg_CreatesWhenMissing
// covers the happy path through fakeGitea which already rejects empty
// auth via its 401 stub (client_test.go:66-69). This standalone test
// additionally pins the exact endpoint string + the exact Authorization
// header VALUE so a refactor cannot accidentally switch the URL or
// drop the token prefix.
func TestCreateOrg_HitsOrgsEndpointWithAuth(t *testing.T) {
t.Parallel()
var (
gotPath string
gotAuth string
hits int
)
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Regression guard: any request to the legacy admin route is a
// hard test failure. Gitea 1.22+ returns 405 on this path which
// is exactly the symptom of #1997 in the wild.
if strings.HasPrefix(r.URL.Path, "/api/v1/admin/orgs") {
t.Errorf("client used legacy /api/v1/admin/orgs route — must POST /api/v1/orgs (Gitea 1.22+ returns 405 on admin/orgs)")
http.Error(w, "405 admin/orgs is the bug", http.StatusMethodNotAllowed)
return
}
if r.Method != http.MethodPost || r.URL.Path != "/api/v1/orgs" {
http.Error(w, "unhandled "+r.Method+" "+r.URL.Path, http.StatusInternalServerError)
return
}
hits++
gotPath = r.URL.Path
gotAuth = r.Header.Get("Authorization")
_ = json.NewEncoder(w).Encode(Org{ID: 42, Username: "acme"})
}))
defer srv.Close()
c := New(srv.URL, "deadbeefcafef00d")
c.HTTP = srv.Client()
out, err := c.CreateOrg(context.Background(), "acme", "ACME", "desc", "private")
if err != nil {
t.Fatalf("CreateOrg: %v", err)
}
if out.ID != 42 || out.Username != "acme" {
t.Errorf("CreateOrg returned unexpected Org: %+v", out)
}
// Wire-level assertions: exact endpoint, exact auth scheme.
if hits != 1 {
t.Errorf("expected 1 POST hit, got %d", hits)
}
if gotPath != "/api/v1/orgs" {
t.Errorf("endpoint: got %q, want %q", gotPath, "/api/v1/orgs")
}
if want := "token deadbeefcafef00d"; gotAuth != want {
t.Errorf("Authorization header: got %q, want %q", gotAuth, want)
}
}

View File

@ -5,7 +5,7 @@ metadata:
labels:
catalyst.openova.io/section: pts-4-1-data-services
spec:
version: 1.0.1
version: 1.0.2
card:
title: Valkey
summary: |
@ -35,8 +35,14 @@ spec:
properties:
enabled:
type: boolean
default: true
description: Enforce password auth on the Valkey wire protocol.
default: false
description: |
Enforce password auth on the Valkey wire protocol.
Default false (TBD-V12 #2003) — matches bp-newapi's
passwordless REDIS_CONN_STRING contract; flipping true
requires every consumer chart (bp-newapi, catalyst
sme-services, projector) to wire the bp-valkey-
generated password into their connection strings.
metrics:
type: object
properties:

View File

@ -1,6 +1,19 @@
apiVersion: v2
name: bp-valkey
version: 1.0.1
# 1.0.2 (TBD-V12 #2003, 2026-05-20): default valkey.auth.enabled flips
# from true → false to match bp-newapi's passwordless
# REDIS_CONN_STRING default. Pre-1.0.2 every freshly-franchised
# Sovereign hit `NOAUTH Authentication required` on the newapi
# Redis ping probe — 45× CrashLoopBackOff on t38 — blocking every
# Sandbox session (Pillar 4 qwen-code + MCP) and acting as the
# verifier-killing bug for the end-user DoD (#1986). Consumers that
# DO supply a password (catalyst sme-services, projector) tolerate
# the change via `secretKeyRef optional: true` and `{{- with }}`
# gates that fall back to the no-auth connect path. See
# values.yaml under `valkey.auth` for the full rationale +
# follow-up to re-enable auth once bp-newapi mirrors the
# bp-valkey-generated password.
version: 1.0.2
description: |
Catalyst-curated Blueprint umbrella chart for Valkey (Redis-compatible
cache, BSD-3). Depends on the upstream `valkey` chart (Bitnami) as a

View File

@ -62,12 +62,43 @@ valkey:
enabled: true
size: 1Gi
# Password auth ON by default — Catalyst convention. The bitnami chart
# auto-generates a random password and exposes it via the
# `<release-name>-valkey` Secret. Cluster overlays MAY override the
# password via `auth.existingSecret` to hand-roll secret rotation.
# Password auth OFF by default (TBD-V12 #2003, 2026-05-20).
#
# ROOT CAUSE (caught on t38 walk 2026-05-20):
# The bp-newapi chart's REDIS_CONN_STRING default is the
# passwordless URL `redis://valkey-primary.valkey.svc.cluster.local:
# 6379` (platform/newapi/chart/values.yaml `valkey.url`). On every
# freshly franchised Sovereign the newapi Pod CrashLoopBackOff'd 45×
# on the Redis ping probe with `NOAUTH Authentication required`,
# blocking every Sandbox session (qwen-code + MCP) — the
# Pillar-4 verifier-killing bug.
#
# CONTRACT MATCH:
# The auth contract is set by the CONSUMER (bp-newapi). bp-newapi
# ships a passwordless connection string by default; the only way
# to make that work without churning every consumer chart is to
# flip bp-valkey's default to allow empty password. Other consumers
# that DO supply a password tolerate the change:
# - products/catalyst sme-services (auth.yaml + gateway.yaml)
# read VALKEY_PASSWORD via secretKeyRef `optional: true` and
# fall back to the no-auth connect path in
# core/services/shared/db/valkey.go when the value is empty.
# - products/catalyst projector wraps the password Secret mount
# in `{{- with .Values.services.projector.valkey.passwordSecret
# }}` so an absent Secret just disables the password env var.
#
# FOLLOW-UP (deferred, scope of #1986 Pillar 4):
# Make bp-newapi read the bp-valkey-generated password via Secret
# reflection (mirror of the bp-valkey password Secret into the
# newapi namespace, env-var REDIS_PASSWORD wired into the
# conn-string template). At that point flip this back to
# `auth.enabled: true`. Tracked under issue #2003.
#
# Cluster overlays MAY still re-enable auth by setting
# `valkey.auth.enabled: true` (and supplying matching consumer
# password wiring) on per-Sovereign installs.
auth:
enabled: true
enabled: false
# NetworkPolicy from upstream chart — already enabled by default;
# restate to make the Catalyst posture explicit. (Default-deny is