feat(gitops-update): manifest-driven topology + anacleto cluster#212
feat(gitops-update): manifest-driven topology + anacleto cluster#212
Conversation
…to support Introduce config/deployment-matrix.yaml as the single source of truth for which apps deploy to which clusters. The workflow now reads the manifest at the same pinned ref as itself (sparse checkout) and resolves the cluster set from app_name. Adds anacleto as the third deployment target. deploy_in_<cluster> inputs become force-off overrides — they subtract clusters from the manifest-resolved set but cannot add a cluster the manifest does not list. This prevents accidental cross-cluster spillover while still allowing emergency containment. Adds src/lint/deployment-matrix composite (Python embedded, follows the composite-schema pattern) that validates schema, app/cluster integrity, duplicates, and orphan apps. Wired into self-pr-validation as a gated job that only runs when config/deployment-matrix.yaml changes. The manifest topology was inferred empirically from the GitOps repo by cross-referencing folder presence with CI commit history — only apps that are real callers of this workflow are included (excludes apps managed manually like underwriter, jd-mock-api, mock-btg-server, control-plane, platform-console, ledger, dockerhub-secret).
WalkthroughMigrates GitOps deployment from caller-specified server flags to manifest-driven cluster resolution via config/deployment-matrix.yml; adds deploy_in_anacleto and deployment_matrix_file inputs, a deployment-matrix lint action and CI job, and changes workflow steps to resolve clusters and apply force-off suppression. Breaking: callers now require the manifest. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🛡️ CodeQL Analysis ResultsLanguages analyzed: Found 2 issue(s): 2 Medium
🔍 View full scan logs | 🛡️ Security tab |
🔍 Lint Analysis
|
…tution Address PR #212 lint failures: 1. Pin all `actions/checkout@v6` occurrences in self-pr-validation.yml to the SHA already used in gitops-update.yml. Required by pinned-actions lint for external (non-LerianStudio) actions. Also clears pre-existing tech debt in this file that surfaced because the new deployment-matrix job touched it. 2. Replace `echo "$RESOLVED" | sed 's/^/ - /'` with a bash `while read` loop in the resolve_clusters step. Fixes shellcheck SC2001 (prefer bash parameter expansion over sed for simple substitutions).
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/gitops-update-workflow.md (1)
199-209:⚠️ Potential issue | 🟡 MinorFinish the terminology migration from
servertocluster.These sections still document the path placeholders, sync naming, and examples in terms of
<server>/ “selected servers”, even though the workflow now resolves clusters from the deployment matrix. That leaves the docs internally inconsistent right where callers look for path and ArgoCD naming rules.As per coding guidelines,
docs/gitops-update-workflow.mdmust update naming/path conventions to use<cluster>in GitOps paths and ArgoCD app naming (not the old single-server<server>terminology).Also applies to: 240-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/gitops-update-workflow.md` around lines 199 - 209, The docs still use the old placeholder `<server>` and phrases like “selected servers” in the GitOps path and ArgoCD naming examples; update all occurrences in docs/gitops-update-workflow.md to use `<cluster>` and “selected clusters” instead, e.g. change the path template to gitops/environments/<cluster>/helmfile/applications/<env>/<app_name>/values.yaml and adjust any ArgoCD app naming examples to reference cluster (also apply the same replacements in the later section referenced as lines 240-263); ensure wording and examples are consistent with the deployment matrix terminology and keep the `<env>` values unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gitops-update.yml:
- Around line 195-198: The command that sets RESOLVED currently silences errors
with "|| true", masking parse/query failures as an empty RESOLVED; remove the
"|| true" fallback so yq/jq errors propagate and fail the job, i.e. change the
RESOLVED assignment (the pipeline using yq -o=json '.' "$MANIFEST" | jq -r --arg
app "$APP_NAME" '.clusters | to_entries[] | select(.value.apps // [] |
index($app)) | .key' | sort -u) to not append "|| true"; ensure the rest of the
script still treats an actual empty RESOLVED as the legitimate “no matching
clusters” case and let the pipeline fail on real yq/jq errors.
In `@config/deployment-matrix.yaml`:
- Around line 1-109: The file was added with a .yaml extension which violates
the repo convention; rename deployment-matrix.yaml to deployment-matrix.yml and
update all references in this PR (workflow inputs, docs, and any helper scripts)
so the new path is used; ensure the renamed manifest still contains the same
top-level keys (version, apps.registry, clusters) and that any CI/workflow that
reads the file (gitops-update.yml or helper script) is updated to point to
config/deployment-matrix.yml before merging.
---
Outside diff comments:
In `@docs/gitops-update-workflow.md`:
- Around line 199-209: The docs still use the old placeholder `<server>` and
phrases like “selected servers” in the GitOps path and ArgoCD naming examples;
update all occurrences in docs/gitops-update-workflow.md to use `<cluster>` and
“selected clusters” instead, e.g. change the path template to
gitops/environments/<cluster>/helmfile/applications/<env>/<app_name>/values.yaml
and adjust any ArgoCD app naming examples to reference cluster (also apply the
same replacements in the later section referenced as lines 240-263); ensure
wording and examples are consistent with the deployment matrix terminology and
keep the `<env>` values unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9dc319dc-e1d1-407d-8f29-4e3e18ff298f
📒 Files selected for processing (8)
.github/workflows/gitops-update.yml.github/workflows/self-pr-validation.ymlconfig/deployment-matrix.yamldocs/gitops-update-workflow.mdsrc/lint/deployment-matrix/README.mdsrc/lint/deployment-matrix/action.ymlsrc/notify/pr-lint-reporter/README.mdsrc/notify/pr-lint-reporter/action.yml
Resolves 6 medium-severity findings from github-advanced-security:
CODE INJECTION (4 findings — actions/code-injection/medium):
- Move `${{ github.workflow_ref }}` to step env: WORKFLOW_REF
- Bonus: replace `echo | sed -E 's|.*@||'` with bash `${VAR##*@}`
- Eliminates injection vectors at lines 106 + 108
- Move resolve_clusters outputs (has_clusters, clusters) to step env:
HAS_CLUSTERS + RESOLVED_SERVERS in apply_tags step
- Move inputs.yaml_key_mappings + inputs.configmap_updates to step env:
MAPPINGS + CONFIGMAP_MAPPINGS
- Replace `${{ env.IS_BETA/RC/PRODUCTION/SANDBOX }}` with direct
`$IS_BETA/...` (already in job-level env, no need to re-interpolate)
- Replace `${{ github.ref }}` with `${GITHUB_REF}` (auto-set by runner)
UNTRUSTED CHECKOUT (2 findings — actions/untrusted-checkout/medium):
- Add `persist-credentials: false` to manifest sparse checkout (read-only,
no credentials needed, never executes code from this checkout)
- Document trust model inline for the GitOps repo checkout (workflow_call
is not triggered by untrusted PRs; inputs.gitops_repository comes from
trusted internal callers; MANAGE_TOKEN is required for the subsequent
commit/push step, so we cannot drop persist-credentials there)
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/gitops-update.yml (1)
195-198:⚠️ Potential issue | 🟠 MajorDo not swallow manifest/query failures here.
The trailing
|| truestill turns malformed YAML, missing tools, or a brokenjqquery into the same empty-resolution path as “app not registered”, so the workflow exits successfully instead of surfacing a real topology failure.Suggested fix
RESOLVED=$(yq -o=json '.' "$MANIFEST" \ | jq -r --arg app "$APP_NAME" \ '.clusters | to_entries[] | select(.value.apps // [] | index($app)) | .key' \ - | sort -u || true) + | sort -u)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gitops-update.yml around lines 195 - 198, The RESOLVED assignment is swallowing failures because of the trailing "|| true"; update the RESOLVED logic (the pipeline using yq/jq with MANIFEST and APP_NAME) to detect and propagate real errors instead of converting them to an empty value: remove the "|| true", capture and check the pipeline/command exit status (or enable errexit for the step) and, on error, log the failing command and exit non‑zero so malformed YAML, missing tools, or bad jq queries fail the workflow rather than being treated as “app not registered.”
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gitops-update.yml:
- Around line 99-119: The code currently parses github.workflow_ref (which
points to the caller) and may checkout the wrong ref; add a required workflow
input named workflow_ref (default callers should pass ${{ github.workflow_ref }}
or their known pinned ref) and replace the sed parsing logic with using that
input directly: stop extracting REF from github.workflow_ref in the Resolve
shared-workflows ref step and instead set ref="${{ inputs.workflow_ref }}" (or
assign it to GITHUB_OUTPUT via steps.shared_ref) so the Checkout deployment
matrix manifest step uses the passed workflow_ref; update the reusable workflow
inputs to include workflow_ref and adjust any references to
steps.shared_ref.outputs.ref or similar to source the new input.
---
Duplicate comments:
In @.github/workflows/gitops-update.yml:
- Around line 195-198: The RESOLVED assignment is swallowing failures because of
the trailing "|| true"; update the RESOLVED logic (the pipeline using yq/jq with
MANIFEST and APP_NAME) to detect and propagate real errors instead of converting
them to an empty value: remove the "|| true", capture and check the
pipeline/command exit status (or enable errexit for the step) and, on error, log
the failing command and exit non‑zero so malformed YAML, missing tools, or bad
jq queries fail the workflow rather than being treated as “app not registered.”
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5a5313c9-d193-45b6-9ffd-2619d0b198e0
📒 Files selected for processing (2)
.github/workflows/gitops-update.yml.github/workflows/self-pr-validation.yml
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gitops-update.yml:
- Around line 18-29: The deploy_in_* inputs (deploy_in_firmino,
deploy_in_clotilde, deploy_in_anacleto) were changed to "force-off only"
semantics which can silently break callers who set them to true expecting to add
clusters; update the code and docs to mitigate this by (1) documenting the
semantic change in the migration guide and calling out that apps must be listed
in the deployment matrix to be deployed, and (2) adding runtime warnings in the
deployment input-processing logic that detect when a caller passed
deploy_in_<cluster>: true but the app is not present in that cluster's manifest
(emit a clear warning mentioning the cluster and app name); implement the
warning near the existing validation that currently warns only when an app is in
no cluster so it reuses the same validation flow and message style.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2b5758c8-abf1-471d-ab63-9475e540737a
📒 Files selected for processing (1)
.github/workflows/gitops-update.yml
1. [CRITICAL] Replace `github.workflow_ref` with `github.job_workflow_sha` for manifest checkout. In reusable workflows, github.workflow_ref points to the CALLER's workflow file/ref, not the called reusable workflow — my previous design would have failed for every external caller. `job_workflow_sha` is the commit SHA of the running reusable workflow, which is exactly what we need. Bonus: SHA is more secure than textual ref, and removes the need for the `Resolve shared-workflows ref` step entirely (−18 lines). 2. [HIGH] Remove `|| true` from the RESOLVED pipeline. Silenced yq/jq failures would collapse into the "app not registered" warning path, hiding real manifest/query errors. Now fails fast on parse errors; empty RESOLVED from a successful query remains the legitimate "no matching clusters" case (handled explicitly below). 3. [MEDIUM] Rename config/deployment-matrix.yaml → .yml to match the repo convention (77 .yml files vs 2 .yaml). Updated all references: workflow input default, self-pr-validation gate, composite default, README docs, and the workflow doc. 4. [LOW] Add prominent migration callout to docs about deploy_in_* semantic change — apps must be in the manifest; inputs only subtract. Declined: per-cluster warning when deploy_in_<cluster>: true but app is absent from that cluster's manifest list. Inputs default to true, so this would fire for every app missing from any cluster on every run — noise without signal. Existing "app in zero clusters" warning already covers the actionable case.
actionlint v1.7.x (pinned via raven-actions/actionlint@v2.1.2) does not
yet include `github.job_workflow_sha` in its GitHub context schema,
triggering a false-positive "property not defined" error on the previous
direct reference.
Replace the inline `${{ github.job_workflow_sha }}` expression with an
intermediate step that reads the equivalent auto-set env var
GITHUB_JOB_WORKFLOW_SHA and exports it as a step output. Functionally
identical (the runner populates both from the same source) but the
`steps.X.outputs.Y` expression is recognized by actionlint.
Also adds a defensive guard that fails fast if GITHUB_JOB_WORKFLOW_SHA
is empty — which would mean the workflow is being called outside a
reusable-workflow context, catching that misconfiguration loudly.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/gitops-update.yml (2)
7-33:⚠️ Potential issue | 🟠 MajorAdd the required
dry_runinput before merging.This reusable workflow still mutates the GitOps repo and can trigger ArgoCD sync, but
on.workflow_call.inputsdoes not expose adry_runboolean. That leaves callers without the required preview path for the new manifest-driven rollout. As per coding guidelines, "Every reusable workflow must supportworkflow_calltrigger, expose explicitinputs, and always include adry_runinput (type: boolean,default: false)."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gitops-update.yml around lines 7 - 33, The workflow is missing the required dry_run input under workflow_call.inputs; add a new boolean input named dry_run with type: boolean and default: false alongside the existing inputs (e.g., near gitops_repository, app_name, deploy_in_firmino, etc.) so callers can opt into preview mode; ensure the input key is exactly dry_run and its default is false to satisfy the reusable-workflow guideline and avoid mutating the GitOps repo when callers request a dry run.
71-83:⚠️ Potential issue | 🟠 MajorDeclare an explicit workflow-level
permissions:block.
permissionsis still undeclared, so this workflow inherits repository defaults. For a shared reusable workflow that checks out repositories and downloads artifacts, that is broader than the repo baseline and should be pinned to least privilege explicitly. As per coding guidelines, "Always declare explicitpermissions:block at workflow level; default tocontents: read; only escalate per-job when needed."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gitops-update.yml around lines 71 - 83, Add an explicit workflow-level permissions block to the reusable workflow so it no longer inherits repo defaults: declare permissions: contents: read (and any other least-privilege scopes required) at the top-level of the workflow file that contains the update_gitops job, then if a specific step or job (e.g., update_gitops job, steps: checkout or artifact download) needs extra scopes, escalate only within that job via a per-job permissions override; ensure the workflow-level block is present before jobs: and limits privileges to the minimum.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gitops-update.yml:
- Around line 215-220: The current "no-cluster" branches set has_clusters=false
in the resolve step (writing to GITHUB_OUTPUT) but downstream artifact
download/validation and update/commit steps still run and can fail before the
no-op is honored; update the workflow so any steps that download artifacts,
validate, or perform updates (the artifact/download/validate and commit steps)
are conditional on steps.resolve_clusters.outputs.has_clusters == 'true' (or
move an early conditional/if that exits the job immediately when RESOLVED is
empty or has_clusters == 'false'), and ensure the resolve step writes the output
key has_clusters consistently (GITHUB_OUTPUT) so the later if conditions (e.g.,
uses of steps.resolve_clusters.outputs.has_clusters) can gate those steps and
prevent them from running when an app is not in any cluster.
---
Outside diff comments:
In @.github/workflows/gitops-update.yml:
- Around line 7-33: The workflow is missing the required dry_run input under
workflow_call.inputs; add a new boolean input named dry_run with type: boolean
and default: false alongside the existing inputs (e.g., near gitops_repository,
app_name, deploy_in_firmino, etc.) so callers can opt into preview mode; ensure
the input key is exactly dry_run and its default is false to satisfy the
reusable-workflow guideline and avoid mutating the GitOps repo when callers
request a dry run.
- Around line 71-83: Add an explicit workflow-level permissions block to the
reusable workflow so it no longer inherits repo defaults: declare permissions:
contents: read (and any other least-privilege scopes required) at the top-level
of the workflow file that contains the update_gitops job, then if a specific
step or job (e.g., update_gitops job, steps: checkout or artifact download)
needs extra scopes, escalate only within that job via a per-job permissions
override; ensure the workflow-level block is present before jobs: and limits
privileges to the minimum.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9dbe78ac-275b-42a0-a723-793548e8c828
📒 Files selected for processing (1)
.github/workflows/gitops-update.yml
| if [[ -z "$RESOLVED" ]]; then | ||
| echo "::warning::App '$APP_NAME' is not registered in any cluster of the deployment matrix." | ||
| echo "If this is a new app, add it to $MANIFEST in shared-workflows." | ||
| echo "clusters=" >> "$GITHUB_OUTPUT" | ||
| echo "has_clusters=false" >> "$GITHUB_OUTPUT" | ||
| exit 0 |
There was a problem hiding this comment.
The new zero-cluster path does not actually exit cleanly.
These branches set has_clusters=false, but nothing consumes that until Line 368. The artifact download/validation steps still run first, so an app that is missing from the matrix—or fully suppressed by the force-off inputs—can still fail before the intended no-op path is reached. Gate the artifact/update/commit steps on steps.resolve_clusters.outputs.has_clusters == 'true', or branch earlier so the job stops immediately.
Also applies to: 257-261, 366-372
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/gitops-update.yml around lines 215 - 220, The current
"no-cluster" branches set has_clusters=false in the resolve step (writing to
GITHUB_OUTPUT) but downstream artifact download/validation and update/commit
steps still run and can fail before the no-op is honored; update the workflow so
any steps that download artifacts, validate, or perform updates (the
artifact/download/validate and commit steps) are conditional on
steps.resolve_clusters.outputs.has_clusters == 'true' (or move an early
conditional/if that exits the job immediately when RESOLVED is empty or
has_clusters == 'false'), and ensure the resolve step writes the output key
has_clusters consistently (GITHUB_OUTPUT) so the later if conditions (e.g., uses
of steps.resolve_clusters.outputs.has_clusters) can gate those steps and prevent
them from running when an app is not in any cluster.
Description
Introduces
config/deployment-matrix.yamlas the single source of truth for which apps deploy to which Kubernetes clusters via thegitops-update.ymlreusable workflow. Adds anacleto as the third deployment target alongside firmino and clotilde.The workflow now:
app_nameby scanningclusters.<name>.apps.deploy_in_<cluster>inputs as force-off overrides — they can subtract clusters from the resolved set but cannot add a cluster the manifest doesn't list. Prevents accidental cross-cluster spillover while still allowing emergency containment.A new
src/lint/deployment-matrixcomposite validates the manifest schema, app/cluster integrity, and duplicates on every PR that touchesconfig/deployment-matrix.yaml(gated job inself-pr-validation.yml, follows the same Python-embedded pattern ascomposite-schema).The manifest topology was inferred empirically from the
midaz-firmino-gitopsrepo by cross-referencing folder presence with CI commit history (ci(<app>): update image tags), so only real workflow callers are included — manually-managed apps likeunderwriter,jd-mock-api,mock-btg-server,control-plane,platform-console,ledger,dockerhub-secretare excluded.Cluster membership today (18 apps):
firmino: 12 apps (core platform + plugins + plugin-br-bank-transfer)clotilde: 17 apps (core + plugins + plugin-br-bank-transfer + Lerian platform suite: backoffice-console, cs-platform, forge, lerian-map, tenant-manager)anacleto: 12 apps (core + plugins + plugin-br-bank-transfer-jd)Files affected:
.github/workflows/gitops-update.yml— manifest loader, resolve_clusters step, force-off semanticsconfig/deployment-matrix.yaml— new manifest (source of truth)src/lint/deployment-matrix/{action.yml,README.md}— new composite validator.github/workflows/self-pr-validation.yml— wires the new lint jobsrc/notify/pr-lint-reporter/{action.yml,README.md}— adds deployment-matrix result row to the PR commentdocs/gitops-update-workflow.md— new "Deployment Matrix" section, updated examples, migration guideType of Change
feat: New workflow or new input/output/step in an existing workflowfix: Bug fix in a workflowperf: Performance improvementrefactor: Internal restructuring with no behavior changedocs: Documentation onlyci: Changes to self-CIchore: Dependency bumps, config updates, maintenancetest: Adding or updating testsBREAKING CHANGE: Callers must update their configuration after this PRBreaking Changes
None. This is fully backward-compatible:
deploy_in_firmino: true, deploy_in_clotilde: truecontinue working — the inputs are now force-off (true is no-op, false subtracts).Testing
yaml.safe_load)plugin-br-bank-transfervsplugin-br-bank-transfer-jdedge case (caught a critical bug:jq contains([$app])does substring matching on strings — switched toindex($app)for exact equality)@developor the beta tagCaller repo / workflow run: Pending real release after merge.
Related Issues
Closes #
Summary by CodeRabbit