DEVOP-579: add NetworkPolicy egress rollout plan (doc only) #7
Open
srt0422 wants to merge 2 commits into
4 issues found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="tickets/devop-579-network-policy-rollout.md">
<violation number="1" location="tickets/devop-579-network-policy-rollout.md:33">
P2: The discovery checklist omits the DEVOP-579 requirement to explicitly flag suspect egress destinations (webhook/pastebin/ngrok/169.254.169.254).</violation>
<violation number="2" location="tickets/devop-579-network-policy-rollout.md:52">
P1: Phase 3 uses 24-hour soak windows, but linked Linear issue DEVOP-579 specifies 48-hour soaks for staged rollout.</violation>
<violation number="3" location="tickets/devop-579-network-policy-rollout.md:64">
P2: Phase 4 is missing the DEVOP-579 requirement to document the rollout/policies in SECURITY-RUNBOOK.md.</violation>
<violation number="4" location="tickets/devop-579-network-policy-rollout.md:74">
P2: This plan marks ingress NetworkPolicies as out of scope, but linked Linear issue DEVOP-579 requires default-deny for both egress and ingress.</violation>
</file>
Architecture diagram
sequenceDiagram
participant K8s as Kubernetes Clusters (13)
participant CNI as CNI Plugin (Calico/Cilium)
participant Hubble as Hubble/Flow Logs
participant Harbor as Harbor Registry
participant Kyverno as Kyverno Policy Engine
participant Runbook as Rollback Runbook
Note over K8s,Runbook: Phase 0 — Pre-flight Assessment
K8s->>CNI: Confirm NetworkPolicy support
alt CNI supports NetworkPolicy
CNI-->>K8s: Calico, Cilium, or Antrea confirmed
else Flannel without --network-policy
CNI-->>K8s: Need to migrate CNI first
end
K8s->>Hubble: Enable flow logs on staging cluster
Note over Hubble: Capture 7 days baseline traffic
Note over K8s,Runbook: Phase 1 — Discovery (per namespace)
loop For each namespace in priority order
K8s->>Hubble: Query egress flow logs (7 days)
Hubble-->>K8s: Destination CIDRs, DNS, ports
K8s->>K8s: Categorize traffic (internal/infra/vendor/registries/customer)
K8s->>K8s: Document in network-policies/discovery/<namespace>.md
end
Note over K8s,Runbook: Phase 2 — Allowlist Authoring
K8s->>K8s: Create default-deny.yaml (deny all egress except DNS)
K8s->>K8s: Create allowlist.yaml (derived from Phase 1)
Note over K8s: DNS to kube-dns/coredns (53/udp, 53/tcp)
Note over K8s: NTP always allowed (123/udp)
Note over K8s: Cluster-internal pod-to-pod allowed by default
K8s->>Harbor: Dependency on DEVOP-589 (Harbor proxy-cache)
alt DEVOP-589 landed
Note over K8s: Allowlists reference Harbor proxy instead of direct registries
else Not yet landed
Note over K8s: Allowlists must allow direct ghcr.io, docker.io, etc.
end
Note over K8s,Runbook: Phase 3 — Staged Rollout
K8s->>K8s: Day 1: Apply to 1 staging namespace, observe 24h
K8s->>K8s: Day 2: Apply to all staging namespaces, observe 24h
K8s->>K8s: Day 3: Apply to 1 production namespace (lowest risk), observe 24h
K8s->>K8s: Days 4-5: Roll forward remaining namespaces (lowest blast-radius first)
alt Egress broken for workload
K8s->>Runbook: kubectl delete networkpolicy default-deny -n <ns>
Runbook-->>K8s: Egress restored immediately
end
Note over K8s,Runbook: Phase 4 — Steady State
alt DEVOP-588 landed (Kyverno on all clusters)
Kyverno->>K8s: Auto-flag new namespaces without default-deny
K8s->>K8s: Monthly review of discovery documents
else Not yet landed
Note over K8s: Manual enforcement only
end
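The Phase 2 step in the diagram ("deny all egress except DNS", with NTP always allowed) could look roughly like the sketch below, which builds the manifest as a Python dict and prints it as JSON. This is only an illustration: the policy name `default-deny` is taken from the rollback command in the diagram, but the `kube-system` namespace label and the `k8s-app: kube-dns` pod label are assumptions about how the DNS pods are labelled in these clusters, and the function name `default_deny_egress` is hypothetical. The pod-to-pod rule and the per-namespace allowlist are left out.

```python
import json


def default_deny_egress(namespace: str) -> dict:
    """Build a default-deny egress NetworkPolicy that still allows DNS and NTP."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    # DNS to kube-dns/CoreDNS on 53/udp and 53/tcp.
                    # ASSUMPTION: these labels match the cluster's DNS pods.
                    "to": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": "kube-system"
                                }
                            },
                            "podSelector": {
                                "matchLabels": {"k8s-app": "kube-dns"}
                            },
                        }
                    ],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                },
                {
                    # NTP on 123/udp; omitting "to" allows any destination.
                    "ports": [{"protocol": "UDP", "port": 123}],
                },
            ],
        },
    }


if __name__ == "__main__":
    # "example-namespace" is a placeholder, not a namespace from the plan.
    print(json.dumps(default_deny_egress("example-namespace"), indent=2))
```

Because any egress not matched by a rule is dropped once a pod is selected by a policy with `policyTypes: ["Egress"]`, the allowlist derived from Phase 1 would need to be applied alongside this policy rather than after it.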
NetworkPolicy egress hardening is a 3-engineer-week project that must NOT be rushed: `default-deny-egress` silently breaks every workload that has an un-enumerated outbound dependency. The bulk of the work is discovery (7 days of baseline flow logs per namespace), not deployment. This doc captures the staged rollout plan so subsequent loop runs (or whoever picks up execution) don't redo the planning work.

Covers:
- Phase 0: pre-flight (CNI compat, flow log enablement).
- Phase 1: discovery (per-namespace egress enumeration).
- Phase 2: allowlist authoring.
- Phase 3: staged rollout (1 staging → 1 prod → fan out).
- Phase 4: steady-state (Kyverno schema enforcement, monthly review).

Dependencies:
- DEVOP-589 (Harbor proxy-cache) must land before Phase 2 or the allowlists will churn.
- DEVOP-588 (Kyverno on all clusters) is a soft dep for Phase 4.

This PR adds the doc only. No NetworkPolicy is deployed.

Linear: https://linear.app/alloralabs/issue/DEVOP-579

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ook hook, ingress in scope

Four findings from cubic addressed:

1. tickets/devop-579-network-policy-rollout.md:33 (P2): Phase 1 discovery checklist now explicitly enumerates suspect egress destinations to flag for incident review (webhook receivers, pastebins, ngrok/tunnel services, 169.254.169.254 / cloud metadata, residential dynamic-DNS). Each flagged destination gets an owner-review gate before allowlist inclusion.
2. tickets/devop-579-network-policy-rollout.md:52 (P1): Phase 3 staged rollout soak windows changed from 24h to the 48h spec'd by DEVOP-579, and now require a clean soak before advancing.
3. tickets/devop-579-network-policy-rollout.md:64 (P2): Phase 4 steady-state now mandates documenting the rollout, allowlist layout, rollback command, and on-call escalation path in SECURITY-RUNBOOK.md (DEVOP-571).
4. tickets/devop-579-network-policy-rollout.md:74 (P2): Ingress default-deny is no longer out-of-scope. Added a dedicated section laying out the parallel ingress cohort (same Phases 0–4 shape with ingress-specific discovery, allowlist patterns, slower production rollout because ingress blast-radius is higher, and Kyverno asserting both directions in Phase 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
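As a rough illustration of the suspect-destination check in finding 1, a discovery script might flag destinations for owner review along these lines. The marker substrings and the helper name `flag_suspect` are hypothetical and chosen only for the example; the plan document does not specify how matching is done, and a real checklist would need a maintained list rather than substring matching.

```python
import ipaddress
from typing import Optional

# ASSUMPTION: illustrative substrings only, not taken from the plan document.
SUSPECT_HOST_MARKERS = {
    "webhook": "webhook receiver",
    "pastebin": "pastebin",
    "ngrok": "tunnel service",
    "duckdns": "dynamic DNS",
}


def flag_suspect(destination: str) -> Optional[str]:
    """Return a review reason if the destination looks suspect, else None."""
    try:
        addr = ipaddress.ip_address(destination)
    except ValueError:
        addr = None  # not an IP literal, treat it as a hostname

    if addr is not None:
        # 169.254.0.0/16 is link-local; 169.254.169.254 is the address
        # commonly used for cloud instance metadata.
        if addr.is_link_local:
            return "cloud metadata / link-local"
        return None

    host = destination.lower()
    for marker, reason in SUSPECT_HOST_MARKERS.items():
        if marker in host:
            return reason
    return None
```

A destination that returns a reason would go through the owner-review gate before being added to an allowlist; one that returns `None` is simply not flagged by this sketch, which is not the same as being safe.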
f377778 to 7b93a18
Summary

Adds `tickets/devop-579-network-policy-rollout.md`, a staged plan for rolling `default-deny-egress` NetworkPolicies across our 13 clusters.

Why doc-first

NetworkPolicy egress hardening is a 3-engineer-week project where the bulk of effort is discovery, not deployment. `default-deny-egress` silently breaks every workload that has an un-enumerated outbound dependency, so rushing it is production-impacting. Capturing the plan now (Phases 0-4, rollback procedure, dependencies on DEVOP-588/589) means subsequent loop runs or human owners can pick up execution without redoing the planning.

This PR adds only the plan document. No NetworkPolicy is deployed.

Test plan

Related

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary by cubic

Adds a phased plan to roll out `default-deny-egress` and parallel `default-deny-ingress` Kubernetes NetworkPolicies across all 13 clusters per DEVOP-579; documentation only (`tickets/devop-579-network-policy-rollout.md`), no policies are deployed. The plan includes 48-hour soak windows with a clean-soak gate, a suspect-egress checklist (webhook/pastebin/ngrok/cloud-metadata/dynamic-DNS) for review, a documented rollback command, and `SECURITY-RUNBOOK.md` updates.

Written for commit 7b93a18.
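The 48-hour soak with a clean-soak gate could be expressed as a small check like the one below. This is a sketch under assumptions: the plan does not define "clean" in this summary, so the example treats it as zero unexpected denied flows during the window, and the names `SOAK_WINDOW` and `may_advance` are hypothetical.

```python
from datetime import datetime, timedelta

SOAK_WINDOW = timedelta(hours=48)  # soak length specified by DEVOP-579


def may_advance(applied_at: datetime, now: datetime, denied_flows: int) -> bool:
    """Allow the next rollout stage only after a full, clean soak.

    ASSUMPTION: "clean" means no unexpected denied flows were observed
    since the policy was applied.
    """
    soak_complete = now - applied_at >= SOAK_WINDOW
    return soak_complete and denied_flows == 0
```

Under this reading, a stage that sees denied flows does not advance even after 48 hours; the allowlist is fixed (or the policy rolled back) and the soak restarts.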