Add Calico Ingress Gateway to calico system networkpolicy#4581
Add Calico Ingress Gateway to calico system networkpolicy#4581alexh-tigera wants to merge 4 commits into
Conversation
nelljerram
left a comment
There was a problem hiding this comment.
Looking good, except
- I don't think operator should unconditionally install default-deny policy for tigera-gateway namespace
- I do think we should have an FV test that does that, and then checks CIG can be successfully deployed.
| objs = append(objs, | ||
| pr.gatewayCertgenAllowTigeraPolicy(), | ||
| pr.gatewayControllerAllowTigeraPolicy(), | ||
| networkpolicy.CalicoSystemDefaultDeny(common.TigeraGatewayNamespace), |
There was a problem hiding this comment.
I don't think the operator should unconditionally install this default deny, right? Were you only doing this for testing? Instead I think we should have an FV test that applies the default-deny policy and then verifies that CIG can still be successfully deployed.
There was a problem hiding this comment.
Yes, I missed that after removing the other changes I made. While doing that I also realized that it needs a watch for the network policies, so I've fix the default deny and added the watches in 48a0ab8.
Where do FV tests like that live? I see some similar things in Nevermind, I thought that the felix/fv/ ?test/ dir was unit tests in this repo.
There was a problem hiding this comment.
Yes, that's the one, test/*.go in this repo. Also see the fv target in the Makefile.
There was a problem hiding this comment.
Please do LMK when the PR is ready for another review.
There was a problem hiding this comment.
Ok, it took me a while to understand how to implement the FV, but I'm happy with it now. Please have a look.
2f7336c to
35a7a8f
Compare
The certgen job in the tigera-gateway namespace needs DNS and kube-apiserver egress to create TLS secrets. Under a global default-deny policy, this traffic is blocked causing the job to fail with i/o timeout reaching the API server. Add a NetworkPolicy in the calico-system tier that allows the certgen pods (app == 'certgen') DNS and kube-apiserver egress, plus a default-deny policy for the tigera-gateway namespace, matching the pattern used by other components. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The envoy-gateway controller needs DNS and kube-apiserver egress to function. Under a global default-deny policy, the controller fails with i/o timeout reaching the API server. Add a NetworkPolicy in the calico-system tier that allows the controller pods (k8s-app == 'gateway-api-controller') DNS egress, kube-apiserver egress, and ingress on its serving ports (xDS gRPC, ratelimit, wasm, metrics, webhook). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WaitToAddTierWatch and WaitToAddNetworkPolicyWatches for the gateway controller's allow policies, matching the pattern used by other controllers. Add tier existence check during reconciliation. Remove the namespace-scoped default deny from tigera-gateway because Gateway resources dynamically create Envoy proxy pods whose traffic patterns are determined by the customer's Gateway and HTTPRoute configuration. Keep the controller-access and certgen-access allow policies so control plane pods still function under a customer's global default deny. Add FV test verifying the allow policies are created. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers the scenario the calico-system tier allow policies were added to support: a default-tier v3 NetworkPolicy with selector all() is applied to tigera-gateway, then a GatewayAPI CR is created. The test asserts the envoy-gateway-certgen Job reaches Succeeded and the envoy-gateway Deployment becomes Ready despite the default-deny, which is only possible if gateway-api-certgen-access and gateway-api-controller-access in the higher-precedence calico-system tier correctly permit the traffic those workloads need. Also exercises the v3.NetworkPolicy watch by deleting each allow policy and asserting it's re-created, guarding against regressions in WaitToAddNetworkPolicyWatches. Adds calico/envoy-gateway to the FV kind cluster image load list so the gateway Deployment's image is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5dbd768 to
cba4016
Compare
nelljerram
left a comment
There was a problem hiding this comment.
A few detailed questions.
Also, after seeing this in detail, I'm now a bit worried about the big picture.
-
Prior to this PR, I believe we do have an overall consistent position on NetworkPolicy - namely:
- except for tigera-gateway, the operator configures default-deny for Calico namespaces, and exceptions to that that are required for Calico's own operation
- for tigera-gateway we do not do that because Gateway deployments are also created in the tigera-namespace and we can't predict what NetworkPolicy they would need, if there was a default-deny
- our users are encouraged to configure default-deny for their own namespace - but that is up to them.
-
However I'm not sure that is adequately documented anywhere.
-
This PR arguably makes it look more like a bug that we don't implement NetworkPolicy to allow specific Gateway flows.
I'm sorry I didn't spot this before, but now I'm wondering it it would be better just to add "tigera-gateway" to the set of excepted namespaces at https://docs.tigera.io/calico-enterprise/latest/network-policy/default-deny
| return fmt.Errorf("gatewayapi-controller failed to watch Installation resource: %w", err) | ||
| } | ||
|
|
||
| go utils.WaitToAddTierWatch(networkpolicy.CalicoTierName, c, opts.K8sClientset, log, tierWatchReady) |
There was a problem hiding this comment.
How does this work, given that we don't pass tierWatchReady outside of this function - so no other code can depend on whether that tier is ready yet.
| go utils.WaitToAddTierWatch(networkpolicy.CalicoTierName, c, opts.K8sClientset, log, tierWatchReady) | ||
| go utils.WaitToAddNetworkPolicyWatches(c, opts.K8sClientset, log, []types.NamespacedName{ | ||
| {Name: gatewayapi.GatewayControllerPolicyName, Namespace: common.TigeraGatewayNamespace}, | ||
| {Name: gatewayapi.GatewayCertgenPolicyName, Namespace: common.TigeraGatewayNamespace}, |
There was a problem hiding this comment.
Could you comment what these watches are for? IIUC they mean that Reconcile will be called once these policies exist - but why, aka what will then happen next?
| } | ||
| } else { | ||
| includeV3NetworkPolicy = true | ||
| } |
There was a problem hiding this comment.
So, IIUC, instead of this check, you could add tierWatchReady into the ReconcileGatewayAPI struct and then use r.tierWatchReady.IsReady() - see policyrecommendation_controller.go for an example.
There was a problem hiding this comment.
👍 I started by using apiserver_controller.go as an example but sounds like that's the wrong approach - if we decide to make the fix in operator rather than excluding tigera-gateway from the default-deny rules, I'll look at policyrecommendation_controller.go.
| } | ||
|
|
||
| if includeV3NetworkPolicy { | ||
| policyComponent := gatewayapi.GatewayPolicy(gatewayConfig) |
There was a problem hiding this comment.
Does this need to be a separate component, instead of being added into the Objects of GatewayAPIImplementationComponent ?
| // GatewayPolicy returns a Component that renders the calico-system tier network policies | ||
| // for the gateway control plane. This is rendered separately from the main gateway component | ||
| // so that a failure to create v3 NetworkPolicy resources does not block the rest of the | ||
| // gateway resources. We intentionally do not include a namespace-scoped default deny here |
There was a problem hiding this comment.
Why should it not block? Is there a common scenario where we expect to fail to create the network policies?
| }) | ||
|
|
||
| objsToCreate, _ := policyComp.Objects() | ||
| Expect(objsToCreate).To(HaveLen(2)) |
There was a problem hiding this comment.
It sounds like you intended to add some specific OpenShift testing here.
| // available), this one stands up the full Calico + APIServer stack so that the projectcalico.org/v3 | ||
| // aggregated API becomes reachable. With v3 present, the controller should render the two | ||
| // calico-system tier allow policies (GatewayPolicy component) and keep them in place via the | ||
| // NetworkPolicy watch. |
There was a problem hiding this comment.
What does "keep them in place via the NetworkPolicy watch" mean?
| operatorDone = createInstallation(c, mgr, shutdownContext, nil) | ||
| verifyCalicoHasDeployed(c) | ||
|
|
||
| By("Installing the APIServer so the projectcalico.org/v3 aggregated API becomes available") |
There was a problem hiding this comment.
Is this still necessary? I thought Casey had made the API server optional now?
There was a problem hiding this comment.
I'll look into this, I haven't rebased since his changes.
|
|
||
| By("Deleting each NetworkPolicy and verifying the watch triggers re-creation") | ||
| // Both policies are registered in the same WaitToAddNetworkPolicyWatches call; exercising | ||
| // each one guards against a future refactor accidentally dropping an entry from that list. |
There was a problem hiding this comment.
OK, this answers one of my earlier answers. However, what does "against a future refactor" mean? How could we distinguish between an accidental drop and an intended drop? And won't it be the future version of the code running, not this version, anyway?
Once this is clarified, please also explain next to the WaitToAdd ... calls.
Would that be too permissive to exclude the tigera-gateway ns from default deny? If the operator controls the permissions, we can limit it to DNS + kube-apiserver access. IIRC the concern was that excluding tigera-gateway was too open? |
Description
Fixed an issue where the Calico Ingress Gateway failed to deploy when the recommended default deny policy was applied.
This adds:
Release Note
For PR author
make gen-files~~make gen-versions~~For PR reviewers
A note for code reviewers - all pull requests must have the following:
kind/bugif this is a bugfix.kind/enhancementif this is a a new feature.enterpriseif this PR applies to Calico Enterprise only.