Skip to content

Add Calico Ingress Gateway to calico system networkpolicy#4581

Open
alexh-tigera wants to merge 4 commits into
tigera:masterfrom
alexh-tigera:ev-6476-add-gateway-to-calico-system-networkpolicy
Open

Add Calico Ingress Gateway to calico system networkpolicy#4581
alexh-tigera wants to merge 4 commits into
tigera:masterfrom
alexh-tigera:ev-6476-add-gateway-to-calico-system-networkpolicy

Conversation

@alexh-tigera
Copy link
Copy Markdown
Member

Description

Fixed an issue where the Calico Ingress Gateway failed to deploy when the recommended default deny policy was applied.

This adds:

  • DNS and api-server egress rules for the certgen job in order to apply the generated certificates.
  • DNS and api-server egress rules for envoy-gateway to manage envoy proxies.

Release Note

* Fixed an issue where Calico Ingress Gateway failed to deploy when default-deny rules were applied.

For PR author

  • Tests for change.
  • ~~ If changing pkg/apis/, run make gen-files~~
  • ~~ If changing versions, run make gen-versions~~

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

Copy link
Copy Markdown
Member

@nelljerram nelljerram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, except

  • I don't think operator should unconditionally install default-deny policy for tigera-gateway namespace
  • I do think we should have an FV test that does that, and then checks CIG can be successfully deployed.

Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
objs = append(objs,
pr.gatewayCertgenAllowTigeraPolicy(),
pr.gatewayControllerAllowTigeraPolicy(),
networkpolicy.CalicoSystemDefaultDeny(common.TigeraGatewayNamespace),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the operator should unconditionally install this default deny, right? Were you only doing this for testing? Instead I think we should have an FV test that applies the default-deny policy and then verifies that CIG can still be successfully deployed.

Copy link
Copy Markdown
Member Author

@alexh-tigera alexh-tigera Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I missed that after removing the other changes I made. While doing that I also realized that it needs a watch for the network policies, so I've fix the default deny and added the watches in 48a0ab8.

Where do FV tests like that live? I see some similar things in felix/fv/ ? Nevermind, I thought that the test/ dir was unit tests in this repo.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the one, test/*.go in this repo. Also see the fv target in the Makefile.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do LMK when the PR is ready for another review.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it took me a while to understand how to implement the FV, but I'm happy with it now. Please have a look.

@alexh-tigera alexh-tigera force-pushed the ev-6476-add-gateway-to-calico-system-networkpolicy branch 9 times, most recently from 2f7336c to 35a7a8f Compare April 15, 2026 21:26
alexh-tigera and others added 4 commits April 16, 2026 11:10
The certgen job in the tigera-gateway namespace needs DNS and
kube-apiserver egress to create TLS secrets. Under a global
default-deny policy, this traffic is blocked causing the job to
fail with i/o timeout reaching the API server.

Add a NetworkPolicy in the calico-system tier that allows the
certgen pods (app == 'certgen') DNS and kube-apiserver egress,
plus a default-deny policy for the tigera-gateway namespace,
matching the pattern used by other components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The envoy-gateway controller needs DNS and kube-apiserver egress
to function. Under a global default-deny policy, the controller
fails with i/o timeout reaching the API server.

Add a NetworkPolicy in the calico-system tier that allows the
controller pods (k8s-app == 'gateway-api-controller') DNS egress,
kube-apiserver egress, and ingress on its serving ports (xDS gRPC,
ratelimit, wasm, metrics, webhook).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WaitToAddTierWatch and WaitToAddNetworkPolicyWatches for the
gateway controller's allow policies, matching the pattern used by
other controllers. Add tier existence check during reconciliation.

Remove the namespace-scoped default deny from tigera-gateway because
Gateway resources dynamically create Envoy proxy pods whose traffic
patterns are determined by the customer's Gateway and HTTPRoute
configuration. Keep the controller-access and certgen-access allow
policies so control plane pods still function under a customer's
global default deny.

Add FV test verifying the allow policies are created.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers the scenario the calico-system tier allow policies were added
to support: a default-tier v3 NetworkPolicy with selector all() is
applied to tigera-gateway, then a GatewayAPI CR is created. The test
asserts the envoy-gateway-certgen Job reaches Succeeded and the
envoy-gateway Deployment becomes Ready despite the default-deny,
which is only possible if gateway-api-certgen-access and
gateway-api-controller-access in the higher-precedence calico-system
tier correctly permit the traffic those workloads need.

Also exercises the v3.NetworkPolicy watch by deleting each allow
policy and asserting it's re-created, guarding against regressions
in WaitToAddNetworkPolicyWatches.

Adds calico/envoy-gateway to the FV kind cluster image load list so
the gateway Deployment's image is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexh-tigera alexh-tigera force-pushed the ev-6476-add-gateway-to-calico-system-networkpolicy branch from 5dbd768 to cba4016 Compare April 16, 2026 18:12
@alexh-tigera alexh-tigera marked this pull request as ready for review April 20, 2026 15:43
@alexh-tigera alexh-tigera requested a review from a team as a code owner April 20, 2026 15:43
@alexh-tigera alexh-tigera requested a review from nelljerram April 20, 2026 15:43
Copy link
Copy Markdown
Member

@nelljerram nelljerram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few detailed questions.

Also, after seeing this in detail, I'm now a bit worried about the big picture.

  1. Prior to this PR, I believe we do have an overall consistent position on NetworkPolicy - namely:

    • except for tigera-gateway, the operator configures default-deny for Calico namespaces, and exceptions to that that are required for Calico's own operation
    • for tigera-gateway we do not do that because Gateway deployments are also created in the tigera-namespace and we can't predict what NetworkPolicy they would need, if there was a default-deny
    • our users are encouraged to configure default-deny for their own namespace - but that is up to them.
  2. However I'm not sure that is adequately documented anywhere.

  3. This PR arguably makes it look more like a bug that we don't implement NetworkPolicy to allow specific Gateway flows.

I'm sorry I didn't spot this before, but now I'm wondering it it would be better just to add "tigera-gateway" to the set of excepted namespaces at https://docs.tigera.io/calico-enterprise/latest/network-policy/default-deny

return fmt.Errorf("gatewayapi-controller failed to watch Installation resource: %w", err)
}

go utils.WaitToAddTierWatch(networkpolicy.CalicoTierName, c, opts.K8sClientset, log, tierWatchReady)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work, given that we don't pass tierWatchReady outside of this function - so no other code can depend on whether that tier is ready yet.

go utils.WaitToAddTierWatch(networkpolicy.CalicoTierName, c, opts.K8sClientset, log, tierWatchReady)
go utils.WaitToAddNetworkPolicyWatches(c, opts.K8sClientset, log, []types.NamespacedName{
{Name: gatewayapi.GatewayControllerPolicyName, Namespace: common.TigeraGatewayNamespace},
{Name: gatewayapi.GatewayCertgenPolicyName, Namespace: common.TigeraGatewayNamespace},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment what these watches are for? IIUC they mean that Reconcile will be called once these policies exist - but why, aka what will then happen next?

}
} else {
includeV3NetworkPolicy = true
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, IIUC, instead of this check, you could add tierWatchReady into the ReconcileGatewayAPI struct and then use r.tierWatchReady.IsReady() - see policyrecommendation_controller.go for an example.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I started by using apiserver_controller.go as an example but sounds like that's the wrong approach - if we decide to make the fix in operator rather than excluding tigera-gateway from the default-deny rules, I'll look at policyrecommendation_controller.go.

}

if includeV3NetworkPolicy {
policyComponent := gatewayapi.GatewayPolicy(gatewayConfig)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a separate component, instead of being added into the Objects of GatewayAPIImplementationComponent ?

// GatewayPolicy returns a Component that renders the calico-system tier network policies
// for the gateway control plane. This is rendered separately from the main gateway component
// so that a failure to create v3 NetworkPolicy resources does not block the rest of the
// gateway resources. We intentionally do not include a namespace-scoped default deny here
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should it not block? Is there a common scenario where we expect to fail to create the network policies?

})

objsToCreate, _ := policyComp.Objects()
Expect(objsToCreate).To(HaveLen(2))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you intended to add some specific OpenShift testing here.

Comment thread test/gatewayapi_test.go
// available), this one stands up the full Calico + APIServer stack so that the projectcalico.org/v3
// aggregated API becomes reachable. With v3 present, the controller should render the two
// calico-system tier allow policies (GatewayPolicy component) and keep them in place via the
// NetworkPolicy watch.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "keep them in place via the NetworkPolicy watch" mean?

Comment thread test/gatewayapi_test.go
operatorDone = createInstallation(c, mgr, shutdownContext, nil)
verifyCalicoHasDeployed(c)

By("Installing the APIServer so the projectcalico.org/v3 aggregated API becomes available")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still necessary? I thought Casey had made the API server optional now?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this, I haven't rebased since his changes.

Comment thread test/gatewayapi_test.go

By("Deleting each NetworkPolicy and verifying the watch triggers re-creation")
// Both policies are registered in the same WaitToAddNetworkPolicyWatches call; exercising
// each one guards against a future refactor accidentally dropping an entry from that list.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this answers one of my earlier answers. However, what does "against a future refactor" mean? How could we distinguish between an accidental drop and an intended drop? And won't it be the future version of the code running, not this version, anyway?

Once this is clarified, please also explain next to the WaitToAdd ... calls.

@alexh-tigera
Copy link
Copy Markdown
Member Author

I'm sorry I didn't spot this before, but now I'm wondering it it would be better just to add "tigera-gateway" to the set of excepted namespaces at https://docs.tigera.io/calico-enterprise/latest/network-policy/default-deny

Would that be too permissive to exclude the tigera-gateway ns from default deny? If the operator controls the permissions, we can limit it to DNS + kube-apiserver access. IIRC the concern was that excluding tigera-gateway was too open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants