OCPBUGS-77254: fix(globalps): watch Machine updates to fix node labeling race condition #7803
Conversation
|
Pipeline controller notification: this repository is configured in LGTM mode. |
|
@judexzhu: This pull request references Jira Issue OCPBUGS-77254, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. |
|
No actionable comments were generated in the recent review. 🎉 Configuration used: .coderabbit.yaml (review profile: CHILL, plan: Pro). 2 files selected for processing; 1 skipped as similar to previous changes.
Walkthrough

Adds a watch on Cluster API Machines that enqueues reconciliation when a Machine's Status.NodeRef transitions from nil to non-nil, and expands unit tests to cover the Machine NodeRef update predicate and skipping machines with a nil NodeRef.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Machine as Machine (CAPI)
    participant APIServer as API Server
    participant Controller as globalps Controller
    participant Node as Node
    rect rgba(200,200,255,0.5)
        Machine->>APIServer: Update Status.NodeRef (nil -> non-nil)
    end
    rect rgba(200,255,200,0.5)
        APIServer->>Controller: Watch triggers (mapped enqueue)
        Controller->>APIServer: Request Machine by NamespacedName
        Controller->>Node: Label node referenced by Machine.Status.NodeRef
        Node-->>Controller: Confirm labels applied
    end
```
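For concreteness, the nil-to-non-nil filter described in the walkthrough could be expressed as a controller-runtime predicate along these lines. This is a minimal sketch, assuming the `capiv1 = sigs.k8s.io/cluster-api/api/v1beta1` alias used elsewhere in this PR; the function name `machineNodeRefSetPredicate` is illustrative, not necessarily the PR's actual identifier.

```go
package globalps

import (
	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// machineNodeRefSetPredicate fires only when a Machine's Status.NodeRef
// transitions from nil to non-nil, i.e. the moment CAPI links the Machine
// to its Node. (Illustrative sketch, not the PR's exact code.)
func machineNodeRefSetPredicate() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldMachine, okOld := e.ObjectOld.(*capiv1.Machine)
			newMachine, okNew := e.ObjectNew.(*capiv1.Machine)
			if !okOld || !okNew {
				return false
			}
			return oldMachine.Status.NodeRef == nil && newMachine.Status.NodeRef != nil
		},
		// A freshly created Machine never has a NodeRef yet, and deletions
		// are irrelevant to labeling, so all other events are filtered out.
		CreateFunc:  func(event.CreateEvent) bool { return false },
		DeleteFunc:  func(event.DeleteEvent) bool { return false },
		GenericFunc: func(event.GenericEvent) bool { return false },
	}
}
```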
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
|
Why are we labeling machines and letting it propagate? Why don't we label nodes directly? |
|
Hi @enxebre. I think your idea is the better long-term design (correct me if I've misunderstood): when a NodePool is created, the NodePool controller would check whether the upgrade strategy is Replace and, if so, add the label directly to the node template. However, that would change the labeling mechanism across multiple controllers/operators and touch code paths used by every NodePool. If possible, I'd suggest opening another story to follow up on that architectural improvement. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: csrwng, judexzhu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve. |
|
/assign @jparrill |
jparrill
left a comment
Thanks for the PR! I've dropped some comments.
The globalps controller had a race condition where the Node CREATE event fired before CAPI set Machine.Status.NodeRef on the management cluster. This caused labelNodesForGlobalPullSecret to skip the new node because it could not map the Machine to the Node. With no retry mechanism, the node remained unlabeled until an unrelated event triggered a reconcile, resulting in a ~15-minute delay before the global-pull-secret-syncer DaemonSet pod was scheduled.

Add a watch on capiv1.Machine via the CP cluster cache that triggers reconciliation when Machine.Status.NodeRef transitions from nil to non-nil, ensuring the node is labeled promptly.

Fixes: https://issues.redhat.com/browse/OCPBUGS-77254
Signed-off-by: Jude Zhu <judzhu@redhat.com>
Assisted-by: claude-4.6-opus (via Cursor)
Co-authored-by: Cursor <cursoragent@cursor.com>
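A rough sketch of how such a watch could be wired through the controller builder, assuming controller-runtime ~v0.16-era APIs. The names mgr, cpCluster, r, and setupMachineWatch are illustrative stand-ins, and machineNodeRefSetPredicate refers to the predicate sketched earlier in this thread; in the PR itself the watch would be added to the controller's existing builder rather than a standalone helper.

```go
package globalps

import (
	"context"

	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/cluster"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// setupMachineWatch wires a Machine watch sourced from the management ("CP")
// cluster's cache -- not the hosted cluster's -- into a controller builder.
// Assumed names throughout; not verbatim code from the PR.
func setupMachineWatch(mgr ctrl.Manager, cpCluster cluster.Cluster, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("globalps").
		WatchesRawSource(
			source.Kind(cpCluster.GetCache(), &capiv1.Machine{}),
			handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, obj client.Object) []reconcile.Request {
				// Enqueue the Machine's own key, matching the "Request
				// Machine by NamespacedName" step in the sequence diagram.
				return []reconcile.Request{{NamespacedName: client.ObjectKeyFromObject(obj)}}
			}),
			// The nil -> non-nil predicate sketched earlier in this thread.
			builder.WithPredicates(machineNodeRefSetPredicate()),
		).
		Complete(r)
}
```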
|
@judexzhu: This pull request references Jira Issue OCPBUGS-77254, which is invalid:
|
/jira refresh |
|
@judexzhu: This pull request references Jira Issue OCPBUGS-77254, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
|
/lgtm |
|
Scheduling tests matching the |
|
/retest-required |
|
/test e2e-aks |
|
/verified by @gaol |
|
@gaol: This PR has been marked as verified by @gaol. |
|
@judexzhu: all tests passed! Full PR test history. Your PR dashboard. |
|
@judexzhu: Jira Issue Verification Checks: Jira Issue OCPBUGS-77254 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-03-07-091925 |
|
/jira backport release-4.21, release-4.20 |
|
@gaol: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges. |
|
@openshift-ci-robot: new pull request created: #7882 |
|
@openshift-ci-robot: new pull request created: #7883 |
Summary
- Fixes the ~15-minute delay in scheduling the `global-pull-secret-syncer` DaemonSet pod on newly created nodes
- Adds a watch on `capiv1.Machine` updates on the management cluster, triggering reconciliation when `Machine.Status.NodeRef` transitions from nil to non-nil
- Machines with a nil `Machine.Status.NodeRef` are correctly skipped during labeling

Problem
The globalps controller had a race condition: when a new Node was created on the hosted cluster, the controller received the Node CREATE event and ran `labelNodesForGlobalPullSecret()`. However, the CAPI Machine controller on the management cluster had not yet set `Machine.Status.NodeRef`, so the new node could not be mapped to its Machine and was skipped for labeling with `hypershift.openshift.io/nodepool-globalps-enabled: "true"`. With no Machine watch, no Node update handling, and no `RequeueAfter`, the node remained unlabeled until an unrelated event (e.g., a kube-system Secret change) triggered a reconcile -- often ~15 minutes later.
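As a hypothetical illustration of this failure mode (not the PR's actual code), the Machine-to-Node mapping necessarily drops machines whose NodeRef is still unset, and before this fix nothing re-ran the mapping once CAPI filled the field in:

```go
package globalps

import (
	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// The label that gates scheduling of the global-pull-secret-syncer DaemonSet.
const globalPSLabel = "hypershift.openshift.io/nodepool-globalps-enabled"

// nodeNamesToLabel maps Machines to the hosted-cluster node names that should
// receive the globalps label. A Machine whose NodeRef has not been set yet
// cannot be mapped, so its node is silently skipped -- the race described
// above. (Illustrative helper; names and structure are assumptions.)
func nodeNamesToLabel(machines []capiv1.Machine) []string {
	var names []string
	for i := range machines {
		nodeRef := machines[i].Status.NodeRef
		if nodeRef == nil {
			// CAPI has not linked this Machine to a Node yet; skip it.
			continue
		}
		names = append(names, nodeRef.Name)
	}
	return names
}
```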
Solution

Watch `capiv1.Machine` on the CP cluster cache with a predicate that only fires when `Status.NodeRef` transitions from nil to non-nil. This ensures the controller reconciles at the exact moment CAPI links the Machine to the Node, so `labelNodesForGlobalPullSecret` can label the node and the DaemonSet pod is scheduled promptly.
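The predicate behavior that the walkthrough says is now covered by unit tests could be exercised roughly like this; a hypothetical test reusing the machineNodeRefSetPredicate sketch from earlier in this thread:

```go
package globalps

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

func TestMachineNodeRefPredicate(t *testing.T) {
	pred := machineNodeRefSetPredicate() // sketched earlier; assumed name

	unlinked := &capiv1.Machine{}
	linked := unlinked.DeepCopy()
	linked.Status.NodeRef = &corev1.ObjectReference{Kind: "Node", Name: "worker-0"}

	// The nil -> non-nil transition must trigger a reconcile.
	if !pred.Update(event.UpdateEvent{ObjectOld: unlinked, ObjectNew: linked}) {
		t.Fatal("expected predicate to fire when NodeRef is first set")
	}
	// An update where NodeRef was already set must be ignored.
	if pred.Update(event.UpdateEvent{ObjectOld: linked, ObjectNew: linked}) {
		t.Fatal("expected predicate to ignore updates with NodeRef already set")
	}
	// Machines that still have no NodeRef must be skipped.
	if pred.Update(event.UpdateEvent{ObjectOld: unlinked, ObjectNew: unlinked}) {
		t.Fatal("expected predicate to ignore updates with nil NodeRef")
	}
}
```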
Test plan

- `make lint-fix` passes with 0 issues

Fixes

https://issues.redhat.com/browse/OCPBUGS-77254
Made with Cursor
Summary by CodeRabbit

- Tests: Added unit tests covering the Machine NodeRef update predicate and skipping machines with a nil NodeRef.
- Improvements: Nodes are now labeled promptly once CAPI links a Machine to its Node, so the global pull secret syncer pod is scheduled without delay.