fix(controller): clean stale status entries during node deletion reco… #258
SeeyaVhora wants to merge 1 commit into
/cc @ajaysundark /ok-to-test |
|
@SeeyaVhora Can you start by creating an issue describing in detail how to reproduce the symptoms you aim to fix here? |
Thank you for your response, @ajaysundark. Please let me know if any additional details or adjustments would be helpful. |
|
Hello @ajaysundark @mrunalp @SergeyKanzhelev, please take a look at this PR and review it. |
|
This seems similar to #204. @rawadhossain, would you have time for a first-pass review of this? |
|
@SeeyaVhora Could you help address the failing tests? |
|
/cc @rawadhossain |
|
@ajaysundark: GitHub didn't allow me to request PR reviews from the following users: rawadhossain. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. |
|
/assign @rawadhossain |
|
@ajaysundark: GitHub didn't allow me to assign the following users: rawadhossain. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. |
|
Thanks @ajaysundark, I can take a first pass look at this. I’ll review it shortly. |
Summary
This PR fixes a stale status lifecycle issue during node deletion reconciliation in the NodeReadinessRule controller.
Previously, deleted nodes could remain temporarily or indefinitely persisted inside the NodeEvaluations and FailedNodes status fields due to reconciliation ordering and cleanup semantics.
The controller would:
This created:
FailedNodes.
Root Cause
cleanupDeletedNodes() executed after updateRuleStatus() and independently patched the API using its own retry loop.
As a result:
FailedNodes entries for deleted nodes were never cleaned consistently.
Changes
Reconciliation Ordering Fix
Reordered reconciliation flow so cleanupDeletedNodes() executes before updateRuleStatus().
This ensures the single authoritative status patch never contains stale deleted-node entries.
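To illustrate why the ordering matters, here is a minimal sketch, not the controller's actual code. The `status` and `fakeAPI` types and the `reconcileOldOrder` / `reconcileNewOrder` helpers are hypothetical stand-ins; only the names cleanupDeletedNodes() and updateRuleStatus() come from this PR, and their bodies here are simplified assumptions. The old-order variant also leaves out the independent patch that the previous cleanup issued, so it only shows what the main status patch carries.

```go
package main

import "fmt"

// status is a hypothetical, simplified rule status that tracks node names only.
type status struct {
	NodeEvaluations []string
}

// fakeAPI is a hypothetical stand-in for the API server. It records the
// last status that the single authoritative patch would have written.
type fakeAPI struct {
	persisted []string
}

// updateRuleStatus copies the in-memory status into the fake API,
// standing in for the one status patch per reconcile.
func (a *fakeAPI) updateRuleStatus(s *status) {
	a.persisted = append([]string(nil), s.NodeEvaluations...)
}

// cleanupDeletedNodes drops entries for nodes that are no longer live.
// It only mutates the in-memory status and never talks to the API.
func cleanupDeletedNodes(s *status, live map[string]bool) {
	kept := make([]string, 0, len(s.NodeEvaluations))
	for _, name := range s.NodeEvaluations {
		if live[name] {
			kept = append(kept, name)
		}
	}
	s.NodeEvaluations = kept
}

// reconcileOldOrder persists first and cleans up afterwards, so the
// persisted status can still carry the deleted node.
func reconcileOldOrder(a *fakeAPI, s *status, live map[string]bool) {
	a.updateRuleStatus(s)
	cleanupDeletedNodes(s, live)
}

// reconcileNewOrder cleans up first, so the persisted status is already pruned.
func reconcileNewOrder(a *fakeAPI, s *status, live map[string]bool) {
	cleanupDeletedNodes(s, live)
	a.updateRuleStatus(s)
}

func main() {
	live := map[string]bool{"node-a": true} // node-b has been deleted

	oldAPI := &fakeAPI{}
	reconcileOldOrder(oldAPI, &status{NodeEvaluations: []string{"node-a", "node-b"}}, live)
	fmt.Println("old order persisted:", oldAPI.persisted)

	newAPI := &fakeAPI{}
	reconcileNewOrder(newAPI, &status{NodeEvaluations: []string{"node-a", "node-b"}}, live)
	fmt.Println("new order persisted:", newAPI.persisted)
}
```

In this sketch the old order persists both node names, while the new order persists only the live node.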
In-Memory Status Cleanup
Refactored cleanupDeletedNodes() to mutate the in-memory rule.Status state directly instead of performing an independent GET/PATCH retry loop.
Persistence is now fully owned by updateRuleStatus().
Lifecycle Consistency
Extended deleted-node cleanup semantics to FailedNodes in addition to NodeEvaluations, ensuring both status fields remain consistent during node churn and autoscaling scenarios.
Regression Coverage
Added deterministic regression coverage validating that deleted nodes are absent from the synchronous post-reconcile status state.
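As a rough sketch of what in-memory cleanup across both status fields, plus a synchronous check, could look like: the `RuleStatus`, `NodeEvaluation`, and `NodeFailure` shapes and the `pruneDeletedNodes` helper below are hypothetical and simplified, and may not match the controller's real API types or test harness.

```go
package main

import "fmt"

// NodeEvaluation and NodeFailure are hypothetical, simplified entry types.
type NodeEvaluation struct {
	NodeName string
	Ready    bool
}

type NodeFailure struct {
	NodeName string
	Reason   string
}

// RuleStatus is a hypothetical stand-in for the rule's status.
type RuleStatus struct {
	NodeEvaluations []NodeEvaluation
	FailedNodes     []NodeFailure
}

// pruneDeletedNodes removes entries for nodes missing from live from both
// NodeEvaluations and FailedNodes, in memory only, and returns how many
// entries were dropped. No API calls happen here.
func pruneDeletedNodes(st *RuleStatus, live map[string]struct{}) int {
	removed := 0

	evals := make([]NodeEvaluation, 0, len(st.NodeEvaluations))
	for _, e := range st.NodeEvaluations {
		if _, ok := live[e.NodeName]; ok {
			evals = append(evals, e)
		} else {
			removed++
		}
	}
	st.NodeEvaluations = evals

	failed := make([]NodeFailure, 0, len(st.FailedNodes))
	for _, f := range st.FailedNodes {
		if _, ok := live[f.NodeName]; ok {
			failed = append(failed, f)
		} else {
			removed++
		}
	}
	st.FailedNodes = failed

	return removed
}

func main() {
	st := &RuleStatus{
		NodeEvaluations: []NodeEvaluation{{NodeName: "node-a", Ready: true}, {NodeName: "node-b"}},
		FailedNodes:     []NodeFailure{{NodeName: "node-b", Reason: "example failure"}},
	}
	live := map[string]struct{}{"node-a": {}} // node-b has been deleted

	removed := pruneDeletedNodes(st, live)

	// Regression-style check: the deleted node must be gone from both
	// fields immediately after the call, with no waiting or polling.
	for _, e := range st.NodeEvaluations {
		if e.NodeName == "node-b" {
			panic("stale NodeEvaluations entry for node-b")
		}
	}
	for _, f := range st.FailedNodes {
		if f.NodeName == "node-b" {
			panic("stale FailedNodes entry for node-b")
		}
	}
	fmt.Println("entries removed:", removed)
}
```

Because the helper has no API dependency, the check can run without a fake client or eventual-consistency polling, which is what makes it deterministic.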
Before vs After
cleanupDeletedNodes() behavior
FailedNodes cleanup
Impact