Merged
80 commits
09a3171
[WIP] Add support for machine preservation through annotations
thiyyakat Oct 29, 2025
01ea111
Add MachinePreserveTimeout to SafetyOptions.
thiyyakat Nov 5, 2025
3cc48c9
Add PreserveExpiryTime to `machine.Status.CurrentStatus`.
thiyyakat Nov 5, 2025
456876a
Remove `AutoPreserveFailedMachineCount` from machine set
thiyyakat Nov 5, 2025
824aee8
Fix linting error
thiyyakat Nov 5, 2025
dc64d8f
Add generated files
thiyyakat Nov 5, 2025
63a798e
Add support for preserve=now on node and machine objects
thiyyakat Nov 5, 2025
5b16fbc
Update TODOs
thiyyakat Nov 5, 2025
995cbd0
[WIP] Implement add/remove/update of node and machine annotations
thiyyakat Nov 10, 2025
82995e4
Update preserve logic to honour node annotations over machine
thiyyakat Nov 13, 2025
98a9ad3
Add preservation logic in machineset controller. TODO: remove debug logs
thiyyakat Nov 19, 2025
aec64cf
Add drain logic post preservation of failed machine
thiyyakat Nov 19, 2025
e93ad0c
Fix return for reconcileMachineHealth. Unit tests passing
thiyyakat Nov 19, 2025
bb5786b
Update CRDs
thiyyakat Nov 19, 2025
bdbf811
Fix bug causing repeated requeuing
thiyyakat Nov 24, 2025
dd49c4d
Fix drain logic in machine preservation for Unknown->Failed case:
thiyyakat Nov 26, 2025
dbaa554
Fix toggle between now and when-failed when machine has not failed.
thiyyakat Nov 27, 2025
58cf017
Refactor changes to support auto-preservation of failed machines
thiyyakat Dec 4, 2025
d703562
Fix bugs that prevented MCS update, and auto-preservation of machines
thiyyakat Dec 5, 2025
66c85ff
Add support for uncordoning preserved node that is healthy
thiyyakat Dec 8, 2025
0e9db7f
Refactor code:
thiyyakat Dec 10, 2025
98ba495
Fix bug so that recovered preserved nodes are uncordoned
thiyyakat Dec 10, 2025
f598ccb
Minor changes
thiyyakat Dec 10, 2025
7adb3d0
Change verb used in log statements for machine/node name
thiyyakat Dec 10, 2025
e2c16ed
Fix mistake made during rebasing
thiyyakat Dec 10, 2025
164bae3
Change return types of preservation util functions such that only cal…
thiyyakat Dec 11, 2025
69a5040
Address review comments
thiyyakat Dec 12, 2025
92202c0
Remove incorrect json tag and regenerate CRDs.
thiyyakat Dec 18, 2025
3802a8d
Apply suggestions from code review - part 1
thiyyakat Dec 19, 2025
3314c23
Delete invalid gitlink
thiyyakat Dec 19, 2025
fe3464e
Address review comments- part 2:
thiyyakat Dec 22, 2025
fb0975e
Address review comments- part 3:
thiyyakat Dec 23, 2025
0616fd0
Address review comments- part 4:
thiyyakat Dec 23, 2025
3b97b81
Add unit tests for preservation logic in machine.go
thiyyakat Dec 24, 2025
0d8ed79
Refactor tests to reduce redundancy in code.
thiyyakat Dec 26, 2025
1995a90
Add tests for preservation logic in machine_util.go
thiyyakat Dec 29, 2025
b7856af
Refactor test code to reduce redundant code
thiyyakat Dec 31, 2025
22fecfb
Fix bugs after merging
thiyyakat Dec 31, 2025
8e6d11b
Remove testing code
thiyyakat Jan 6, 2026
bec6ea5
Address review comments - part 5: Change api fields to pointers
thiyyakat Jan 8, 2026
59f0f84
Fix Makefile
thiyyakat Jan 8, 2026
5ca583e
Add crds
thiyyakat Jan 8, 2026
fee5ebc
Address review comments - part 6: Replace function preserveExpiryTime…
thiyyakat Jan 8, 2026
4518eea
Address review comments - part 7:
thiyyakat Jan 13, 2026
87c8a7e
Address review comments - part 7:
thiyyakat Jan 13, 2026
5170a38
Fix apis.md
thiyyakat Jan 14, 2026
4f2d0a1
Fix apis.md and address review comments
thiyyakat Jan 14, 2026
71806b9
Modify nodeops.AddOrUpdateConditionsOnNode() to return updated node
thiyyakat Jan 16, 2026
b3b5749
Address review comments - part 8:
thiyyakat Jan 16, 2026
d795ae8
Handle auto-preserved case similar to when-failed case
thiyyakat Jan 16, 2026
177c681
Fix bugs, incorporate design change for when-failed, and add tests
thiyyakat Jan 16, 2026
31a6c6c
Revert Makefile changes
thiyyakat Jan 16, 2026
dafebb5
Add preservation tests for machineSet controller
thiyyakat Jan 19, 2026
345ee03
Update comments and fix minor bugs
thiyyakat Jan 19, 2026
adfa2e2
Address review comments - part 9
thiyyakat Jan 20, 2026
68d201b
Address review comments - part 10: Remove unnecessary nil checks whil…
thiyyakat Jan 21, 2026
4a8141a
Add machine-preserve-timeout flag
thiyyakat Jan 21, 2026
b1528fe
Address review comments - part 11:
thiyyakat Jan 22, 2026
af84f45
Address review comments - part 12:
thiyyakat Jan 23, 2026
6c94006
Handle edge cases:
thiyyakat Jan 23, 2026
499950d
Ensure reconcileClusterMachineSafetyAPIServer does not overwrite Pres…
thiyyakat Feb 3, 2026
8809343
Remove PreserveMachineAnnotationValuePreserveStoppedByMCM annotation …
thiyyakat Feb 10, 2026
872fc9c
Add usage doc for preservation feature
thiyyakat Feb 10, 2026
d211a66
Make changes to simplify design:
thiyyakat Feb 10, 2026
562857c
Add code to reconcile auto preservation, and reduce number of auto-pr…
thiyyakat Feb 12, 2026
a65fe28
Modify annotation handling to improve determinism: Introduced lastApp…
thiyyakat Feb 12, 2026
809b1bc
Update tests and handle edge cases
thiyyakat Feb 16, 2026
c2f82b4
Fix bugs introduced by latest changes
thiyyakat Feb 18, 2026
7a97d9c
Sync MCD's value of AutoPreserveFailedMachineMax to MCS on change
thiyyakat Feb 18, 2026
cddcaf5
Change proposal to reflect changes in design
thiyyakat Feb 20, 2026
4ec840a
Update usage doc.
thiyyakat Feb 20, 2026
9b9b792
Clean up comments
thiyyakat Feb 24, 2026
9c9079a
Clean up manageMachinePreservation
thiyyakat Feb 26, 2026
5d9d4e3
Address review comments
thiyyakat Mar 2, 2026
71b5d9f
Address review comments given by
thiyyakat Mar 3, 2026
c53ce27
Address review comments given by
thiyyakat Mar 6, 2026
c2ed3d2
Address review comments by
thiyyakat Mar 11, 2026
e6eef3a
Address review comments by
thiyyakat Mar 11, 2026
29f6992
Fix tests after rebasing
thiyyakat Apr 2, 2026
c32c35d
Fix bug introduced in manageReplicas while rebasing
thiyyakat Apr 2, 2026
101 changes: 101 additions & 0 deletions docs/documents/apis.md
@@ -513,6 +513,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
by default, which is treated as infinite deadline.</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
In the gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level, distributed amongst the worker&rsquo;s machine deployments</p>
</td>
</tr>
</table>
</td>
</tr>
@@ -678,6 +693,19 @@ int32
<em>(Optional)</em>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
</td>
</tr>
</table>
</td>
</tr>
@@ -833,6 +861,21 @@ Kubernetes meta/v1.Time
<p>Last update time of current status</p>
</td>
</tr>
<tr>
<td>
<code>preserveExpiryTime</code>
</td>
<td>
<em>
<a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#time-v1-meta">
Kubernetes meta/v1.Time
</a>
</em>
</td>
<td>
<p>PreserveExpiryTime is the time at which MCM will stop preserving the machine</p>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1071,6 +1114,22 @@ Kubernetes meta/v1.Duration
</tr>
<tr>
<td>
<code>machinePreserveTimeout</code>
</td>
<td>
<em>
<a href="https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#Duration">
Kubernetes meta/v1.Duration
</a>
</em>
</td>
<td>
<em>(Optional)</em>
<p>MachinePreserveTimeout is the timeout after which the machine preservation is stopped</p>
</td>
</tr>
<tr>
<td>
<code>disableHealthTimeout</code>
</td>
<td>
@@ -1398,6 +1457,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
by default, which is treated as infinite deadline.</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
In the gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level, distributed amongst the worker&rsquo;s machine deployments</p>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1860,6 +1934,19 @@ int32
<em>(Optional)</em>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1998,6 +2085,20 @@ LastOperation
<p>FailedMachines has summary of machines on which lastOperation Failed</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineCount</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>AutoPreserveFailedMachineCount has a count of the number of failed machines in the machineset that are currently auto-preserved</p>
</td>
</tr>
</tbody>
</table>
<br>
79 changes: 39 additions & 40 deletions docs/proposals/machine-preservation.md
@@ -29,17 +29,18 @@ Related Issue: https://github.com/gardener/machine-controller-manager/issues/100
## Proposal

In order to achieve the objectives mentioned, the following are proposed:
1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved,
and the time duration for which these machines will be preserved.
```
machineControllerManager:
autoPreserveFailedMax: 0
machinePreserveTimeout: 72h
```
* This configuration will be set per worker pool.
* Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `autoPreserveFailedMax` will be distributed across N machine deployments.
* `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
* Example: if `autoPreserveFailedMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
1. Enhance `worker` configuration in the `ShootSpec`, to specify the maximum number of failed machines that will be auto-preserved and the time duration for which machines will be preserved.
```
workers:
- name: example-worker
autoPreserveFailedMachineMax: 2
machineControllerManager:
machinePreserveTimeout: 72h
```
* This configuration will be set per worker pool.
* Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `autoPreserveFailedMachineMax` will be distributed across N machine deployments.
* `autoPreserveFailedMachineMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
* Example: if `autoPreserveFailedMachineMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
3. Allow user/operator to request for preservation of a specific machine/node with the use of annotations : `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
@@ -49,29 +50,28 @@ and the time duration for which these machines will be preserved.
- After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The machine phase is changed to `Running` and the CA may delete the node.
- If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
- The machine is drained of pods except for Daemonset pods.
- Pods (other than DaemonSet pods) are drained.
- The machine phase is changed to `Failed:Preserved`.
- `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
- After timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached:
6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMachineMax` is not breached:
- Pods (other than DaemonSet pods) are drained.
- The machine's phase is changed to `Failed:Preserved`.
- `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
- After timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
- Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
7. If a failed machine is currently in `Failed:Preserved` and before timeout its VM/node is found to be Healthy, the machine will be moved to `Running:Preserved`. After the timeout, it will be moved to `Running`.
The rationale behind moving the machine to `Running:Preserved` rather than `Running`, is to allow pods to get scheduled on to the healthy node again without the autoscaler scaling it down due to under-utilization.
8. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation: `node.machine.sapcloud.io/preserve=false`.
- Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMachineMax`.
.
7. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase by deleting the annotation: `node.machine.sapcloud.io/preserve`.
* MCM will move a machine thus annotated either to `Running` phase or `Terminating` depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
10. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`.
8. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
9. MCM will be modified to perform drain in `Failed` phase for preserved machines.
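
The per-zone distribution described in item 1 and the expiry formula in items 5 and 6 can be sketched as follows. This is only an illustration, not the MCM or Gardener implementation: the helper names `distributeAutoPreserveMax` and `computeExpiry` are hypothetical, and the proposal does not say how a remainder is handled, so the sketch assumes the first zones each receive one extra slot.

```go
package main

import (
	"fmt"
	"time"
)

// distributeAutoPreserveMax splits a worker-pool-level
// autoPreserveFailedMachineMax across n MachineDeployments (one per zone).
// Hypothetical helper. Assumption: when total is not divisible by n, the
// first (total % n) zones get one extra slot.
func distributeAutoPreserveMax(total, n int32) []int32 {
	if n <= 0 {
		return nil
	}
	perZone := make([]int32, n)
	base, extra := total/n, total%n
	for i := range perZone {
		perZone[i] = base
		if int32(i) < extra {
			perZone[i]++
		}
	}
	return perZone
}

// computeExpiry follows the formula stated in the proposal:
// PreserveExpiryTime = currentTime + machinePreserveTimeout.
// Hypothetical helper name.
func computeExpiry(now time.Time, machinePreserveTimeout time.Duration) time.Time {
	return now.Add(machinePreserveTimeout)
}

func main() {
	// The proposal's example: a limit of 2 across 2 zones gives 1 per zone.
	fmt.Println(distributeAutoPreserveMax(2, 2)) // prints [1 1]

	now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(computeExpiry(now, 72*time.Hour).Format(time.RFC3339))
}
```

If the limit must divide evenly across zones, as the third bullet of item 1 suggests, the remainder branch is never taken.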

## State Diagrams:

1. State Diagram for when a machine or its node is explicitly annotated for preservation:
```mermaid
```mermaid
stateDiagram-v2
state "Running" as R
state "Running + Requested" as RR
@@ -86,27 +86,26 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
RR --> F: on failure
F --> FP
FP --> T: on timeout or preserve=false
FP --> RP: if node Healthy before timeout
FP --> R: if node Healthy before timeout
T --> [*]
R-->RP: annotated with preserve=now
RP-->F: if node/VM not healthy
```
```

2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
```mermaid
stateDiagram-v2
state "Running" as R
state "Running:Preserved" as RP
state "Failed
(node drained)" as F
state "Failed:Preserved" as FP
state "Terminating" as T
[*] --> R
R-->F: on failure
F --> FP: if autoPreserveFailedMax not breached
F --> T: if autoPreserveFailedMax breached
F --> FP: if autoPreserveFailedMachineMax not breached
F --> T: if autoPreserveFailedMachineMax breached
FP --> T: on timeout or value=false
FP --> RP : if node Healthy before timeout
RP --> R: on timeout or preserve=false
FP --> R : if node Healthy before timeout
T --> [*]
```
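
The auto-preservation diagram can be read as a small transition function. The sketch below assumes the revised edges shown in this diff (`FP --> R` on recovery, no `Running:Preserved` state in the auto-preservation flow) and folds the intermediate `Failed (node drained)` state into the failure event. The type and function names are hypothetical, and reading "not breached" as `preservedCount < max` is an assumption.

```go
package main

import "fmt"

// Phase is a simplified machine phase for the auto-preservation flow.
type Phase string

const (
	Running         Phase = "Running"
	FailedPreserved Phase = "Failed:Preserved"
	Terminating     Phase = "Terminating"
)

// Event is something that can happen to a machine in this flow.
type Event int

const (
	Failure     Event = iota // machine fails (drain is implied)
	Timeout                  // machinePreserveTimeout elapsed
	NodeHealthy              // node found healthy before timeout
)

// nextPhase returns the phase after an event. preservedCount is the number
// of machines already auto-preserved in the MachineDeployment; max is its
// autoPreserveFailedMachineMax. Hypothetical helper, not MCM code.
func nextPhase(p Phase, e Event, preservedCount, max int32) Phase {
	switch {
	case p == Running && e == Failure:
		if preservedCount < max { // assumption: limit not yet breached
			return FailedPreserved
		}
		return Terminating
	case p == FailedPreserved && e == Timeout:
		return Terminating
	case p == FailedPreserved && e == NodeHealthy:
		return Running
	}
	return p // no transition defined for this pair
}

func main() {
	fmt.Println(nextPhase(Running, Failure, 0, 1))
	fmt.Println(nextPhase(Running, Failure, 1, 1))
	fmt.Println(nextPhase(FailedPreserved, NodeHealthy, 1, 1))
}
```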

@@ -128,21 +127,22 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
4. Operator analyzes the VM.


### Use Case 3: Auto-Preservation
### Use Case 3: Auto-Preservation of Failed Machine aiding in Failure Analysis and Recovery
**Scenario:** Machine fails unexpectedly, no prior annotation.
#### Steps:
1. Machine transitions to `Failed` phase.
2. Machine is drained.
3. If `autoPreserveFailedMax` is not breached, machine moved to `Failed:Preserved` phase by MCM.
3. If `autoPreserveFailedMachineMax` is not breached, machine moved to `Failed:Preserved` phase by MCM.
4. After `machinePreserveTimeout`, machine is terminated by MCM.
5. If machine is brought back to `Running` phase before timeout, pods can be scheduled on it again.

### Use Case 4: Early Release
**Scenario:** Operator has performed his analysis and no longer requires machine to be preserved.
#### Steps:
1. Machine is in `Running:Preserved` or `Failed:Preserved` phase.
2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node.
2. Operator removes `node.machine.sapcloud.io/preserve` from node.
3. MCM transitions machine to `Running` or `Terminating` for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired.
4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
4. If machine was auto-preserved, capacity becomes available for auto-preservation.

## Points to Note

@@ -151,13 +151,12 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
3. Consumers (with access to shoot cluster) can annotate Nodes they would like to preserve.
4. Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
5. If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value, and be synced to the Machine object.
7. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value.
7. If `autoPreserveFailedMachineMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
8. In case of a scale down of an MCD's replica count, `Preserved` machines will be the last to be scaled down. Replica count will always be honoured.
9. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
10. On increase/decrease of timeout, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
11. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
- harmonise machine flow
- shield from CA's internals
- make it generic and no longer CA specific
- allow a timeout to be specified
9. On increase/decrease of `machinePreserveTimeout`, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
10. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
- harmonise machine flow
- shield from CA's internals
- make it generic and no longer CA specific
- allow a timeout to be specified
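
The precedence rule in points 5 and 6 (a Node annotation, when present, overrides the Machine annotation) can be illustrated with a minimal sketch. The function `effectivePreserveValue` is hypothetical and operates on plain annotation maps rather than the real Node and Machine objects. Only the annotation key is taken from the proposal.

```go
package main

import "fmt"

// preserveKey is the preservation annotation key named in the proposal.
const preserveKey = "node.machine.sapcloud.io/preserve"

// effectivePreserveValue returns the preservation value that should be
// honoured and whether one is set at all. Hypothetical helper: the Node's
// annotation wins when present; otherwise the Machine's annotation is used.
// Pass a nil map for nodeAnnotations when there is no backing Node.
func effectivePreserveValue(nodeAnnotations, machineAnnotations map[string]string) (string, bool) {
	if v, ok := nodeAnnotations[preserveKey]; ok {
		return v, true
	}
	v, ok := machineAnnotations[preserveKey]
	return v, ok
}

func main() {
	node := map[string]string{preserveKey: "now"}
	machine := map[string]string{preserveKey: "when-failed"}

	v, _ := effectivePreserveValue(node, machine)
	fmt.Println(v) // prints now

	v, _ = effectivePreserveValue(nil, machine)
	fmt.Println(v) // prints when-failed
}
```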