Enhancement: Add drift detection and automatic reconciliation#668
Enhancement: Add drift detection and automatic reconciliation#668eshulman2 wants to merge 1 commit intok-orc:mainfrom
Conversation
mandre
left a comment
There was a problem hiding this comment.
What part of the code needs changing? I expect we detail how shouldReconcile changes.
3134799 to
a9f9abf
Compare
|
|
||
| 1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id` | ||
| 2. If not found and the resource was originally created by ORC (not imported), ORC recreates it | ||
| 3. The new resource ID is stored in `status.id` |
There was a problem hiding this comment.
My question wasn't really about the obvious case where you'd delete out of band a resource for which you have a weak dependency, but for where we had hard dependency. An example with Subnet -> Network would be more telling. What happens if a network is re-created? Will the subnet be recreated as well?
Inconsistent states can happen, and will happen. Someone who force-deleted a resource, a bug in OpenStack, an operator who made changes to the database directly... We should explain what we would do whenever we will have that case.
This probably deserves a separate section.
a9f9abf to
eda8a6f
Compare
eda8a6f to
c610a4f
Compare
c610a4f to
0fa5f18
Compare
Proposal for drift detection feature.
0fa5f18 to
ab42db9
Compare
mdbooth
left a comment
There was a problem hiding this comment.
A few comments, but looks very reasonable to me.
|
|
||
| This ensures the controller automatically triggers reconciliation after the configured period. Controller-runtime's work queue handles deduplication of reconcile requests, so no additional time-based checks are needed in `shouldReconcile`. | ||
|
|
||
| **Resources in terminal error are not resynced**: When a resource is in a terminal error state (e.g., invalid configuration, unrecoverable OpenStack error), periodic resync is not scheduled. Terminal errors indicate issues that cannot be resolved through automatic retry and require manual intervention to fix the underlying problem. This prevents wasted reconciliation cycles on resources that are known to be in an unrecoverable state. |
There was a problem hiding this comment.
My memory definitely needs a refresh, but do we sometimes set a terminal state due to an issue that can't be resolved automatically but might be resolved manually by an administrator? e.g. A resource is in ERROR state which is later cleared after admin fixes some underlying issue. If so, it might be worth including terminal states in any resync process.
|
|
||
| 1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id` | ||
| 2. If not found and the resource was originally created by ORC (not imported), ORC recreates it | ||
| 3. The new resource ID is stored in `status.id` |
There was a problem hiding this comment.
I suspect this will need careful thought per resource. In the specific case of network and subnet, IIRC a subnet cannot exist without a network, so if the network has been deleted then the subnet has also been deleted. From this POV, 'hard' dependencies may actually be the easy case.
|
|
||
| #### Implementation Details | ||
|
|
||
| At the end of a successful reconciliation (when no other reschedule is pending), the controller schedules the next resync: |
There was a problem hiding this comment.
What do you want to do on controller restart?
|
|
||
| For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource. | ||
|
|
||
| **Note on dependent resources**: OpenStack enforces referential integrity for most resources (e.g., Networks cannot be deleted while Subnets exist). If resources are deleted through means that bypass these checks (direct database manipulation, OpenStack bugs), drift detection preserves ORC's existing reconciliation behavior: |
There was a problem hiding this comment.
I think ORC's behaviour is undefined in the presence of bugs and direct DB manipulation, tbh. I'm not sure how you could define it.
| **Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten. | ||
|
|
||
| **Mitigation**: | ||
| - Document that ORC should be the sole manager of resources it creates |
There was a problem hiding this comment.
I think we already do this?
|
|
||
| **Mitigation**: | ||
| - Disabled by default; when enabled, recommend conservative intervals (e.g., 10 hours) | ||
| - Add random jitter to resync times to avoid thundering herd: since reconciliation already uses "requeue after X duration", jitter simply adds a random offset (e.g., ±10%) to the resync period, spreading resyncs over time rather than having them fire simultaneously |
There was a problem hiding this comment.
Unless it all gets reset when the controller restarts. You might want to store a last synced time in the status? Or some other solution.
Proposal for drift detection feature.