Skip to content

Enhancement: Add drift detection and automatic reconciliation#668

Open
eshulman2 wants to merge 1 commit intok-orc:mainfrom
eshulman2:drift-d-proposal
Open

Enhancement: Add drift detection and automatic reconciliation#668
eshulman2 wants to merge 1 commit intok-orc:mainfrom
eshulman2:drift-d-proposal

Conversation

@eshulman2
Copy link
Contributor

Proposal for drift detection feature.

@github-actions github-actions bot added the semver:patch No API change label Feb 3, 2026
Copy link
Collaborator

@mandre mandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What part of the code needs changing? I expect we detail how shouldReconcile changes.


1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
3. The new resource ID is stored in `status.id`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question wasn't really about the obvious case where you'd delete out of band a resource for which you have a weak dependency, but for where we had hard dependency. An example with Subnet -> Network would be more telling. What happens if a network is re-created? Will the subnet be recreated as well?

Inconsistent states can happen, and will happen. Someone who force-deleted a resource, a bug in OpenStack, an operator who made changes to the database directly... We should explain what we would do whenever we will have that case.

This probably deserves a separate section.

@eshulman2 eshulman2 requested a review from mandre March 11, 2026 14:22
@eshulman2 eshulman2 requested a review from mandre March 16, 2026 11:13
@eshulman2 eshulman2 requested a review from mandre March 19, 2026 10:25
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but looks very reasonable to me.


This ensures the controller automatically triggers reconciliation after the configured period. Controller-runtime's work queue handles deduplication of reconcile requests, so no additional time-based checks are needed in `shouldReconcile`.

**Resources in terminal error are not resynced**: When a resource is in a terminal error state (e.g., invalid configuration, unrecoverable OpenStack error), periodic resync is not scheduled. Terminal errors indicate issues that cannot be resolved through automatic retry and require manual intervention to fix the underlying problem. This prevents wasted reconciliation cycles on resources that are known to be in an unrecoverable state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My memory definitely needs a refresh, but do we sometimes set a terminal state due to an issue that can't be resolved automatically but might be resolved manually by an administrator? e.g. A resource is in ERROR state which is later cleared after admin fixes some underlying issue. If so, it might be worth including terminal states in any resync process.


1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
3. The new resource ID is stored in `status.id`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this will need careful thought per resource. In the specific case of network and subnet, IIRC a subnet cannot exist without a network, so if the network has been deleted then the subnet has also been deleted. From this POV, 'hard' dependencies may actually be the easy case.


#### Implementation Details

At the end of a successful reconciliation (when no other reschedule is pending), the controller schedules the next resync:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you want to do on controller restart?


For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.

**Note on dependent resources**: OpenStack enforces referential integrity for most resources (e.g., Networks cannot be deleted while Subnets exist). If resources are deleted through means that bypass these checks (direct database manipulation, OpenStack bugs), drift detection preserves ORC's existing reconciliation behavior:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ORC's behaviour is undefined in the presence of bugs and direct DB manipulation, tbh. I'm not sure how you could define it.

**Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten.

**Mitigation**:
- Document that ORC should be the sole manager of resources it creates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already do this?


**Mitigation**:
- Disabled by default; when enabled, recommend conservative intervals (e.g., 10 hours)
- Add random jitter to resync times to avoid thundering herd: since reconciliation already uses "requeue after X duration", jitter simply adds a random offset (e.g., ±10%) to the resync period, spreading resyncs over time rather than having them fire simultaneously
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless it all gets reset when the controller restarts. You might want to store a last synced time in the status? Or some other solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

semver:patch No API change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants