-
Notifications
You must be signed in to change notification settings - Fork 66
TQ: Support for ZFS Key Rotation #9737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
plaidfinch
wants to merge
3
commits into
main
Choose a base branch
from
tq-zfs-key-rotation
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
+454
−6
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When Trust Quorum commits a new epoch, all encrypted U.2 datasets need
their encryption keys rotated. This change implements that flow:
- trust-quorum: Add watch channel to broadcast committed epoch changes
from NodeTask to subscribers
- sled-agent: Wire committed_epoch_rx to the config reconciler
- config-reconciler:
- Add KeyRotationError, RekeyRequest types for the rekey API
- Add rekey_datasets() batch operation on DatasetTaskHandle
- Add datasets_rekey() to DatasetTask for serialized key rotation
- Add rekey_for_epoch() to OmicronDatasets to coordinate rekeying
all managed disks when an epoch is committed
- Handle epoch change notifications in the reconciler run loop
- Add managed_disks() iterator to ExternalDisks
- illumos-utils:
- Add Zfs::change_key() using zfs-atomic-change-key crate
- Add ChangeKeyError type
- Add epoch field to DatasetProperties and include oxide:epoch in
ZFS property queries
- key-manager: Add Debug derives to key types
The rekey operation is idempotent: datasets already at the target epoch
are skipped. On startup, we process the initial epoch to catch any
missed rekeys from crashes.
Fixes #9587
plaidfinch
commented
Jan 28, 2026
Comment on lines
+297
to
+302
| pub(super) fn managed_disks(&self) -> impl Iterator<Item = &Disk> { | ||
| self.disks.iter().filter_map(|disk_state| match &disk_state.state { | ||
| DiskState::Managed(disk) => Some(disk), | ||
| DiskState::FailedToManage(_) => None, | ||
| }) | ||
| } |
Contributor
Author
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return an error if we hit the FailedToManage state anywhere, or — as is done here — silently omit any such disk?
andrewjstone
added a commit
that referenced
this pull request
Jan 31, 2026
This commit adds a 3 phase mechanism for sled expungement. The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to reboot the sled after polling for commit the trust quorum removal. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before. The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that gates the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving! The third phase is what informs reconfigurator that the sled is gone and remains the same except for an extra sanity check that that the last committed trust quorum configuration does not contain the sled that is to be expunged. The removed sled may be added back to this rack or another after being clean slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2 and it worked. This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it in until the key-rotation work from #9737 is merged in. This also builds on #9741 and should merge after that PR.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When Trust Quorum commits a new epoch, all encrypted U.2 datasets need their encryption keys rotated. This change implements that flow:
trust-quorum: Add watch channel to broadcast committed epoch changes fromNodeTaskto subscriberssled-agent: Wirecommitted_epoch_rxto the config reconcilerconfig-reconciler:KeyRotationError,RekeyRequesttypes for the rekey APIrekey_datasetsbatch operation onDatasetTaskHandledatasets_rekeytoDatasetTaskin the ZFS operation serializer task for key rotationrekey_for_epochtoOmicronDatasetsto coordinate rekeying all managed disks when an epoch is committedmanaged_disksiterator toExternalDisksillumos-utils:Zfs::change_keyusingzfs-atomic-change-keycrate (temporarily) to rotate keys atomically with the change of theoxide:epochpropertyChangeKeyErrortypeepochfield toDatasetPropertiesand includeoxide:epochin ZFS property querieskey-manager: Add Debug derives to key typesThe rekey operation is idempotent: datasets already at the target epoch are skipped. On startup, we process the initial epoch to catch any missed rekeys from crashes.
Fixes #9587