TQ: Support for ZFS Key Rotation #9737

plaidfinch · 2026-01-28T22:22:51Z

When Trust Quorum commits a new epoch, all encrypted U.2 datasets need their encryption keys rotated. This change implements that flow:

trust-quorum: Add watch channel to broadcast committed epoch changes from NodeTask to subscribers
sled-agent: Wire committed_epoch_rx to the config reconciler
config-reconciler:
- Listen for epoch change notifications in the reconciler run loop and rekey datasets when the epoch changes
- Add KeyRotationError, RekeyRequest types for the rekey API
- Add rekey_datasets batch operation on DatasetTaskHandle
- Add datasets_rekey to DatasetTask in the ZFS operation serializer task for key rotation
- Add rekey_for_epoch to OmicronDatasets to coordinate rekeying all managed disks when an epoch is committed
- Add managed_disks iterator to ExternalDisks
illumos-utils:
- Add Zfs::change_key using zfs-atomic-change-key crate (temporarily) to rotate keys atomically with the change of the oxide:epoch property
- Add ChangeKeyError type
- Add epoch field to DatasetProperties and include oxide:epoch in ZFS property queries
key-manager: Add Debug derives to key types

The rekey operation is idempotent: datasets already at the target epoch are skipped. On startup, we process the initial epoch to catch any missed rekeys from crashes.
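The skip-if-current behavior described above can be illustrated with a minimal, standalone sketch. This is not the code from this PR: `DatasetKeyState`, `RekeyOutcome`, and `rekey_to_epoch` are hypothetical names, the in-memory map stands in for the real ZFS datasets and their `oxide:epoch` property, and no actual key change is performed. It also assumes a dataset is skipped only when its recorded epoch equals the target, which is all the description states.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the `oxide:epoch` property stored on a dataset.
/// `None` models a dataset that has no epoch recorded yet.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DatasetKeyState {
    pub epoch: Option<u64>,
}

/// Which datasets were touched by one rekey pass (names only).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RekeyOutcome {
    pub rekeyed: Vec<String>,
    pub skipped: Vec<String>,
}

/// Bring every dataset to `target`, skipping those already there.
/// Running this twice with the same target changes nothing the second time,
/// which is what makes a replay after a crash or on startup safe.
pub fn rekey_to_epoch(
    datasets: &mut BTreeMap<String, DatasetKeyState>,
    target: u64,
) -> RekeyOutcome {
    let mut outcome = RekeyOutcome::default();
    for (name, state) in datasets.iter_mut() {
        if state.epoch == Some(target) {
            // Already at the target epoch: nothing to do.
            outcome.skipped.push(name.clone());
        } else {
            // In a real system the key change and the epoch property update
            // would need to happen atomically; here we only record the epoch.
            state.epoch = Some(target);
            outcome.rekeyed.push(name.clone());
        }
    }
    outcome
}
```

Because the check is made per dataset, a pass interrupted partway through can simply be rerun: datasets rekeyed before the interruption are skipped and the rest are picked up.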

Fixes #9587

key-manager/src/lib.rs

plaidfinch · 2026-01-28T22:27:23Z

sled-agent/config-reconciler/src/reconciler_task/external_disks.rs

+    pub(super) fn managed_disks(&self) -> impl Iterator<Item = &Disk> {
+        self.disks.iter().filter_map(|disk_state| match &disk_state.state {
+            DiskState::Managed(disk) => Some(disk),
+            DiskState::FailedToManage(_) => None,
+        })
+    }


Should we return an error if we hit the FailedToManage state anywhere, or — as is done here — silently omit any such disk?
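To make the two options in this question concrete, here is a standalone sketch that contrasts them. The `Disk` and `DiskState` types below are simplified, hypothetical stand-ins (the real `FailedToManage` payload is not shown in this diff, so a `String` reason is assumed), and `managed_strict` is an illustrative alternative, not something proposed in the PR.

```rust
/// Simplified, hypothetical stand-in for a managed disk.
#[derive(Debug, PartialEq)]
pub struct Disk {
    pub id: u32,
}

/// Simplified stand-in for the per-disk state; the error payload is assumed.
#[derive(Debug, PartialEq)]
pub enum DiskState {
    Managed(Disk),
    FailedToManage(String),
}

/// Lenient variant, mirroring the diff: failed disks are silently omitted.
pub fn managed_lenient<'a>(
    states: &'a [DiskState],
) -> impl Iterator<Item = &'a Disk> + 'a {
    states.iter().filter_map(|state| match state {
        DiskState::Managed(disk) => Some(disk),
        DiskState::FailedToManage(_) => None,
    })
}

/// Strict variant: the first failed disk aborts the whole collection and
/// its reason is returned to the caller.
pub fn managed_strict(states: &[DiskState]) -> Result<Vec<&Disk>, &str> {
    states
        .iter()
        .map(|state| match state {
            DiskState::Managed(disk) => Ok(disk),
            DiskState::FailedToManage(reason) => Err(reason.as_str()),
        })
        .collect()
}
```

The lenient form lets rekeying proceed on healthy disks while one disk is broken; the strict form surfaces the failure but blocks work on every disk until it is resolved.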

This commit adds a three-phase mechanism for sled expungement. The first phase removes the sled from the latest trust quorum configuration via omdb. The second phase reboots the sled after polling for the trust quorum removal to commit. The third phase issues the existing omdb expunge command, which changes the sled policy as before.

The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that gates the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone. It remains the same except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged.

The removed sled may be added back to this rack or another after being clean-slated. I tested this in a4x2 by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged. This also builds on #9741 and should merge after that PR.

plaidfinch added 3 commits January 28, 2026 17:15

fmt

1e2888e

Remove unused test code

eb41801

plaidfinch commented Jan 28, 2026

andrewjstone mentioned this pull request Jan 31, 2026

Trust Quorum Tracking #8262

Open

48 tasks

andrewjstone mentioned this pull request Jan 31, 2026

TQ: Support sled expunge via trust quorum pathway #9765

Draft
