[1.14] Implement virtio discard and write zeroes#13
Open
kalyazin wants to merge 19 commits intoe2b-dev:firecracker-v1.14-direct-memfrom
Open
[1.14] Implement virtio discard and write zeroes#13kalyazin wants to merge 19 commits intoe2b-dev:firecracker-v1.14-direct-memfrom
kalyazin wants to merge 19 commits intoe2b-dev:firecracker-v1.14-direct-memfrom
Conversation
The virtio-blk spec defines a 56-byte config space with fields for geometry, topology, writeback, and discard at fixed offsets. The previous struct only held `capacity` (8 bytes), so the driver never saw the fields beyond offset 8. The full layout is required to advertise VIRTIO_BLK_F_DISCARD: the spec mandates that `max_discard_sectors` (offset 40) and `max_discard_seg` (offset 44) be set before the feature can be used by the guest. Without the full struct, there is nowhere to write those values. All new fields default to zero, so there is no behavioral change in this commit. A compile-time size assertion (`assert!(size_of::<ConfigSpace>() == 56)`) guards against future accidental padding changes. Tests are updated to use `..Default::default()` for forward-compatible ConfigSpace init. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
The virtio-blk spec (section 5.2.6.14) defines VIRTIO_BLK_T_DISCARD (type 11) as a request the guest driver issues to free sectors it no longer needs. To handle it, the VMM needs three things introduced here: - `RequestType::Discard` so the dispatcher can route the request type. - `DiscardWriteZeroes` (16 bytes, ByteValued) matching the layout of a single discard segment that the guest places in the data descriptor. - `DISCARD_SEGMENT_SIZE` constant (16) used in parse() to validate that the data buffer length is a non-zero multiple of the segment size. - `IoErr::InvalidFlags` / `IoErr::InvalidOffset` for per-segment validation errors added in the processing step. Placeholder arms in `finish()` and `process()` keep the build green; they are replaced with the real implementation in the next commit. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add SyncIoError::Discard, SyncFileEngine::discard() and AsyncFileEngine::discard() (synchronous fallback — discard is not latency-critical and IORING_OP_FALLOCATE is not in the io_uring allowlist). Wire up FileEngine::discard() dispatching to both engines. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Fix 2-space indented getrandom entries (should be 4 spaces like all other entries) and remove trailing whitespace from madvise comment. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
VIRTIO_BLK_F_DISCARD issues fallocate(FALLOC_FL_PUNCH_HOLE) on the backing file. Add fallocate to the vmm thread seccomp filter on both x86_64 and aarch64 so the syscall is not blocked at runtime. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add parse() validation: discard data descriptor must be read-only and data_len must be a non-zero multiple of DISCARD_SEGMENT_SIZE (16 bytes). Add process_discard_segments(): validates flags==0, num_sectors>0, and sector bounds per segment, then calls FileEngine::discard() using fallocate(FALLOC_FL_PUNCH_HOLE). Returns VIRTIO_BLK_S_IOERR on any validation or syscall failure; VIRTIO_BLK_S_OK on success. Add discard_count metric incremented on each successful discard. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Populate max_discard_sectors and max_discard_seg in restore() to match what new() produces for the expanded ConfigSpace. No version bump needed — main already carried 9.0.0 → 10.0.0 in this release cycle. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Set VIRTIO_BLK_F_DISCARD in avail_features for all writable block devices. Populate max_discard_sectors (capped to disk size) and max_discard_seg=1 in the config space. Read-only devices advertise neither the feature nor non-zero discard config fields. This is a behavioral change for existing writable block device configurations: guests will now see and may negotiate the discard feature. See docs/api_requests/block-discard.md for host filesystem requirements and recommendations. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Explain how discard works, host requirements, limitations, and snapshot compatibility impact. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Test feature flag advertisement, config space discard fields, and end-to-end discard request processing (success and error paths). Update test_virtio_features and test_virtio_read_config to account for the new VIRTIO_BLK_F_DISCARD bit and config space fields. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests: - test_discard: boot VM with writable ext4 disk, write+delete data, run fstrim, verify rc=0 and discard_count metric increments. - test_discard_not_advertised_for_read_only: verify read-only devices report discard_max_bytes=0 in sysfs (feature not advertised). Both parametrized over Sync and Async IO engines. Also add discard_count to the fcmetrics block_metrics list. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Replace the 4-byte _unused1 padding at offset 52 with the virtio-blk write-zeroes fields: - max_write_zeroes_sectors (u32, offset 52) - max_write_zeroes_seg (u32, offset 56) - write_zeroes_may_unmap (u8, offset 60) - _unused1 (u8;3, offset 61, padding to 64) Bump the compile-time size assertion to 64. Fields default to zero; no behavioural change. They will be populated when VIRTIO_BLK_F_WRITE_ZEROES is wired up. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add the RequestType::WriteZeroes variant and From<u32> mapping for VIRTIO_BLK_T_WRITE_ZEROES. Add IoErr::WriteZeroesUnsupported and Status::WriteZeroesUnsupported (parallel to the discard equivalents) that return VIRTIO_BLK_S_UNSUPP silently when the host filesystem doesn't support the operation. Add a write_zeroes_unsupported EOPNOTSUPP cache on DiskProperties (not persisted). Add placeholder arms in PendingRequest::finish() and Request::process() so the code compiles. Both currently return WriteZeroesUnsupported; they will be replaced when the full feature is wired up. Update the test-only From<RequestType> for u32 and request_type_flags() helpers to handle the new variant. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add write_zeroes(offset, len, unmap, req) on FileEngine plus the corresponding SyncFileEngine::write_zeroes() and AsyncFileEngine::push_write_zeroes() implementations. The unmap flag selects the fallocate mode at call time: - unmap=true: FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE (mode 3) - unmap=false: FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE (mode 17) The async path uses io_uring's IORING_OP_FALLOCATE (already in the allowlist from the discard work). The sync path calls libc::fallocate. Add SyncIoError::WriteZeroes and extend BlockIoError::is_eopnotsupp() to also match SyncIoError::WriteZeroes so the EOPNOTSUPP fallback machinery (added later) catches both modes. No seccomp / io_uring restriction changes are required: fallocate is already in the seccomp filter and IORING_OP_FALLOCATE is already allowed via the discard work. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Wire up Request::parse() validation and Request::process() handling
for write-zeroes requests, mirroring the discard path:
- parse(): reject write-only data descriptors and require data_len
to be a non-zero multiple of DISCARD_SEGMENT_SIZE (the
DiscardWriteZeroes struct is shared between both ops).
- parse_write_zeroes_segment(): validate flags (only bit 0,
VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, is permitted), reject
num_sectors=0, check sector range, and return the unmap flag
so process() can pick the fallocate mode.
- process(): short-circuit to VIRTIO_BLK_S_UNSUPP if a previous
EOPNOTSUPP set disk.write_zeroes_unsupported; otherwise dispatch
to FileEngine::write_zeroes().
- the (Ok(_), RequestType::WriteZeroes) finish() arm now increments
the write_zeroes_count metric and returns VIRTIO_BLK_S_OK.
- the EOPNOTSUPP detection in the error path also handles
RequestType::WriteZeroes, setting the cache and returning
Status::WriteZeroesUnsupported (silent UNSUPP).
Add the write_zeroes_count metric and aggregate() line.
The feature flag is not yet advertised — that lands in the next
commit, after the request handling is fully in place.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
…vices Set VIRTIO_BLK_F_WRITE_ZEROES in avail_features for any non-read-only block device, alongside VIRTIO_BLK_F_DISCARD. Populate the max_write_zeroes_sectors, max_write_zeroes_seg, and write_zeroes_may_unmap config fields in both VirtioBlock::new() and the persist::restore() path so the values match what the guest sees on a fresh boot vs after a snapshot restore. write_zeroes_may_unmap=1 lets the guest set the UNMAP flag on individual segments, which we then translate to fallocate's PUNCH_HOLE mode (UNMAP=0 uses ZERO_RANGE). Update the test_virtio_features and test_virtio_read_config expectations to account for the new feature bit and config fields. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add three tests parametrised over Sync and Async file engines:
- test_write_zeroes_feature_and_config: VIRTIO_BLK_F_WRITE_ZEROES
advertised on writable devices; max_write_zeroes_sectors,
max_write_zeroes_seg, and write_zeroes_may_unmap populated as
expected; read-only devices advertise none of these.
- test_write_zeroes: end-to-end happy path (UNMAP=1 → PUNCH_HOLE,
metric incremented, VIRTIO_BLK_S_OK) plus four error cases
(reserved flag bit set, sector range out of bounds, num_sectors=0,
invalid data length).
- test_write_zeroes_unsupported_cached: pre-set the cache flag and
verify subsequent requests return VIRTIO_BLK_S_UNSUPP without
incrementing the success or invalid-request counters.
The UNMAP=0 (ZERO_RANGE) happy path is not asserted in the unit
tests because ZERO_RANGE support is filesystem-dependent (the host's
tmpfs in some kernels returns EOPNOTSUPP); end-to-end UNMAP=0
correctness is left to the integration tests running on ext4. The
unit tests cover wire-flag handling for UNMAP=0 via the
invalid-flags and zero-sectors cases.
Also extend test_request_type_from to assert the new
VIRTIO_BLK_T_WRITE_ZEROES → RequestType::WriteZeroes mapping.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests, parametrised over Sync and Async IO engines:
- test_write_zeroes: boot VM with a writable block device, write
random data to a 1 MiB region of /dev/vdb, run `blkdiscard -z`
(BLKZEROOUT ioctl), then verify:
* sysfs /queue/write_zeroes_max_bytes is non-zero (feature
negotiated)
* the region reads back as zeros after the operation
* no requests failed
- test_write_zeroes_not_advertised_for_read_only: read-only device
reports write_zeroes_max_bytes=0 in sysfs.
The test deliberately does not assert that the write_zeroes_count
metric incremented. The Linux kernel's blkdev_issue_zeroout() may
legitimately fall back to issuing plain zero-page writes
(REQ_OP_WRITE) instead of REQ_OP_WRITE_ZEROES even when the feature
is advertised, depending on internal heuristics. Both paths leave
the device reading as zeros, so a metric assertion would be flaky in
CI without indicating any actual bug. Direct WRITE_ZEROES
request-handling correctness is covered by the unit tests.
Use cmp -n /dev/vdb /dev/zero for the zero check (od collapses
repeated lines with *, which makes naive byte-comparison fragile).
Add write_zeroes_count to host_tools/fcmetrics.py so the metrics
schema validation passes.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add docs/api_requests/block-write-zeroes.md describing: - automatic advertisement on writable devices - UNMAP=0 → FALLOC_FL_ZERO_RANGE (zeros in place) - UNMAP=1 → FALLOC_FL_PUNCH_HOLE (zeros + deallocate) - host filesystem requirements - EOPNOTSUPP fallback (silent VIRTIO_BLK_S_UNSUPP, shared cache) - known limitations Remove the "write_zeroes is not supported" line from block-discard.md now that the feature is implemented. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
The feature bits are advertised by Firecracker for rw block devices via calling:
If the underlying host filesystem doesn't support those (EOPNOTSUPP returned), Firecracker caches the knowledge and does not attempt to run them again.
No new Firecracker API is introduced (no opt-in is required to make use of the feature).
Note: snapshot rebuild is required to make use of the feature as the feature bits are negotiated at the time the block device is initialised (usually on boot).
Reason
Reduce host memory consumed by the drive.