diff --git a/docs/api_requests/block-discard.md b/docs/api_requests/block-discard.md new file mode 100644 index 00000000000..1d18ff98f7c --- /dev/null +++ b/docs/api_requests/block-discard.md @@ -0,0 +1,42 @@ +# Block device discard (TRIM) + +Firecracker supports the `VIRTIO_BLK_F_DISCARD` feature, which allows the guest +to issue discard (TRIM) requests to the block device. Discard requests tell the +host that a range of sectors is no longer needed, enabling the host to reclaim +space on sparse or thin-provisioned backing files. + +## How it works + +For all non-read-only block devices, Firecracker automatically advertises the +`VIRTIO_BLK_F_DISCARD` feature to the guest driver. No API configuration is +required — discard support is always-on for writable drives. + +When the guest driver issues a `VIRTIO_BLK_T_DISCARD` request, Firecracker calls +`fallocate(2)` with `FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE` on the backing +file for each discard segment. This punches a hole in the file, freeing the +underlying disk blocks without changing the file size. + +Guest tools that trigger discard include: + +- `fstrim -v /` — manually trim a mounted filesystem +- `discard` mount option — automatic discard on file deletion +- `blkdiscard /dev/vda` — discard the entire block device + +## Host requirements + +The backing file must reside on a filesystem and kernel that support +`FALLOC_FL_PUNCH_HOLE`. This is supported on ext4, xfs, btrfs, and tmpfs on +Linux 3.5+. On filesystems that do not support hole-punching, `fallocate` +returns `EOPNOTSUPP`. Firecracker detects this on the first discard, logs a +one-time warning, and replies to the guest with `VIRTIO_BLK_S_UNSUPP`. The Linux +virtio-blk driver propagates `VIRTIO_BLK_S_UNSUPP` through the block layer and +stops issuing further discard requests. Firecracker short-circuits any remaining +discard requests with `VIRTIO_BLK_S_UNSUPP` immediately — no additional +`fallocate` calls are made. 
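The hole-punching path described above can be exercised outside Firecracker. A minimal host-side sketch (assumptions: Linux, a filesystem that supports `FALLOC_FL_PUNCH_HOLE` such as ext4 or tmpfs; the file name and flow are illustrative, not part of Firecracker):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::os::unix::fs::MetadataExt;
use std::os::unix::io::AsRawFd;

// Raw binding to fallocate(2); mode values are from <linux/falloc.h>.
extern "C" {
    fn fallocate(fd: i32, mode: i32, offset: i64, len: i64) -> i32;
}
const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("discard-demo.img");
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(&path)?;
    f.write_all(&vec![0xAAu8; 1 << 20])?; // 1 MiB of non-zero data
    f.sync_all()?;
    let blocks_before = f.metadata()?.blocks();

    // Punch a 512 KiB hole at offset 0, as the device does per discard segment.
    let ret = unsafe {
        fallocate(
            f.as_raw_fd(),
            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            0,
            512 * 1024,
        )
    };
    assert_eq!(ret, 0, "fallocate: {}", std::io::Error::last_os_error());

    let meta = f.metadata()?;
    assert_eq!(meta.len(), 1 << 20); // KEEP_SIZE: file size unchanged
    assert!(meta.blocks() < blocks_before); // underlying blocks were freed
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Reading the hole back returns zeros, which is why a subsequent guest read of a discarded range is well-defined.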
+ +## Limitations + +- Discard is only available for non-read-only block devices. +- At most one discard segment per request is supported (`max_discard_seg = 1`). +- The discard segment flags field must be zero; non-zero flags are rejected with + an I/O error. diff --git a/docs/api_requests/block-write-zeroes.md b/docs/api_requests/block-write-zeroes.md new file mode 100644 index 00000000000..26dc4b5b100 --- /dev/null +++ b/docs/api_requests/block-write-zeroes.md @@ -0,0 +1,66 @@ +# Block device write-zeroes + +Firecracker supports the `VIRTIO_BLK_F_WRITE_ZEROES` feature, which allows the +guest to ask the device to zero a range of sectors without transferring a buffer +of zeros over the virtqueue. Common consumers are `mkfs` (clearing inode tables +and journals), filesystem snapshots, encrypted-volume initial wipe, and +`blkdiscard -z` / `blkzeroout` from userspace. + +## How it works + +For all non-read-only block devices, Firecracker automatically advertises the +`VIRTIO_BLK_F_WRITE_ZEROES` feature to the guest driver. No API configuration +is required — write-zeroes support is always-on for writable drives. + +Each `VIRTIO_BLK_T_WRITE_ZEROES` request carries a 16-byte segment with a +`flags` field. Bit 0 (`VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP`) tells the device +whether it may also deallocate the underlying backing-file blocks. Firecracker +advertises `write_zeroes_may_unmap=1`, so guests are free to set this flag. 
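The 16-byte segment and the UNMAP bit can be sketched concretely. Struct and function names below are illustrative (Firecracker's own types differ); the layout follows virtio 1.1's `struct virtio_blk_discard_write_zeroes`, and the `fallocate` mode constants are the standard `<linux/falloc.h>` values:

```rust
// Guest-visible segment layout shared by VIRTIO_BLK_T_DISCARD and
// VIRTIO_BLK_T_WRITE_ZEROES: 8 + 4 + 4 = 16 bytes, little-endian.
#[repr(C)]
struct DiscardWriteZeroesSeg {
    sector: u64,      // starting sector, in 512-byte units
    num_sectors: u32, // number of sectors to discard / zero
    flags: u32,       // bit 0 = VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP
}

const VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP: u32 = 1;

// fallocate(2) mode bits, from <linux/falloc.h>.
const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;
const FALLOC_FL_ZERO_RANGE: i32 = 0x10;

/// Map the guest's UNMAP bit to the fallocate mode the device would use.
fn write_zeroes_mode(flags: u32) -> i32 {
    if flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP != 0 {
        FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE // may deallocate
    } else {
        FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE // zero in place
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<DiscardWriteZeroesSeg>(), 16);
    assert_eq!(write_zeroes_mode(1), 0x03); // PUNCH_HOLE | KEEP_SIZE
    assert_eq!(write_zeroes_mode(0), 0x11); // ZERO_RANGE | KEEP_SIZE
}
```

`KEEP_SIZE` appears on both paths because neither operation may change the backing file's length.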
+ +Firecracker translates the guest's UNMAP bit into a `fallocate(2)` mode on the +backing file: + +| UNMAP | fallocate mode | Effect | +|-------|---------------------------------------------|---------------------------------------| +| 0 | `FALLOC_FL_ZERO_RANGE \| FALLOC_FL_KEEP_SIZE` | zeros in place, no deallocation | +| 1 | `FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE` | zeros + deallocate (sparse holes) | + +The virtio spec requires that when UNMAP is clear the device MUST NOT +deallocate sectors (so `ZERO_RANGE` is mandatory for that path); when UNMAP +is set, the device MAY deallocate, and `PUNCH_HOLE` reads as zeros on every +filesystem that supports it. + +## Host requirements + +The backing file must reside on a filesystem that supports the corresponding +`fallocate` mode: + +- `FALLOC_FL_PUNCH_HOLE` (UNMAP=1) is widely supported: ext4, xfs, btrfs, tmpfs. +- `FALLOC_FL_ZERO_RANGE` (UNMAP=0) is supported on ext4, xfs, btrfs; on tmpfs + it requires Linux 6.8+. Other filesystems may not support it. + +If `fallocate` returns `EOPNOTSUPP` for either mode, Firecracker logs a one-time +warning and replies with `VIRTIO_BLK_S_UNSUPP`. The Linux virtio-blk driver +propagates that status through the block layer and stops issuing further +write-zeroes requests, so subsequent guest writes fall back to plain +`REQ_OP_WRITE` traffic. Firecracker short-circuits any in-flight write-zeroes +requests with `VIRTIO_BLK_S_UNSUPP` for the rest of the device's lifetime — no +additional `fallocate` calls are made. + +The EOPNOTSUPP cache is shared across UNMAP=0 and UNMAP=1 paths: a single +fallback flag disables both. This is conservative — a filesystem that +supports `PUNCH_HOLE` but not `ZERO_RANGE` will see UNMAP=1 requests rejected +once an UNMAP=0 request fails — but it matches the discard fallback design +and avoids subtle host-side state. + +## Limitations + +- Write-zeroes is only available for non-read-only block devices. 
+- At most one segment per request is supported (`max_write_zeroes_seg = 1`). +- Only bit 0 (UNMAP) of the segment flags is allowed; non-zero reserved bits + are rejected with an I/O error. +- `EOPNOTSUPP` errors from the io_uring async engine are surfaced as + `VIRTIO_BLK_S_IOERR`, not silent `VIRTIO_BLK_S_UNSUPP` (the cache only + triggers for the synchronous engine, which is where the EOPNOTSUPP + detection currently lives). For async-engine devices, configure the + backing filesystem to support both modes. diff --git a/resources/seccomp/aarch64-unknown-linux-musl.json b/resources/seccomp/aarch64-unknown-linux-musl.json index 26dd661e46b..487c0346181 100644 --- a/resources/seccomp/aarch64-unknown-linux-musl.json +++ b/resources/seccomp/aarch64-unknown-linux-musl.json @@ -42,6 +42,10 @@ { "syscall": "fsync" }, + { + "syscall": "fallocate", + "comment": "Used by the block device for VIRTIO_BLK_F_DISCARD (FALLOC_FL_PUNCH_HOLE)" + }, { "syscall": "close" }, @@ -110,8 +114,8 @@ "comment": "sigaltstack is used by Rust stdlib to remove alternative signal stack during thread teardown." }, { - "syscall": "getrandom", - "comment": "getrandom is used by aws-lc library which we consume in virtio-rng" + "syscall": "getrandom", + "comment": "getrandom is used by aws-lc library which we consume in virtio-rng" }, { "syscall": "accept4", @@ -213,7 +217,7 @@ }, { "syscall": "madvise", - "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. They also call it with -1 to see if madvise is supported in certain platforms." + "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. 
They also call it with -1 to see if madvise is supported in certain platforms." }, { "syscall": "msync", @@ -544,8 +548,8 @@ "comment": "sigaltstack is used by Rust stdlib to remove alternative signal stack during thread teardown." }, { - "syscall": "getrandom", - "comment": "getrandom is used by `HttpServer` to reinialize `HashMap` after moving to the API thread" + "syscall": "getrandom", + "comment": "getrandom is used by `HttpServer` to reinialize `HashMap` after moving to the API thread" }, { "syscall": "accept4", diff --git a/resources/seccomp/x86_64-unknown-linux-musl.json b/resources/seccomp/x86_64-unknown-linux-musl.json index 1eb2d83e0f2..988967a4e22 100644 --- a/resources/seccomp/x86_64-unknown-linux-musl.json +++ b/resources/seccomp/x86_64-unknown-linux-musl.json @@ -45,6 +45,10 @@ { "syscall": "fsync" }, + { + "syscall": "fallocate", + "comment": "Used by the block device for VIRTIO_BLK_F_DISCARD (FALLOC_FL_PUNCH_HOLE)" + }, { "syscall": "close" }, @@ -559,8 +563,8 @@ "comment": "sigaltstack is used by Rust stdlib to remove alternative signal stack during thread teardown." 
}, { - "syscall": "getrandom", - "comment": "getrandom is used by `HttpServer` to reinialize `HashMap` after moving to the API thread" + "syscall": "getrandom", + "comment": "getrandom is used by `HttpServer` to reinialize `HashMap` after moving to the API thread" }, { "syscall": "accept4", diff --git a/src/vmm/src/devices/virtio/block/virtio/device.rs b/src/vmm/src/devices/virtio/block/virtio/device.rs index ecdd8ee4f6d..452b53b55ca 100644 --- a/src/vmm/src/devices/virtio/block/virtio/device.rs +++ b/src/vmm/src/devices/virtio/block/virtio/device.rs @@ -27,7 +27,8 @@ use crate::devices::virtio::block::CacheType; use crate::devices::virtio::block::virtio::metrics::{BlockDeviceMetrics, BlockMetricsPerDevice}; use crate::devices::virtio::device::{ActiveState, DeviceState, VirtioDevice}; use crate::devices::virtio::generated::virtio_blk::{ - VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_RO, VIRTIO_BLK_ID_BYTES, + VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_WRITE_ZEROES, + VIRTIO_BLK_ID_BYTES, }; use crate::devices::virtio::generated::virtio_config::VIRTIO_F_VERSION_1; use crate::devices::virtio::generated::virtio_ids::VIRTIO_ID_BLOCK; @@ -59,6 +60,11 @@ pub struct DiskProperties { pub file_engine: FileEngine, pub nsectors: u64, pub image_id: [u8; VIRTIO_BLK_ID_BYTES as usize], + /// Set on first EOPNOTSUPP from fallocate; subsequent discards are no-ops. Not persisted. + pub discard_unsupported: bool, + /// Set on first EOPNOTSUPP from fallocate during write-zeroes; subsequent + /// write-zeroes requests are no-ops. Not persisted. 
+ pub write_zeroes_unsupported: bool, } impl DiskProperties { @@ -106,6 +112,8 @@ impl DiskProperties { .map_err(VirtioBlockError::FileEngine)?, nsectors: disk_size >> SECTOR_SHIFT, image_id, + discard_unsupported: false, + write_zeroes_unsupported: false, }) } @@ -163,10 +171,30 @@ impl DiskProperties { #[derive(Debug, Default, Clone, Copy, Eq, PartialEq)] #[repr(C)] pub struct ConfigSpace { - pub capacity: u64, + pub capacity: u64, // offset 0 + pub size_max: u32, // offset 8 + pub seg_max: u32, // offset 12 + pub geometry_cylinders: u16, // offset 16 + pub geometry_heads: u8, // offset 18 + pub geometry_sectors: u8, // offset 19 + pub blk_size: u32, // offset 20 + pub topology_physical_block_exp: u8, // offset 24 + pub topology_alignment_offset: u8, // offset 25 + pub topology_min_io_size: u16, // offset 26 + pub topology_opt_io_size: u32, // offset 28 + pub writeback: u8, // offset 32 + pub(crate) _unused0: [u8; 3], // offset 33 + pub num_queues: u32, // offset 36 + pub max_discard_sectors: u32, // offset 40 + pub max_discard_seg: u32, // offset 44 + pub discard_sector_alignment: u32, // offset 48 + pub max_write_zeroes_sectors: u32, // offset 52 + pub max_write_zeroes_seg: u32, // offset 56 + pub write_zeroes_may_unmap: u8, // offset 60 + pub(crate) _unused1: [u8; 3], // offset 61 (padding to 64) } - -// SAFETY: `ConfigSpace` contains only PODs in `repr(C)` or `repr(transparent)`, without padding. +const _: () = assert!(std::mem::size_of::<ConfigSpace>() == 64); +// SAFETY: repr(C), all POD fields, explicit padding — no implicit padding bytes. unsafe impl ByteValued for ConfigSpace {} /// Use this structure to set up the Block Device before booting the kernel. 
@@ -306,14 +334,32 @@ impl VirtioBlock { if config.is_read_only { avail_features |= 1u64 << VIRTIO_BLK_F_RO; - }; + } else { + avail_features |= 1u64 << VIRTIO_BLK_F_DISCARD; + avail_features |= 1u64 << VIRTIO_BLK_F_WRITE_ZEROES; + } let queue_evts = [EventFd::new(libc::EFD_NONBLOCK).map_err(VirtioBlockError::EventFd)?]; let queues = BLOCK_QUEUE_SIZES.iter().map(|&s| Queue::new(s)).collect(); + let discard_sectors = u32::try_from(disk_properties.nsectors).unwrap_or(u32::MAX); let config_space = ConfigSpace { capacity: disk_properties.nsectors.to_le(), + max_discard_sectors: if !config.is_read_only { + discard_sectors + } else { + 0 + }, + max_discard_seg: if !config.is_read_only { 1 } else { 0 }, + max_write_zeroes_sectors: if !config.is_read_only { + discard_sectors + } else { + 0 + }, + max_write_zeroes_seg: if !config.is_read_only { 1 } else { 0 }, + write_zeroes_may_unmap: if !config.is_read_only { 1 } else { 0 }, + ..Default::default() }; Ok(VirtioBlock { @@ -483,17 +529,42 @@ impl VirtioBlock { } Ok(None) => break, Ok(Some(cqe)) => { - let res = cqe.result(); - let user_data = cqe.user_data(); - - let (pending, res) = match res { - Ok(count) => (user_data, Ok(count)), - Err(error) => ( - user_data, - Err(IoErr::FileEngine(block_io::BlockIoError::Async( - async_io::AsyncIoError::IO(error), - ))), - ), + let cqe_result = cqe.result(); + let pending = cqe.user_data(); + + // io_uring CQE errors use negated errno, so EOPNOTSUPP is -95. 
+ let is_eopnotsupp = matches!( + &cqe_result, + Err(e) if e.raw_os_error() == Some(-libc::EOPNOTSUPP) + ); + if is_eopnotsupp && pending.request_type() == RequestType::Discard { + if !self.disk.discard_unsupported { + warn!( + "Block discard not supported by host filesystem; disabling discard" + ); + self.disk.discard_unsupported = true; + } + let finished = pending.finish( + &active_state.mem, + Err(IoErr::DiscardUnsupported), + &self.metrics, + ); + queue + .add_used(finished.desc_idx, finished.num_bytes_to_mem) + .unwrap_or_else(|err| { + error!( + "Failed to add available descriptor head {}: {}", + finished.desc_idx, err + ) + }); + continue; + } + + let res = match cqe_result { + Ok(count) => Ok(count), + Err(error) => Err(IoErr::FileEngine(block_io::BlockIoError::Async( + async_io::AsyncIoError::IO(error), + ))), }; let finished = pending.finish(&active_state.mem, res, &self.metrics); queue @@ -792,7 +863,12 @@ mod tests { assert_eq!(block.device_type(), VIRTIO_ID_BLOCK); - let features: u64 = (1u64 << VIRTIO_F_VERSION_1) | (1u64 << VIRTIO_RING_F_EVENT_IDX); + // default_block is non-read-only, so VIRTIO_BLK_F_DISCARD and + // VIRTIO_BLK_F_WRITE_ZEROES are advertised. + let features: u64 = (1u64 << VIRTIO_F_VERSION_1) + | (1u64 << VIRTIO_RING_F_EVENT_IDX) + | (1u64 << VIRTIO_BLK_F_DISCARD) + | (1u64 << VIRTIO_BLK_F_WRITE_ZEROES); assert_eq!( block.avail_features_by_page(0), @@ -818,14 +894,24 @@ mod tests { let mut actual_config_space = ConfigSpace::default(); block.read_config(0, actual_config_space.as_mut_slice()); - // This will read the number of sectors. // The block's backing file size is 0x1000, so there are 8 (4096/512) sectors. - // The config space is little endian. - let expected_config_space = ConfigSpace { capacity: 8 }; + // default_block is non-read-only, so discard and write-zeroes fields are populated. 
+ let expected_config_space = ConfigSpace { + capacity: 8, + max_discard_sectors: 8, + max_discard_seg: 1, + max_write_zeroes_sectors: 8, + max_write_zeroes_seg: 1, + write_zeroes_may_unmap: 1, + ..Default::default() + }; assert_eq!(actual_config_space, expected_config_space); // Invalid read. - let expected_config_space = ConfigSpace { capacity: 696969 }; + let expected_config_space = ConfigSpace { + capacity: 696969, + ..Default::default() + }; actual_config_space = expected_config_space; block.read_config( std::mem::size_of::<ConfigSpace>() as u64 + 1, @@ -842,7 +928,10 @@ for engine in [FileEngineType::Sync, FileEngineType::Async] { let mut block = default_block(engine); - let expected_config_space = ConfigSpace { capacity: 696969 }; + let expected_config_space = ConfigSpace { + capacity: 696969, + ..Default::default() + }; block.write_config(0, expected_config_space.as_slice()); let mut actual_config_space = ConfigSpace::default(); @@ -852,6 +941,7 @@ // If privileged user writes to `/dev/mem`, in block config space - byte by byte. let expected_config_space = ConfigSpace { capacity: 0x1122334455667788, + ..Default::default() }; let expected_config_space_slice = expected_config_space.as_slice(); for (i, b) in expected_config_space_slice.iter().enumerate() { @@ -863,6 +953,7 @@ // Invalid write. let new_config_space = ConfigSpace { capacity: 0xDEADBEEF, + ..Default::default() }; block.write_config(5, new_config_space.as_slice()); // Make sure nothing got written. @@ -1864,4 +1955,431 @@ assert_eq!(block.disk.image_id, id.as_slice()); } } + + #[test] + fn test_discard_feature_and_config() { + // Non-read-only block: VIRTIO_BLK_F_DISCARD set, discard config fields populated. 
+ for engine in [FileEngineType::Sync, FileEngineType::Async] { + let block = default_block(engine); + assert_ne!(block.avail_features & (1u64 << VIRTIO_BLK_F_DISCARD), 0); + assert_eq!(block.avail_features & (1u64 << VIRTIO_BLK_F_RO), 0); + // default_block has 0x1000-byte backing file → 8 sectors. + assert_eq!(block.config_space.max_discard_sectors, 8); + assert_eq!(block.config_space.max_discard_seg, 1); + } + + // Read-only block: VIRTIO_BLK_F_RO set, no discard. + let f = TempFile::new().unwrap(); + f.as_file().set_len(0x1000).unwrap(); + let config = VirtioBlockConfig { + drive_id: "test_ro".to_string(), + path_on_host: f.as_path().to_str().unwrap().to_string(), + is_root_device: false, + partuuid: None, + is_read_only: true, + cache_type: CacheType::Unsafe, + rate_limiter: None, + file_engine_type: FileEngineType::default(), + }; + let ro_block = VirtioBlock::new(config).unwrap(); + assert_eq!(ro_block.avail_features & (1u64 << VIRTIO_BLK_F_DISCARD), 0); + assert_ne!(ro_block.avail_features & (1u64 << VIRTIO_BLK_F_RO), 0); + assert_eq!(ro_block.config_space.max_discard_sectors, 0); + assert_eq!(ro_block.config_space.max_discard_seg, 0); + } + + #[test] + fn test_discard() { + for engine in [FileEngineType::Sync, FileEngineType::Async] { + let mut block = default_block(engine); + let mem = default_mem(); + let interrupt = default_interrupt(); + let vq = VirtQueue::new(GuestAddress(0), &mem, 16); + set_queue(&mut block, 0, vq.create_queue()); + block.activate(mem.clone(), interrupt).unwrap(); + read_blk_req_descriptors(&vq); + + let request_type_addr = GuestAddress(vq.dtable[0].addr.get()); + let data_addr = GuestAddress(vq.dtable[1].addr.get()); + let status_addr = GuestAddress(vq.dtable[2].addr.get()); + + // Discard data descriptor is read-only (guest sends the segment list). + vq.dtable[1].flags.set(VIRTQ_DESC_F_NEXT); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE); + + // Valid discard: sector 0, 4 sectors, flags=0. 
+ { + mem.write_obj::<u32>(VIRTIO_BLK_T_DISCARD, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + check_metric_after_block!( + &block.metrics.discard_count, + 1, + simulate_queue_and_async_completion_events(&mut block, true) + ); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().id, 0); + assert_eq!(vq.used.ring[0].get().len, 1); // status byte only + assert_eq!(mem.read_obj::<u32>(status_addr).unwrap(), VIRTIO_BLK_S_OK); + } + + // Invalid: non-zero flags rejected. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + mem.write_obj::<u32>(VIRTIO_BLK_T_DISCARD, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: 1, + }, + data_addr, + ) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_IOERR + ); + } + + // Invalid: sector range exceeds disk bounds. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + mem.write_obj::<u32>(VIRTIO_BLK_T_DISCARD, request_type_addr) + .unwrap(); + // 8-sector disk; sector 7 + 4 = 11 > 8. + mem.write_obj( + DiscardWriteZeroes { + sector: 7, + num_sectors: 4, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_IOERR + ); + } + + // Invalid data length: not a multiple of DISCARD_SEGMENT_SIZE → parse error. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE - 1); + mem.write_obj::<u32>(VIRTIO_BLK_T_DISCARD, request_type_addr) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + // Parse error → descriptor chain discarded (used len = 0). 
+ assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().len, 0); + } + } + } + + #[test] + fn test_discard_unsupported_cached() { + for engine in [FileEngineType::Sync, FileEngineType::Async] { + let mut block = default_block(engine); + let mem = default_mem(); + let interrupt = default_interrupt(); + let vq = VirtQueue::new(GuestAddress(0), &mem, 16); + set_queue(&mut block, 0, vq.create_queue()); + block.activate(mem.clone(), interrupt).unwrap(); + read_blk_req_descriptors(&vq); + + let request_type_addr = GuestAddress(vq.dtable[0].addr.get()); + let data_addr = GuestAddress(vq.dtable[1].addr.get()); + let status_addr = GuestAddress(vq.dtable[2].addr.get()); + + vq.dtable[1].flags.set(VIRTQ_DESC_F_NEXT); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE); + + mem.write_obj::<u32>(VIRTIO_BLK_T_DISCARD, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + // Pre-set the flag as if a previous EOPNOTSUPP was already detected. + block.disk.discard_unsupported = true; + + // Neither discard_count nor invalid_reqs_count should increment. + let discard_before = block.metrics.discard_count.count(); + let invalid_before = block.metrics.invalid_reqs_count.count(); + simulate_queue_and_async_completion_events(&mut block, true); + assert_eq!(block.metrics.discard_count.count(), discard_before); + assert_eq!(block.metrics.invalid_reqs_count.count(), invalid_before); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().len, 1); // status byte only + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_UNSUPP + ); + } + } + + #[test] + fn test_write_zeroes_feature_and_config() { + // Non-read-only block: VIRTIO_BLK_F_WRITE_ZEROES set, write-zeroes config + // fields populated. 
+ for engine in [FileEngineType::Sync, FileEngineType::Async] { + let block = default_block(engine); + assert_ne!( + block.avail_features & (1u64 << VIRTIO_BLK_F_WRITE_ZEROES), + 0 + ); + assert_eq!(block.avail_features & (1u64 << VIRTIO_BLK_F_RO), 0); + // default_block has 0x1000-byte backing file → 8 sectors. + assert_eq!(block.config_space.max_write_zeroes_sectors, 8); + assert_eq!(block.config_space.max_write_zeroes_seg, 1); + assert_eq!(block.config_space.write_zeroes_may_unmap, 1); + } + + // Read-only block: VIRTIO_BLK_F_RO set, no write-zeroes. + let f = TempFile::new().unwrap(); + f.as_file().set_len(0x1000).unwrap(); + let config = VirtioBlockConfig { + drive_id: "test_ro_wz".to_string(), + path_on_host: f.as_path().to_str().unwrap().to_string(), + is_root_device: false, + partuuid: None, + is_read_only: true, + cache_type: CacheType::Unsafe, + rate_limiter: None, + file_engine_type: FileEngineType::default(), + }; + let ro_block = VirtioBlock::new(config).unwrap(); + assert_eq!( + ro_block.avail_features & (1u64 << VIRTIO_BLK_F_WRITE_ZEROES), + 0 + ); + assert_ne!(ro_block.avail_features & (1u64 << VIRTIO_BLK_F_RO), 0); + assert_eq!(ro_block.config_space.max_write_zeroes_sectors, 0); + assert_eq!(ro_block.config_space.max_write_zeroes_seg, 0); + assert_eq!(ro_block.config_space.write_zeroes_may_unmap, 0); + } + + #[test] + fn test_write_zeroes() { + for engine in [FileEngineType::Sync, FileEngineType::Async] { + let mut block = default_block(engine); + let mem = default_mem(); + let interrupt = default_interrupt(); + let vq = VirtQueue::new(GuestAddress(0), &mem, 16); + set_queue(&mut block, 0, vq.create_queue()); + block.activate(mem.clone(), interrupt).unwrap(); + read_blk_req_descriptors(&vq); + + let request_type_addr = GuestAddress(vq.dtable[0].addr.get()); + let data_addr = GuestAddress(vq.dtable[1].addr.get()); + let status_addr = GuestAddress(vq.dtable[2].addr.get()); + + // Write-zeroes data descriptor is read-only (guest sends the segment 
list). + vq.dtable[1].flags.set(VIRTQ_DESC_F_NEXT); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE); + + // Valid write-zeroes with UNMAP=1 (PUNCH_HOLE — supported on tmpfs). + // The UNMAP=0 (ZERO_RANGE) happy path is not asserted here because + // ZERO_RANGE support is filesystem-dependent (e.g. older tmpfs returns + // EOPNOTSUPP); the wire-flag handling for UNMAP=0 is covered by the + // invalid-flag and zero-sectors cases below, and end-to-end correctness + // is left to the integration tests on ext4. + { + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, + }, + data_addr, + ) + .unwrap(); + + check_metric_after_block!( + &block.metrics.write_zeroes_count, + 1, + simulate_queue_and_async_completion_events(&mut block, true) + ); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().id, 0); + assert_eq!(vq.used.ring[0].get().len, 1); // status byte only + assert_eq!(mem.read_obj::<u32>(status_addr).unwrap(), VIRTIO_BLK_S_OK); + } + + // Invalid: reserved flag bit set (any bit other than UNMAP). + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: 0x2, // reserved bit + }, + data_addr, + ) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_IOERR + ); + } + + // Invalid: sector range exceeds disk bounds. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + // 8-sector disk; sector 7 + 4 = 11 > 8. 
+ mem.write_obj( + DiscardWriteZeroes { + sector: 7, + num_sectors: 4, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_IOERR + ); + } + + // Invalid: num_sectors=0. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 0, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_IOERR + ); + } + + // Invalid data length: not a multiple of DISCARD_SEGMENT_SIZE → parse error. + { + vq.used.idx.set(0); + set_queue(&mut block, 0, vq.create_queue()); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE - 1); + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + + simulate_queue_event(&mut block, Some(true)); + + // Parse error → descriptor chain discarded (used len = 0). 
+ assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().len, 0); + } + } + } + + #[test] + fn test_write_zeroes_unsupported_cached() { + for engine in [FileEngineType::Sync, FileEngineType::Async] { + let mut block = default_block(engine); + let mem = default_mem(); + let interrupt = default_interrupt(); + let vq = VirtQueue::new(GuestAddress(0), &mem, 16); + set_queue(&mut block, 0, vq.create_queue()); + block.activate(mem.clone(), interrupt).unwrap(); + read_blk_req_descriptors(&vq); + + let request_type_addr = GuestAddress(vq.dtable[0].addr.get()); + let data_addr = GuestAddress(vq.dtable[1].addr.get()); + let status_addr = GuestAddress(vq.dtable[2].addr.get()); + + vq.dtable[1].flags.set(VIRTQ_DESC_F_NEXT); + vq.dtable[1].len.set(DISCARD_SEGMENT_SIZE); + + mem.write_obj::<u32>(VIRTIO_BLK_T_WRITE_ZEROES, request_type_addr) + .unwrap(); + mem.write_obj( + DiscardWriteZeroes { + sector: 0, + num_sectors: 4, + flags: 0, + }, + data_addr, + ) + .unwrap(); + + // Pre-set the flag as if a previous EOPNOTSUPP was already detected. + block.disk.write_zeroes_unsupported = true; + + // Neither write_zeroes_count nor invalid_reqs_count should increment. 
+ let wz_before = block.metrics.write_zeroes_count.count(); + let invalid_before = block.metrics.invalid_reqs_count.count(); + simulate_queue_and_async_completion_events(&mut block, true); + assert_eq!(block.metrics.write_zeroes_count.count(), wz_before); + assert_eq!(block.metrics.invalid_reqs_count.count(), invalid_before); + + assert_eq!(vq.used.idx.get(), 1); + assert_eq!(vq.used.ring[0].get().len, 1); // status byte only + assert_eq!( + u32::from(mem.read_obj::<u8>(status_addr).unwrap()), + VIRTIO_BLK_S_UNSUPP + ); + } + } } diff --git a/src/vmm/src/devices/virtio/block/virtio/io/async_io.rs b/src/vmm/src/devices/virtio/block/virtio/io/async_io.rs index f83d9dea1df..089cee883a1 100644 --- a/src/vmm/src/devices/virtio/block/virtio/io/async_io.rs +++ b/src/vmm/src/devices/virtio/block/virtio/io/async_io.rs @@ -4,7 +4,7 @@ use std::fmt::Debug; use std::fs::File; use std::os::fd::RawFd; -use std::os::unix::io::AsRawFd; +use std::os::unix::io::AsRawFd as _; use vm_memory::GuestMemoryError; use vmm_sys_util::eventfd::EventFd; @@ -82,6 +82,7 @@ impl AsyncFileEngine { Restriction::AllowOpCode(OpCode::Read), Restriction::AllowOpCode(OpCode::Write), Restriction::AllowOpCode(OpCode::Fsync), + Restriction::AllowOpCode(OpCode::Fallocate), ], Some(completion_fd), ) @@ -119,6 +120,54 @@ impl AsyncFileEngine { &self.completion_evt } + pub fn push_discard( + &mut self, + offset: u64, + len: u64, + req: PendingRequest, + ) -> Result<(), RequestError> { + // FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE = 3 + const PUNCH_HOLE_MODE: u32 = 3; + let wrapped = WrappedRequest::new(req); + self.ring + .push(Operation::fallocate( + 0, + offset, + len, + PUNCH_HOLE_MODE, + wrapped, + )) + .map_err(|(io_uring_error, data)| RequestError { + req: data.req, + error: AsyncIoError::IoUring(io_uring_error), + }) + } + + pub fn push_write_zeroes( + &mut self, + offset: u64, + len: u64, + unmap: bool, + req: PendingRequest, + ) -> Result<(), RequestError> { + // FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_KEEP_SIZE = 3 + // FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE = 17 + const PUNCH_HOLE_MODE: u32 = 3; + const ZERO_RANGE_MODE: u32 = 17; + let mode = if unmap { + PUNCH_HOLE_MODE + } else { + ZERO_RANGE_MODE + }; + let wrapped = WrappedRequest::new(req); + self.ring + .push(Operation::fallocate(0, offset, len, mode, wrapped)) + .map_err(|(io_uring_error, data)| RequestError { + req: data.req, + error: AsyncIoError::IoUring(io_uring_error), + }) + } + pub fn push_read( + &mut self, + offset: u64, diff --git a/src/vmm/src/devices/virtio/block/virtio/io/mod.rs b/src/vmm/src/devices/virtio/block/virtio/io/mod.rs index b7aa8061d76..c87ea83cac9 100644 --- a/src/vmm/src/devices/virtio/block/virtio/io/mod.rs +++ b/src/vmm/src/devices/virtio/block/virtio/io/mod.rs @@ -40,6 +40,14 @@ impl BlockIoError { _ => false, } } + + pub fn is_eopnotsupp(&self) -> bool { + matches!( + self, + BlockIoError::Sync(SyncIoError::Discard(e) | SyncIoError::WriteZeroes(e)) + if e.raw_os_error() == Some(libc::EOPNOTSUPP) + ) + } } #[derive(Debug)] @@ -157,6 +165,55 @@ impl FileEngine { } } + pub fn discard( + &mut self, + offset: u64, + len: u64, + req: PendingRequest, + ) -> Result<FileEngineOk, RequestError> { + match self { + FileEngine::Async(engine) => match engine.push_discard(offset, len, req) { + Ok(()) => Ok(FileEngineOk::Submitted), + Err(err) => Err(RequestError { + req: err.req, + error: BlockIoError::Async(err.error), + }), + }, + FileEngine::Sync(engine) => match engine.discard(offset, len) { + Ok(()) => Ok(FileEngineOk::Executed(RequestOk { req, count: 0 })), + Err(err) => Err(RequestError { + req, + error: BlockIoError::Sync(err), + }), + }, + } + } + + pub fn write_zeroes( + &mut self, + offset: u64, + len: u64, + unmap: bool, + req: PendingRequest, + ) -> Result<FileEngineOk, RequestError> { + match self { + FileEngine::Async(engine) => match engine.push_write_zeroes(offset, len, unmap, req) { + Ok(()) => Ok(FileEngineOk::Submitted), + Err(err) => Err(RequestError { + req: err.req, + error: BlockIoError::Async(err.error), +
}), + }, + FileEngine::Sync(engine) => match engine.write_zeroes(offset, len, unmap) { + Ok(()) => Ok(FileEngineOk::Executed(RequestOk { req, count: 0 })), + Err(err) => Err(RequestError { + req, + error: BlockIoError::Sync(err), + }), + }, + } + } + pub fn drain(&mut self, discard: bool) -> Result<(), BlockIoError> { match self { FileEngine::Async(engine) => engine.drain(discard).map_err(BlockIoError::Async), diff --git a/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs b/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs index eec3b3d8b8d..929dd4e183d 100644 --- a/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs +++ b/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs @@ -3,6 +3,7 @@ use std::fs::File; use std::io::{Seek, SeekFrom, Write}; +use std::os::unix::io::AsRawFd; use vm_memory::{GuestMemoryError, ReadVolatile, WriteVolatile}; @@ -10,6 +11,8 @@ use crate::vstate::memory::{GuestAddress, GuestMemory, GuestMemoryMmap}; #[derive(Debug, thiserror::Error, displaydoc::Display)] pub enum SyncIoError { + /// Discard: {0} + Discard(std::io::Error), /// Flush: {0} Flush(std::io::Error), /// Seek: {0} @@ -18,6 +21,8 @@ pub enum SyncIoError { SyncAll(std::io::Error), /// Transfer: {0} Transfer(GuestMemoryError), + /// WriteZeroes: {0} + WriteZeroes(std::io::Error), } #[derive(Debug)] @@ -75,6 +80,52 @@ impl SyncFileEngine { Ok(count) } + pub fn discard(&mut self, offset: u64, len: u64) -> Result<(), SyncIoError> { + // SAFETY: self.file is a valid fd; offset and len are caller-validated. 
+ let ret = unsafe { + libc::fallocate( + self.file.as_raw_fd(), + libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE, + i64::try_from(offset).expect("discard offset fits in i64"), + i64::try_from(len).expect("discard length fits in i64"), + ) + }; + if ret < 0 { + Err(SyncIoError::Discard(std::io::Error::last_os_error())) + } else { + Ok(()) + } + } + + pub fn write_zeroes( + &mut self, + offset: u64, + len: u64, + unmap: bool, + ) -> Result<(), SyncIoError> { + // UNMAP=1 reuses PUNCH_HOLE (the spec lets the device deallocate); + // UNMAP=0 must zero in place without deallocating, so use ZERO_RANGE. + let mode = if unmap { + libc::FALLOC_FL_PUNCH_HOLE | libc::FALLOC_FL_KEEP_SIZE + } else { + libc::FALLOC_FL_ZERO_RANGE | libc::FALLOC_FL_KEEP_SIZE + }; + // SAFETY: self.file is a valid fd; offset and len are caller-validated. + let ret = unsafe { + libc::fallocate( + self.file.as_raw_fd(), + mode, + i64::try_from(offset).expect("write_zeroes offset fits in i64"), + i64::try_from(len).expect("write_zeroes length fits in i64"), + ) + }; + if ret < 0 { + Err(SyncIoError::WriteZeroes(std::io::Error::last_os_error())) + } else { + Ok(()) + } + } + pub fn flush(&mut self) -> Result<(), SyncIoError> { // flush() first to force any cached data out of rust buffers. self.file.flush().map_err(SyncIoError::Flush)?; diff --git a/src/vmm/src/devices/virtio/block/virtio/metrics.rs b/src/vmm/src/devices/virtio/block/virtio/metrics.rs index 0abe2a58963..ffd363ba559 100644 --- a/src/vmm/src/devices/virtio/block/virtio/metrics.rs +++ b/src/vmm/src/devices/virtio/block/virtio/metrics.rs @@ -184,6 +184,10 @@ pub struct BlockDeviceMetrics { pub io_engine_throttled_events: SharedIncMetric, /// Number of remaining requests in the queue. pub remaining_reqs_count: SharedIncMetric, + /// Number of successful discard (TRIM) operations. + pub discard_count: SharedIncMetric, + /// Number of successful write-zeroes operations. 
+ pub write_zeroes_count: SharedIncMetric, } impl BlockDeviceMetrics { @@ -230,6 +234,9 @@ impl BlockDeviceMetrics { .add(other.io_engine_throttled_events.fetch_diff()); self.remaining_reqs_count .add(other.remaining_reqs_count.fetch_diff()); + self.discard_count.add(other.discard_count.fetch_diff()); + self.write_zeroes_count + .add(other.write_zeroes_count.fetch_diff()); } } diff --git a/src/vmm/src/devices/virtio/block/virtio/persist.rs b/src/vmm/src/devices/virtio/block/virtio/persist.rs index 98f17c258ad..f93f146c52e 100644 --- a/src/vmm/src/devices/virtio/block/virtio/persist.rs +++ b/src/vmm/src/devices/virtio/block/virtio/persist.rs @@ -111,8 +111,15 @@ impl Persist<'_> for VirtioBlock { let avail_features = state.virtio_state.avail_features; let acked_features = state.virtio_state.acked_features; + let discard_sectors = u32::try_from(disk_properties.nsectors).unwrap_or(u32::MAX); let config_space = ConfigSpace { capacity: disk_properties.nsectors.to_le(), + max_discard_sectors: if !is_read_only { discard_sectors } else { 0 }, + max_discard_seg: if !is_read_only { 1 } else { 0 }, + max_write_zeroes_sectors: if !is_read_only { discard_sectors } else { 0 }, + max_write_zeroes_seg: if !is_read_only { 1 } else { 0 }, + write_zeroes_may_unmap: if !is_read_only { 1 } else { 0 }, + ..Default::default() }; Ok(VirtioBlock { diff --git a/src/vmm/src/devices/virtio/block/virtio/request.rs b/src/vmm/src/devices/virtio/block/virtio/request.rs index 8fc83cf43da..ee51ab6de1f 100644 --- a/src/vmm/src/devices/virtio/block/virtio/request.rs +++ b/src/vmm/src/devices/virtio/block/virtio/request.rs @@ -14,18 +14,42 @@ use crate::devices::virtio::block::virtio::device::DiskProperties; use crate::devices::virtio::block::virtio::metrics::BlockDeviceMetrics; pub use crate::devices::virtio::generated::virtio_blk::{ VIRTIO_BLK_ID_BYTES, VIRTIO_BLK_S_IOERR, VIRTIO_BLK_S_OK, VIRTIO_BLK_S_UNSUPP, - VIRTIO_BLK_T_FLUSH, VIRTIO_BLK_T_GET_ID, VIRTIO_BLK_T_IN, VIRTIO_BLK_T_OUT, + 
VIRTIO_BLK_T_DISCARD, VIRTIO_BLK_T_FLUSH, VIRTIO_BLK_T_GET_ID, VIRTIO_BLK_T_IN, + VIRTIO_BLK_T_OUT, VIRTIO_BLK_T_WRITE_ZEROES, VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, }; use crate::devices::virtio::queue::DescriptorChain; -use crate::logger::{IncMetric, error}; +use crate::logger::{IncMetric, error, warn}; use crate::rate_limiter::{RateLimiter, TokenType}; -use crate::vstate::memory::{ByteValued, Bytes, GuestAddress, GuestMemoryMmap}; +use crate::vstate::memory::{Address, ByteValued, Bytes, GuestAddress, GuestMemoryMmap}; + +/// One virtio-blk discard/write_zeroes segment — virtio spec §5.2.6.14. +#[derive(Debug, Default, Copy, Clone)] +#[repr(C)] +pub struct DiscardWriteZeroes { + pub sector: u64, + pub num_sectors: u32, + pub flags: u32, +} +// SAFETY: repr(C), all POD fields, no implicit padding. +unsafe impl ByteValued for DiscardWriteZeroes {} + +pub const DISCARD_SEGMENT_SIZE: u32 = 16; +const _: () = assert!(std::mem::size_of::() == DISCARD_SEGMENT_SIZE as usize); #[derive(Debug, derive_more::From)] pub enum IoErr { GetId(GuestMemoryError), - PartialTransfer { completed: u32, expected: u32 }, + PartialTransfer { + completed: u32, + expected: u32, + }, FileEngine(block_io::BlockIoError), + InvalidOffset, + InvalidFlags, + /// Discard not supported by the host filesystem; cached after first EOPNOTSUPP. + DiscardUnsupported, + /// Write zeroes not supported by the host filesystem; cached after first EOPNOTSUPP. 
+ WriteZeroesUnsupported, } #[derive(Clone, Copy, Debug, PartialEq, Eq)] @@ -34,6 +58,8 @@ pub enum RequestType { Out, Flush, GetDeviceID, + Discard, + WriteZeroes, Unsupported(u32), } @@ -44,6 +70,8 @@ impl From for RequestType { VIRTIO_BLK_T_OUT => RequestType::Out, VIRTIO_BLK_T_FLUSH => RequestType::Flush, VIRTIO_BLK_T_GET_ID => RequestType::GetDeviceID, + VIRTIO_BLK_T_DISCARD => RequestType::Discard, + VIRTIO_BLK_T_WRITE_ZEROES => RequestType::WriteZeroes, t => RequestType::Unsupported(t), } } @@ -64,9 +92,20 @@ pub struct FinishedRequest { #[derive(Debug)] enum Status { - Ok { num_bytes_to_mem: u32 }, - IoErr { num_bytes_to_mem: u32, err: IoErr }, - Unsupported { op: u32 }, + Ok { + num_bytes_to_mem: u32, + }, + IoErr { + num_bytes_to_mem: u32, + err: IoErr, + }, + Unsupported { + op: u32, + }, + /// Discard silently unsupported — returns VIRTIO_BLK_S_UNSUPP without logging or metrics. + DiscardUnsupported, + /// Write zeroes silently unsupported — returns VIRTIO_BLK_S_UNSUPP without logging or metrics. 
+ WriteZeroesUnsupported, } impl Status { @@ -98,6 +137,10 @@ pub struct PendingRequest { } impl PendingRequest { + pub fn request_type(&self) -> RequestType { + self.r#type + } + fn write_status_and_finish( self, status: &Status, @@ -124,6 +167,8 @@ impl PendingRequest { error!("Received unsupported virtio block request: {}", op); (0, u8::try_from(VIRTIO_BLK_S_UNSUPP).unwrap()) } + Status::DiscardUnsupported => (0, u8::try_from(VIRTIO_BLK_S_UNSUPP).unwrap()), + Status::WriteZeroesUnsupported => (0, u8::try_from(VIRTIO_BLK_S_UNSUPP).unwrap()), }; let num_bytes_to_mem = mem @@ -176,6 +221,20 @@ impl PendingRequest { (Ok(transferred_data_len), RequestType::GetDeviceID) => { Status::from_data(self.data_len, transferred_data_len, true) } + (Ok(_), RequestType::Discard) => { + block_metrics.discard_count.inc(); + Status::Ok { + num_bytes_to_mem: 0, + } + } + (Ok(_), RequestType::WriteZeroes) => { + block_metrics.write_zeroes_count.inc(); + Status::Ok { + num_bytes_to_mem: 0, + } + } + (Err(IoErr::DiscardUnsupported), _) => Status::DiscardUnsupported, + (Err(IoErr::WriteZeroesUnsupported), _) => Status::WriteZeroesUnsupported, (_, RequestType::Unsupported(op)) => Status::Unsupported { op }, (Err(err), _) => Status::IoErr { num_bytes_to_mem: 0, @@ -187,6 +246,55 @@ impl PendingRequest { } } +fn parse_discard_segment( + data_addr: GuestAddress, + nsectors: u64, + mem: &GuestMemoryMmap, +) -> Result<(u64, u64), IoErr> { + // max_discard_seg = 1 guarantees exactly one segment per request. 
+ let seg: DiscardWriteZeroes = mem.read_obj(data_addr).map_err(IoErr::GetId)?; + if seg.flags != 0 { + return Err(IoErr::InvalidFlags); + } + if seg.num_sectors == 0 { + return Err(IoErr::InvalidOffset); + } + seg.sector + .checked_add(u64::from(seg.num_sectors)) + .filter(|&top| top <= nsectors) + .ok_or(IoErr::InvalidOffset)?; + Ok(( + seg.sector << SECTOR_SHIFT, + u64::from(seg.num_sectors) << SECTOR_SHIFT, + )) +} + +fn parse_write_zeroes_segment( + data_addr: GuestAddress, + nsectors: u64, + mem: &GuestMemoryMmap, +) -> Result<(u64, u64, bool), IoErr> { + // max_write_zeroes_seg = 1 guarantees exactly one segment per request. + let seg: DiscardWriteZeroes = mem.read_obj(data_addr).map_err(IoErr::GetId)?; + // Only bit 0 (UNMAP) is defined; all other bits MUST be 0 per virtio spec + // §5.2.6.14. + if seg.flags & !VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP != 0 { + return Err(IoErr::InvalidFlags); + } + if seg.num_sectors == 0 { + return Err(IoErr::InvalidOffset); + } + seg.sector + .checked_add(u64::from(seg.num_sectors)) + .filter(|&top| top <= nsectors) + .ok_or(IoErr::InvalidOffset)?; + Ok(( + seg.sector << SECTOR_SHIFT, + u64::from(seg.num_sectors) << SECTOR_SHIFT, + seg.flags & VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP != 0, + )) +} + /// The request header represents the mandatory fields of each block device request. 
/// /// A request header contains the following fields: @@ -281,6 +389,12 @@ impl Request { if data_desc.is_write_only() && req.r#type == RequestType::Out { return Err(VirtioBlockError::UnexpectedWriteOnlyDescriptor); } + if data_desc.is_write_only() && req.r#type == RequestType::Discard { + return Err(VirtioBlockError::UnexpectedWriteOnlyDescriptor); + } + if data_desc.is_write_only() && req.r#type == RequestType::WriteZeroes { + return Err(VirtioBlockError::UnexpectedWriteOnlyDescriptor); + } if !data_desc.is_write_only() && req.r#type == RequestType::In { return Err(VirtioBlockError::UnexpectedReadOnlyDescriptor); } @@ -313,6 +427,16 @@ impl Request { return Err(VirtioBlockError::InvalidDataLength); } } + RequestType::Discard => { + if req.data_len == 0 || !req.data_len.is_multiple_of(DISCARD_SEGMENT_SIZE) { + return Err(VirtioBlockError::InvalidDataLength); + } + } + RequestType::WriteZeroes => { + if req.data_len == 0 || req.data_len % DISCARD_SEGMENT_SIZE != 0 { + return Err(VirtioBlockError::InvalidDataLength); + } + } _ => {} } @@ -390,6 +514,46 @@ impl Request { .map_err(IoErr::GetId); return ProcessingResult::Executed(pending.finish(mem, res, block_metrics)); } + RequestType::Discard => { + if disk.discard_unsupported { + return ProcessingResult::Executed(pending.finish( + mem, + Err(IoErr::DiscardUnsupported), + block_metrics, + )); + } + match parse_discard_segment(self.data_addr, disk.nsectors, mem) { + Err(io_err) => { + return ProcessingResult::Executed(pending.finish( + mem, + Err(io_err), + block_metrics, + )); + } + Ok((offset, len)) => disk.file_engine.discard(offset, len, pending), + } + } + RequestType::WriteZeroes => { + if disk.write_zeroes_unsupported { + return ProcessingResult::Executed(pending.finish( + mem, + Err(IoErr::WriteZeroesUnsupported), + block_metrics, + )); + } + match parse_write_zeroes_segment(self.data_addr, disk.nsectors, mem) { + Err(io_err) => { + return ProcessingResult::Executed(pending.finish( + mem, + Err(io_err), + 
block_metrics, + )); + } + Ok((offset, len, unmap)) => { + disk.file_engine.write_zeroes(offset, len, unmap, pending) + } + } + } RequestType::Unsupported(_) => { return ProcessingResult::Executed(pending.finish(mem, Ok(0), block_metrics)); } @@ -403,6 +567,33 @@ impl Request { Err(err) => { if err.error.is_throttling_err() { ProcessingResult::Throttled + } else if err.error.is_eopnotsupp() + && err.req.request_type() == RequestType::Discard + { + if !disk.discard_unsupported { + warn!("Block discard not supported by host filesystem; disabling discard"); + disk.discard_unsupported = true; + } + ProcessingResult::Executed(err.req.finish( + mem, + Err(IoErr::DiscardUnsupported), + block_metrics, + )) + } else if err.error.is_eopnotsupp() + && err.req.request_type() == RequestType::WriteZeroes + { + if !disk.write_zeroes_unsupported { + warn!( + "Block write_zeroes not supported by host filesystem; disabling \ + write_zeroes" + ); + disk.write_zeroes_unsupported = true; + } + ProcessingResult::Executed(err.req.finish( + mem, + Err(IoErr::WriteZeroesUnsupported), + block_metrics, + )) } else { ProcessingResult::Executed(err.req.finish( mem, @@ -475,6 +666,10 @@ mod tests { RequestType::from(VIRTIO_BLK_T_GET_ID), RequestType::GetDeviceID ); + assert_eq!( + RequestType::from(VIRTIO_BLK_T_WRITE_ZEROES), + RequestType::WriteZeroes + ); assert_eq!(RequestType::from(42), RequestType::Unsupported(42)); } @@ -731,6 +926,8 @@ mod tests { RequestType::Out => VIRTIO_BLK_T_OUT, RequestType::Flush => VIRTIO_BLK_T_FLUSH, RequestType::GetDeviceID => VIRTIO_BLK_T_GET_ID, + RequestType::Discard => VIRTIO_BLK_T_DISCARD, + RequestType::WriteZeroes => VIRTIO_BLK_T_WRITE_ZEROES, RequestType::Unsupported(id) => id, } } @@ -743,6 +940,8 @@ mod tests { RequestType::Out => VIRTQ_DESC_F_NEXT, RequestType::Flush => VIRTQ_DESC_F_NEXT, RequestType::GetDeviceID => VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE, + RequestType::Discard => VIRTQ_DESC_F_NEXT, + RequestType::WriteZeroes => VIRTQ_DESC_F_NEXT, 
RequestType::Unsupported(_) => VIRTQ_DESC_F_NEXT, } } diff --git a/src/vmm/src/io_uring/operation/mod.rs b/src/vmm/src/io_uring/operation/mod.rs index ead96df5ba5..48f2a89c208 100644 --- a/src/vmm/src/io_uring/operation/mod.rs +++ b/src/vmm/src/io_uring/operation/mod.rs @@ -29,6 +29,8 @@ pub enum OpCode { Write = io_uring_op::IORING_OP_WRITE as u8, /// Fsync operation. Fsync = io_uring_op::IORING_OP_FSYNC as u8, + /// Fallocate operation. + Fallocate = io_uring_op::IORING_OP_FALLOCATE as u8, } // Useful for outputting errors. @@ -38,6 +40,7 @@ impl From for &'static str { OpCode::Read => "read", OpCode::Write => "write", OpCode::Fsync => "fsync", + OpCode::Fallocate => "fallocate", } } } @@ -100,6 +103,26 @@ impl Operation { } } + /// Construct a fallocate operation. + /// + /// For IORING_OP_FALLOCATE the kernel reads the length from sqe->addr (u64) + /// and the mode flags from sqe->len (u32), with the file offset in sqe->off. + /// The existing `addr` and `len` struct fields map to exactly those SQE slots, + /// so no changes to `into_sqe` are needed. + // Firecracker targets 64-bit only; usize == u64 so this cast never truncates. + #[allow(clippy::cast_possible_truncation)] + pub fn fallocate(fd: FixedFd, offset: u64, len: u64, mode: u32, user_data: T) -> Self { + Self { + fd, + opcode: OpCode::Fallocate, + addr: Some(len as usize), // sqe->addr = fallocate length (u64) + len: Some(mode), // sqe->len = fallocate mode flags + flags: 0, + offset: Some(offset), + user_data, + } + } + /// Construct a fsync operation. 
pub fn fsync(fd: FixedFd, user_data: T) -> Self { Self { diff --git a/tests/host_tools/fcmetrics.py b/tests/host_tools/fcmetrics.py index e4f6f7e9813..04a8765e07b 100644 --- a/tests/host_tools/fcmetrics.py +++ b/tests/host_tools/fcmetrics.py @@ -95,6 +95,8 @@ def validate_fc_metrics(metrics): "rate_limiter_throttled_events", "io_engine_throttled_events", "remaining_reqs_count", + "discard_count", + "write_zeroes_count", {"read_agg": latency_agg_metrics_fields}, {"write_agg": latency_agg_metrics_fields}, ] diff --git a/tests/integration_tests/functional/test_drive_virtio.py b/tests/integration_tests/functional/test_drive_virtio.py index 9c61ead56a9..25d8f2cd421 100644 --- a/tests/integration_tests/functional/test_drive_virtio.py +++ b/tests/integration_tests/functional/test_drive_virtio.py @@ -383,3 +383,150 @@ def _check_mount(ssh_connection, dev_path): assert stderr == "" _, _, stderr = ssh_connection.run("umount /tmp", timeout=30.0) assert stderr == "" + + +def _mount_disk(ssh): + ssh.check_output("mkdir -p /tmp/mnt") + ssh.check_output("mount /dev/vdb /tmp/mnt") + + +def _fill_and_trim(ssh): + ssh.check_output("dd if=/dev/zero of=/tmp/mnt/fill bs=1M count=64 conv=fsync") + ssh.check_output("rm /tmp/mnt/fill && sync") + _, stdout, _ = ssh.check_output("fstrim -v /tmp/mnt") + assert "0 B" not in stdout, f"fstrim reported no bytes trimmed: {stdout}" + + +def test_discard(uvm_plain_any, microvm_factory, io_engine): + """ + Verify VIRTIO_BLK_F_DISCARD on a fresh boot and after snapshot/restore. 
+ """ + vm = uvm_plain_any + vm.spawn() + vm.basic_config() + vm.add_net_iface() + fs = drive_tools.FilesystemFile(os.path.join(vm.fsfiles, "discard_test"), size=256) + vm.add_drive("discard_disk", fs.path, is_read_only=False, io_engine=io_engine) + vm.start() + + _mount_disk(vm.ssh) + _fill_and_trim(vm.ssh) + st = os.stat(fs.path) + assert st.st_blocks * 512 < st.st_size, "backing file has no holes after trim" + + snapshot = vm.snapshot_full() + vm = microvm_factory.build_from_snapshot(snapshot) + # Disk is still mounted in the restored guest; write+trim again. + _fill_and_trim(vm.ssh) + st = os.stat(fs.path) + assert st.st_blocks * 512 < st.st_size, "backing file has no holes after trim post-restore" + + metrics = vm.flush_metrics() + assert metrics["block"]["discard_count"] > 0 + + +def test_discard_not_advertised_for_read_only(uvm_plain_any, io_engine): + """ + Verify VIRTIO_BLK_F_DISCARD is NOT advertised for read-only block devices. + + The kernel exposes discard support via /sys/block//queue/discard_max_bytes. + A value of 0 means the device does not support discard. + """ + vm = uvm_plain_any + vm.spawn() + vm.basic_config() + vm.add_net_iface() + + fs = drive_tools.FilesystemFile(os.path.join(vm.fsfiles, "ro_disk"), size=64) + vm.add_drive("ro_disk", fs.path, is_read_only=True, io_engine=io_engine) + vm.start() + + _, stdout, _ = vm.ssh.check_output("cat /sys/block/vdb/queue/discard_max_bytes") + assert ( + stdout.strip() == "0" + ), f"Expected discard_max_bytes=0 for read-only device, got: {stdout.strip()}" + + +def _exercise_write_zeroes(ssh): + """Write random data, issue blkdiscard -z, verify zeros on /dev/vdb.""" + # Sysfs check: the kernel populates write_zeroes_max_bytes from the + # negotiated feature; a non-zero value proves the feature is advertised. 
+ _, stdout, _ = ssh.check_output( + "cat /sys/block/vdb/queue/write_zeroes_max_bytes" + ) + assert int(stdout.strip()) > 0, ( + f"Expected non-zero write_zeroes_max_bytes, got: {stdout.strip()}" + ) + # Write random non-zero data so we can tell zeroing apart from + # "the device was already zero". + ssh.check_output("dd if=/dev/urandom of=/dev/vdb bs=1M count=1 conv=fsync") + ssh.check_output("sync && echo 3 > /proc/sys/vm/drop_caches") + # Issue zero-out via blkdiscard -z (BLKZEROOUT ioctl). + ssh.check_output("blkdiscard -z --offset 0 --length $((1024*1024)) /dev/vdb") + ssh.check_output("sync && echo 3 > /proc/sys/vm/drop_caches") + # Verify the range now reads as zeros. + ssh.check_output("cmp -n 1048576 /dev/vdb /dev/zero") + + +def test_write_zeroes(uvm_plain_any, microvm_factory, io_engine): + """ + Verify VIRTIO_BLK_F_WRITE_ZEROES on a fresh boot and after snapshot/restore. + + Writes random data to a 1 MiB region of /dev/vdb, then issues + `blkdiscard -z` (BLKZEROOUT ioctl), and asserts that: + - sysfs /queue/write_zeroes_max_bytes is non-zero (feature negotiated) + - the region reads back as zeros after the operation + The same workload is then run against a snapshot-restored VM, which + catches `persist::restore()` populating the write-zeroes ConfigSpace + fields wrong (the restored guest would otherwise see a zero + write_zeroes_max_bytes and the workload would silently regress). + + Note: we deliberately do NOT assert that the `write_zeroes_count` metric + increased. The Linux kernel's `blkdev_issue_zeroout()` may issue either + `REQ_OP_WRITE_ZEROES` (counted) or fall back to plain zero-page writes + (counted as `write_count`) depending on internal heuristics; both + result in the device contents reading as zeros. Direct WRITE_ZEROES + request handling is covered by the unit tests. 
+ """ + vm = uvm_plain_any + vm.spawn() + vm.basic_config() + vm.add_net_iface() + fs = drive_tools.FilesystemFile(os.path.join(vm.fsfiles, "wz_test"), size=64) + vm.add_drive("wz_disk", fs.path, is_read_only=False, io_engine=io_engine) + vm.start() + + _exercise_write_zeroes(vm.ssh) + + snapshot = vm.snapshot_full() + vm = microvm_factory.build_from_snapshot(snapshot) + _exercise_write_zeroes(vm.ssh) + + metrics = vm.flush_metrics() + assert metrics["block"]["execute_fails"] == 0 + + +def test_write_zeroes_not_advertised_for_read_only(uvm_plain_any, io_engine): + """ + Verify VIRTIO_BLK_F_WRITE_ZEROES is NOT advertised for read-only devices. + + The kernel exposes the negotiated feature via + /sys/block//queue/write_zeroes_max_bytes; 0 means not supported. + """ + vm = uvm_plain_any + vm.spawn() + vm.basic_config() + vm.add_net_iface() + + fs = drive_tools.FilesystemFile(os.path.join(vm.fsfiles, "ro_wz_disk"), size=64) + vm.add_drive("ro_wz_disk", fs.path, is_read_only=True, io_engine=io_engine) + vm.start() + + _, stdout, _ = vm.ssh.check_output( + "cat /sys/block/vdb/queue/write_zeroes_max_bytes" + ) + assert stdout.strip() == "0", ( + f"Expected write_zeroes_max_bytes=0 for read-only device, got: {stdout.strip()}" + ) + +