Add option to skip zero pages during VM migration#112

Closed
arctic-alpaca wants to merge 4 commits into cyberus-technology:gardenlinux from arctic-alpaca:omit-zero-pages-pr

Conversation


@arctic-alpaca arctic-alpaca commented Mar 18, 2026

A VM may have previously unused memory that is still zeroed (or memory zeroed by the guest, though that is less likely). This memory doesn't need to be transferred during a migration, as the migration destination provides zeroed memory to the VM anyway. This PR adds an option to skip zero pages during migration.
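For context, detecting such pages boils down to checking whether every byte of a page-sized region is zero. A minimal sketch (on a plain byte slice; the PR itself operates on guest memory via `MemoryRangeTable`, and `is_zero_page` is a hypothetical helper, not code from this PR):

```rust
/// Returns true if the given page-sized buffer contains only zero bytes.
/// Hypothetical illustration of the zero-page check this PR performs.
fn is_zero_page(page: &[u8]) -> bool {
    // Comparing 8 bytes at a time gives the compiler room to vectorize.
    page.chunks_exact(8)
        .all(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()) == 0)
        && page.chunks_exact(8).remainder().iter().all(|&b| b == 0)
}

fn main() {
    let zero = vec![0u8; 4096];
    let mut dirty = vec![0u8; 4096];
    dirty[1234] = 1;
    assert!(is_zero_page(&zero));
    assert!(!is_zero_page(&dirty));
    println!("ok");
}
```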

I'll leave this PR as a draft until benchmarking is done. Benchmarking is currently blocked by a bug in the timeout logic; results will be added as soon as possible. Comments are already welcome.

Benchmarking was done with this branch, which is based on #111.

All benchmarks were done between livemig-dellemc-2tb-1 and livemig-dellemc-2tb-2 and run only once per setup.

| Setup | Migration time | % of no skip |
| --- | --- | --- |
| 32 GiB, 4 vCPU, no memtouch, no skip, 1 connection | 47083 ms | 100% |
| 32 GiB, 4 vCPU, no memtouch, with skip, 1 connection | 2245 ms | 5% |
| 1 TiB, 32 vCPU, no memtouch, no skip, 8 connections | 90250 ms | 100% |
| 1 TiB, 32 vCPU, no memtouch, with skip, 8 connections | 44176 ms | 49% |
| 32 GiB, 4 vCPU, with memtouch, no skip, 1 connection | 47634 ms | 100% |
| 32 GiB, 4 vCPU, with memtouch, with skip, 1 connection | 50437 ms | 106% |
| 1 TiB, 32 vCPU, with memtouch, no skip, 8 connections | 101556 ms | 100% |
| 1 TiB, 32 vCPU, with memtouch, with skip, 8 connections | 179236 ms | 176% |

In the "32 GiB, 4 vCPU, with memtouch, with skip, 1 connection" and "1 TiB, 32 vCPU, with memtouch, with skip, 8 connections" cases, memtouch got OOM-killed.

Benchmark commands
  • 32GiB, 4 vCPU, no memtouch, no skip, 1 connection

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=32G --cpus boot=4
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000
  • 32GiB, 4vCPU, no memtouch, with skip, 1 connection

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=32G --cpus boot=4
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --skip-zero-pages
  • 1TiB, 32vCPU, no memtouch, no skip, 8 connections

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=1024G --cpus boot=32
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --connections 8
  • 1TiB, 32vCPU, no memtouch, with skip, 8 connections

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=1024G --cpus boot=32
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --connections 8 --skip-zero-pages
  • 32GiB, 4 vCPU, with memtouch, no skip, 1 connection

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=32G --cpus boot=4
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000
    > memtouch --rw_ratio 100 --thread_mem 8128 --num_threads 4 --once
  • 32GiB, 4 vCPU, with memtouch, with skip, 1 connection

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=32G --cpus boot=4
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --skip-zero-pages
    > memtouch --rw_ratio 100 --thread_mem 8128 --num_threads 4 --once
  • 1TiB, 32vCPU, with memtouch, no skip, 8 connections

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=1024G --cpus boot=32
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --connections 8
    > memtouch --rw_ratio 100 --thread_mem 32512 --num_threads 32 --once
  • 1TiB, 32vCPU, with memtouch, with skip, 8 connections

    > cargo run --release --bin cloud-hypervisor -- --api-socket /tmp/jschindel_chv1.sock --kernel result/linux_6_19.bzImage --cmdline "console=ttyS0" --serial tty --console off --initramfs result/initrd_default --seccomp log -vv --memory size=1024G --cpus boot=32
    > cargo run --release --bin ch-remote -- --api-socket /tmp/jschindel_chv1.sock send-migration tcp:192.168.123.2:7868 --downtime 200 --migration-timeout 12000 --connections 8 --skip-zero-pages
    > memtouch --rw_ratio 100 --thread_mem 32512 --num_threads 32 --once

The numbers are good for unused machines, but don't look great for machines with full memory utilization. The obvious optimization would be to split the zero-page checking between multiple threads, but I don't want to make this PR more complex. Since the zero-page skipping can be toggled, it should only be applied to newly created or not heavily used VMs. Open to suggestions and opinions though.
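The multi-threaded split mentioned above could look roughly like this. A hedged sketch only: it scans a flat byte slice with scoped threads, whereas the real implementation would have to work on guest memory ranges; `non_zero_pages` is a hypothetical name:

```rust
use std::thread;

/// Hypothetical sketch: split a memory region into per-thread chunks and
/// count non-zero pages in parallel using scoped threads.
fn non_zero_pages(mem: &[u8], page_size: usize, threads: usize) -> usize {
    let pages = mem.len() / page_size;
    let pages_per_thread = pages.div_ceil(threads);
    thread::scope(|s| {
        let mut handles = Vec::new();
        // Each chunk covers a whole number of pages, so no page straddles
        // two threads.
        for chunk in mem.chunks(pages_per_thread * page_size) {
            handles.push(s.spawn(move || {
                chunk
                    .chunks(page_size)
                    .filter(|page| page.iter().any(|&b| b != 0))
                    .count()
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let mut mem = vec![0u8; 64 * 4096];
    mem[4096] = 1;          // page 1 is dirty
    mem[10 * 4096 + 7] = 2; // page 10 is dirty
    let count = non_zero_pages(&mem, 4096, 4);
    assert_eq!(count, 2);
    println!("{count}");
}
```

`thread::scope` keeps the borrow of `mem` valid without `Arc`, which is why it fits this kind of read-only scan well.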

@arctic-alpaca arctic-alpaca force-pushed the omit-zero-pages-pr branch 3 times, most recently from e6b541c to b50b666 Compare March 18, 2026 09:52
@arctic-alpaca arctic-alpaca marked this pull request as ready for review March 18, 2026 13:38
@olivereanderson

> | Setup | Migration time | % of no skip |
> | --- | --- | --- |
> | 32 GiB, 4 vCPU, no memtouch, no skip, 1 connection | 47083 ms | 100% |
> | 32 GiB, 4 vCPU, no memtouch, with skip, 1 connection | 2245 ms | 0,05% |
> | 1 TiB, 32 vCPU, no memtouch, no skip, 8 connections | 90250 ms | 100% |
> | 1 TiB, 32 vCPU, no memtouch, with skip, 8 connections | 44176 ms | 0,49% |
> | 32 GiB, 4 vCPU, with memtouch, no skip, 1 connection | 47634 ms | 100% |
> | 32 GiB, 4 vCPU, with memtouch, with skip, 1 connection | 50437 ms | 106% |
> | 1 TiB, 32 vCPU, with memtouch, no skip, 8 connections | 101556 ms | 100% |
> | 1 TiB, 32 vCPU, with memtouch, with skip, 8 connections | 179236 ms | 176% |

I think you forgot to multiply by 100 for the "with skip" entries.

@arctic-alpaca
Author

I think you forgot to multiply by 100 for the "with skip" entries.

That's what you get when you try to finish something quickly before a meeting 🤦 Fixed, thanks.

In the `MemoryRangeTable::partition` call, we're now skipping all pages
completely filled with zeroes. This reduces the memory that needs to be
transferred during migration if the VM has zero pages in its memory.

On-behalf-of: SAP julian.schindel@sap.com
Signed-off-by: Julian Schindel <julian.schindel@cyberus-technology.de>

@amphi amphi left a comment


Overall great code and thanks for doing the benchmarks! But I have to admit that I have the feeling that all of this is a lot of complexity for the small advantages we get in only some very special scenarios.

Can you maybe take a look at whether we could implement this in vm_send_memory (or somewhere near it)? There we already have the guest_memory, and we could do it multi-threaded (if we use multiple TCP connections). Or maybe in some other place, but this feels a bit intrusive (again, for the small win we get).

Sorry for being so negative about this change

Comment on lines +407 to +409
// As far as I can tell, `MemoryRange` should always start and end on page boundaries,
// but there are no type-level guarantees, so we handle page boundaries and overshoot
// to be safe.


I am really unsure whether we want to silently fix those memory regions if they don't start and end on page boundaries. Maybe just make it a debug_assert so we see it in the tests if this assumption is incorrect.

Member


Yup, non-trivial drive-by changes are usually better handled in a dedicated PR! :)

Author


If there are no documented invariants, I'd be cautious about failing on something we can handle. For example, the lengths of the MemoryRanges returned by this iterator don't fall on page boundaries.

To fix this properly, MemoryRange should enforce the page boundaries, not this method.

I'm not sure what the best way to handle this in this PR is.
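For reference, the "handle page boundaries and overshoot" behavior under discussion amounts to widening an arbitrary range outward to the enclosing page-aligned range. A hypothetical illustration (`align_to_pages` is not code from this PR; it assumes a power-of-two page size):

```rust
/// Expand an arbitrary (start, length) range outward to the enclosing
/// page-aligned range, "overshooting" rather than truncating.
/// Assumes `page_size` is a power of two.
fn align_to_pages(start: u64, len: u64, page_size: u64) -> (u64, u64) {
    let aligned_start = start & !(page_size - 1);       // round start down
    let end = start + len;
    let aligned_end = end.next_multiple_of(page_size);  // round end up
    (aligned_start, aligned_end - aligned_start)
}

fn main() {
    // A range starting mid-page is widened to whole pages.
    assert_eq!(align_to_pages(0x1800, 0x1000, 0x1000), (0x1000, 0x2000));
    // An already page-aligned range is unchanged.
    assert_eq!(align_to_pages(0x2000, 0x3000, 0x1000), (0x2000, 0x3000));
    println!("ok");
}
```

A debug_assert variant would instead check `start % page_size == 0 && len % page_size == 0` and panic in tests when the invariant is violated.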

Comment on lines 303 to 307


What's the reason for removing this comment?



Seems a bit broken, you deleted

    /// Return the next memory range in the table, making sure that
    /// the returned range is not larger than `chunk_size`.
    ///
    /// **Note**: Do not rely on the order of the ranges returned by this
    /// iterator. This allows for a more efficient implementation.

over MemoryRangeTableIterator::next. This comment is what my question is about.

Author


I moved it to the struct. It's not very visible in the trait method implementation and overrides the existing documentation of the trait method.

@arctic-alpaca
Author

Can you maybe take a look at whether we could implement this in vm_send_memory (or somewhere near it)? There we already have the guest_memory, and we could do it multi-threaded (if we use multiple TCP connections). Or maybe in some other place, but this feels a bit intrusive (again, for the small win we get).

Initially I was looking into doing this deeper in the call stack, but was under the impression that iterating over the complete MemoryRangeTable multiple times (after the MemoryRangeTable::partition call) would be problematic. With the benchmarks, I can now look properly into the tradeoffs, will do so 👍

Sorry for being so negative about this change

No worries 😃


@olivereanderson olivereanderson left a comment


Thank you for working on this. This looks quite reasonable to me 👍

I notice that the benchmarks are all done with parameters/data I expect to be edge cases/outliers.
In other words, both 0% writes and 100% writes are a bit absurd.

It would be interesting to see memtouch numbers with:

  • 2/3 reads (1/3 writes)
  • 1/2 reads (1/2 writes)
  • 1/3 reads (2/3 writes)

As I expect that to be a better approximation of most realistic workloads.

Comment on lines +358 to +363
/// Removes all-zero-pages from [`MemoryRangeTableIterator::data`] and populates
/// [`MemoryRangeTableIterator::zero_removed_data`] with the non-zero-pages.
///
/// # Panics
///
/// Panics if a memory range is not valid for [`MemoryRangeTableIterator::guest_memory`].


Please add a line to the documentation explaining what the returned bool means.

}

for page_start in
(0..page_amount).map(|page_index| page_index * page_size_u64 + first_page_boundary)


Aside: I would be curious to know whether the compiler inlines this function and replaces the multiplication with a shift (the page size is a power of two).
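The equivalence behind that aside can be demonstrated directly. Since the page size is a power of two, multiplying by it is the same as shifting left by its number of trailing zero bits, and optimizing compilers perform this strength reduction when the constant is known (a small sketch, not code from this PR):

```rust
fn main() {
    let page_size: u64 = 4096; // power of two: 2^12
    let page_index: u64 = 37;
    // Multiplication by a power-of-two constant and a left shift by
    // trailing_zeros() compute the same value; the compiler normally
    // emits the shift form.
    let by_mul = page_index * page_size;
    let by_shift = page_index << page_size.trailing_zeros();
    assert_eq!(by_mul, by_shift);
    println!("{by_mul}");
}
```

Whether the closure itself gets inlined is a separate question that is easiest to answer by inspecting the optimized assembly (e.g. `cargo asm` or a release-mode disassembly).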

@arctic-alpaca
Author

In other words both 0% writes and 100% writes is a bit absurd.

I agree for 0%, but I'm not sure that 100% is as absurd. It doesn't indicate that 100% of memory is currently in use, rather that every page of memory has been written to during the lifetime of the VM and not been zeroed again. But that's just my assumption, I have nothing to back it up.

@olivereanderson

In other words, both 0% writes and 100% writes are a bit absurd.

I agree for 0%, but I'm not sure that 100% is as absurd. It doesn't indicate that 100% of memory is currently in use, rather that every page of memory has been written to during the lifetime of the VM and not been zeroed again.

You are right that calling that absurd might be a bit exaggerated, but I still don't think it will be that common.

Feel free to keep the 0% and 100% cases, but it would still be useful to see the numbers with the parameters I suggested.

@arctic-alpaca arctic-alpaca marked this pull request as draft March 20, 2026 08:54
@arctic-alpaca
Author

arctic-alpaca commented Mar 20, 2026

Redrafting and going to open a new PR where the zero-page scanning happens in vm_send_memory. I need a bit of time for benchmarking and more in-depth testing, but initial numbers look promising.

@arctic-alpaca
Author

Superseded by #117.
