tests: add multi-queue pause/resume during boot regression #187

Open
Coffeeri wants to merge 5 commits into cyberus-technology:main from Coffeeri:regression/save-resume-barrier
Contributor

@Coffeeri Coffeeri commented Mar 25, 2026

This PR adds a regression test for pause/resume during boot with multi-queue virtio devices. It covers the barrier handling issue https://github.com/cobaltcore-dev/cobaltcore/issues/473, which was fixed in cyberus-technology/cloud-hypervisor#116.

It also cleans up the test code: it shares the domain-state assertions, adds a small timeout helper for commands that can hang, and makes the domain XML generator configurable for memory size, vCPU count, and queue counts.

@Coffeeri Coffeeri requested review from amphi, hertrste and phip1611 and removed request for phip1611 March 25, 2026 14:23
@Coffeeri Coffeeri marked this pull request as draft March 25, 2026 14:24
@Coffeeri Coffeeri force-pushed the regression/save-resume-barrier branch from 254d1d9 to aef370f Compare March 25, 2026 14:27
Member

@phip1611 phip1611 left a comment

LGTM, left some remarks

# observed, but still give the VMM time to initialize the domain.
time.sleep(3)

status, out = execute_or_fail_on_timeout(
Member

I don't get why this is not machine.succeed

Contributor Author

machine.succeed(cmd, timeout=15) does not kill CHV when it is stuck, so we remain stuck in the teardown of our test suite.

Using machine.succeed inside our helper execute_or_fail_on_timeout is not possible either, because we want to match the return value against 124/125.
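
For readers following the thread, the idea can be sketched roughly as below. This is an illustration only, not the helper from this PR: `run_or_fail_on_timeout`, `CommandTimedOut`, and the `on_timeout` hook are hypothetical names, and `execute` stands in for any callable that runs a shell command and returns `(status, output)`, which is the shape `machine.execute` returns in the NixOS test driver. The 124 and 125 values are the exit codes GNU coreutils `timeout` uses for "command timed out" and "`timeout` itself failed".

```python
from typing import Callable, Optional, Tuple

# Exit codes of GNU coreutils `timeout`:
# 124 = the command ran past the limit, 125 = `timeout` itself failed.
TIMEOUT_EXIT_CODES = (124, 125)

# Stand-in for something like machine.execute: command -> (status, output).
Execute = Callable[[str], Tuple[int, str]]


class CommandTimedOut(AssertionError):
    """Hypothetical error type for a command that hit the timeout."""


def run_or_fail_on_timeout(
    execute: Execute,
    cmd: str,
    seconds: int,
    on_timeout: Optional[Callable[[], None]] = None,
) -> Tuple[int, str]:
    """Run `cmd` under `timeout`; raise if the wrapper reports 124 or 125."""
    status, out = execute(f"timeout {seconds} {cmd}")
    if status in TIMEOUT_EXIT_CODES:
        # Give the caller a chance to clean up (for example, kill a hung
        # process) before the test fails, so teardown is not blocked.
        if on_timeout is not None:
            on_timeout()
        raise CommandTimedOut(
            f"{cmd!r} hit the {seconds}s timeout (status={status})\n{out}"
        )
    # Any other status is returned unchanged for the caller to assert on.
    return status, out
```

Because the exit status is inspected directly, the sketch needs an `execute`-style call rather than a `succeed`-style call, which would raise on any non-zero status before 124/125 could be told apart from ordinary failures.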

Member

I don't yet get why it is necessary to kill Cloud Hypervisor 🤔 Or is it a safety measure so that the next test run can succeed?

Contributor Author

I’m not entirely sure at which point the teardown process is getting stuck, but I suspect it happens when systemctl restart virtchd is invoked while CHV is hanging.
This would be interesting to further investigate 👍

Member

Please take a step back: this kill is necessary to prevent a stuck process teardown, aye? Are we on the same page?

@Coffeeri Coffeeri marked this pull request as ready for review March 25, 2026 15:07
@Coffeeri Coffeeri force-pushed the regression/save-resume-barrier branch 4 times, most recently from 6a0e60f to ab7a4f2 Compare March 25, 2026 15:59
@Coffeeri Coffeeri self-assigned this Mar 25, 2026
@Coffeeri Coffeeri force-pushed the regression/save-resume-barrier branch from ab7a4f2 to 1d0c68c Compare March 26, 2026 13:00
Member

@phip1611 phip1611 left a comment

fantastic!

)


def execute_or_fail_on_timeout(
Member

nit: execute_or_kill_on_timeout_failure?

@Coffeeri Coffeeri force-pushed the regression/save-resume-barrier branch from 1d0c68c to 1fc5ee4 Compare March 26, 2026 14:57
Replace the running-only domain assertion with a shared helper that
checks arbitrary `virsh domstate` values, and update the save/restore
and migration tests to use it.

Additionally, we add a shared command wrapper that turns timeout exit
codes into test failures.
If CHV hangs, the test hits the timeout but the CHV process can still
remain alive. Kill it before raising so teardown does not hang while
restarting `virtchd`.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
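
A shared domain-state assertion of the kind this commit describes might look roughly like the sketch below. The function names are hypothetical and not taken from the PR, and `execute` is again a stand-in for a callable that returns `(status, output)`. `virsh domstate <domain>` prints the state as a single word or phrase such as `running`, `paused`, or `shut off`.

```python
from typing import Callable, Tuple

# Stand-in for something like machine.execute: command -> (status, output).
Execute = Callable[[str], Tuple[int, str]]


def assert_domstate(execute: Execute, domain: str, expected: str) -> None:
    """Fail unless `virsh domstate` reports exactly the expected state."""
    status, out = execute(f"virsh domstate {domain}")
    assert status == 0, f"virsh domstate failed (status={status}): {out}"
    # virsh pads its output with blank lines, so compare the stripped text.
    actual = out.strip()
    assert actual == expected, (
        f"domain {domain!r} is {actual!r}, expected {expected!r}"
    )


def assert_domain_running(execute: Execute, domain: str) -> None:
    assert_domstate(execute, domain, "running")


def assert_domain_paused(execute: Execute, domain: str) -> None:
    assert_domstate(execute, domain, "paused")
```

Keeping the state as a parameter means one helper covers every state, and the thin wrappers only exist to make call sites read well.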
Allow the shared Cloud Hypervisor domain XML generator to override
memory size, vCPU count, and virtio queue counts.

This makes it possible to build more targeted integration test fixtures
without duplicating the base domain definition.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
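
As a rough sketch of what such a configurable generator could look like (not the generator from this PR): `domain_xml` and all of its parameters are hypothetical, the output is deliberately incomplete (no disk source, no network target, no OS section), and the `domain_type` default is an assumption. The `queues` attribute on `<driver>` is the libvirt domain XML knob for multi-queue virtio-net and virtio-blk.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def domain_xml(
    name: str = "testvm",
    memory_mib: int = 2048,
    vcpus: int = 2,
    net_queues: Optional[int] = None,
    blk_queues: Optional[int] = None,
    domain_type: str = "kvm",  # assumption: use whatever the real fixture uses
) -> str:
    """Build a minimal, incomplete libvirt domain XML string."""
    domain = ET.Element("domain", {"type": domain_type})
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", {"unit": "MiB"}).text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)

    devices = ET.SubElement(domain, "devices")

    # virtio-blk disk; the queue count is only emitted when overridden.
    disk = ET.SubElement(devices, "disk", {"type": "file", "device": "disk"})
    if blk_queues is not None:
        ET.SubElement(disk, "driver", {"queues": str(blk_queues)})
    ET.SubElement(disk, "target", {"dev": "vda", "bus": "virtio"})

    # virtio-net interface; same pattern for the queue count.
    iface = ET.SubElement(devices, "interface", {"type": "ethernet"})
    ET.SubElement(iface, "model", {"type": "virtio"})
    if net_queues is not None:
        ET.SubElement(iface, "driver", {"queues": str(net_queues)})

    return ET.tostring(domain, encoding="unicode")
```

A multi-queue fixture would then be `domain_xml(memory_mib=4096, vcpus=4, net_queues=4, blk_queues=4)`, while existing tests keep calling `domain_xml()` and get the defaults.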
Add a regression test for suspend/resume while a guest is still booting
with multi-queue virtio-net and virtio-blk devices.

This covers the barrier sizing issue fixed in cloud-hypervisor where the
pause path could hang if the guest activated fewer queues than were
configured.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
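
The failure mode can be modeled with a toy Python example. This is not the Cloud Hypervisor code (which is Rust), and the exact mechanism there is an assumption based on the commit message above: a barrier is sized for the configured queue count plus the pausing thread, but only the activated queues have workers that ever reach it. `pause_completes` and its parameters are hypothetical, and the timeout exists only so the demo terminates where the real pause path would hang.

```python
import threading


def pause_completes(
    configured_queues: int,
    activated_queues: int,
    size_for_configured: bool,
    timeout: float = 0.5,
) -> bool:
    """Return True if every party reaches the barrier before the timeout."""
    queue_parties = configured_queues if size_for_configured else activated_queues
    # One extra party for the thread that requests the pause.
    barrier = threading.Barrier(queue_parties + 1)

    def queue_worker() -> None:
        try:
            barrier.wait(timeout)
        except threading.BrokenBarrierError:
            pass  # the barrier was broken by a timeout elsewhere

    # Only activated queues have a worker that can arrive at the barrier.
    workers = [threading.Thread(target=queue_worker) for _ in range(activated_queues)]
    for worker in workers:
        worker.start()

    try:
        barrier.wait(timeout)
        completed = True
    except threading.BrokenBarrierError:
        completed = False

    for worker in workers:
        worker.join()
    return completed


if __name__ == "__main__":
    # Guest still booting: 4 queues configured, only 2 activated so far.
    print(pause_completes(4, 2, size_for_configured=True))
    print(pause_completes(4, 2, size_for_configured=False))
```

With the barrier sized for configured queues, two parties never arrive and the wait only ends because of the timeout; sized for activated queues, all parties arrive and the pause completes.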
Replace the inline domstate parsing in test_suspend_resume with the
shared paused/running assertion helpers.

This keeps the test consistent with the new shared domain-state helpers
and removes duplicated shell parsing.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
Add a live migration test for a guest with virtio multi-queue devices.

Pause and resume the guest before migration, then trigger a guest
reboot while the migration is in progress. Verify that the guest is
reachable on computeVM afterwards and that the source domain is shut
off.

On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
@Coffeeri Coffeeri force-pushed the regression/save-resume-barrier branch from 1fc5ee4 to d1f2408 Compare March 27, 2026 07:24
f"cloud_hypervisor_pids={killed_chv_pids or '<none>'}\n"
f"retry_status={retry_status}\n"
f"retry_output:\n{retry_out}"
)
Collaborator

Hmm, this function assumes that leftover running CHV instances are fine and that we just have to kill them. Is that always the case, or is it rather a sign of some kind of error?

3 participants