tests: add multi-queue pause/resume during boot regression#187
Coffeeri wants to merge 5 commits into cyberus-technology:main
254d1d9 to aef370f
tests/testsuite_default.py
Outdated
    # observed, but still give the VMM time to initialize the domain.
    time.sleep(3)

    status, out = execute_or_fail_on_timeout(
I don't get why this is not `machine.succeed`.
`machine.succeed(cmd, timeout=15)` does not kill CHV when it gets stuck, so we remain stuck in the teardown of our test suite.
Also, using `machine.succeed` within our helper `execute_or_fail_on_timeout` is not possible, because we want to match the return value against 124/125.
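The exit-code matching described above can be sketched as follows. This is a minimal illustration, not the helper from this PR: the name `run_or_fail_on_timeout`, its parameters, and the `on_timeout` hook are hypothetical. It assumes the command is wrapped in GNU coreutils `timeout`, which exits with 124 when the command times out and 125 when `timeout` itself fails. `execute` stands in for any callable that returns `(status, output)`, such as the NixOS test driver's `machine.execute`.

```python
import shlex
from typing import Callable, Optional, Tuple

# GNU coreutils `timeout`: 124 = command timed out, 125 = `timeout` itself failed.
TIMEOUT_EXIT_CODES = frozenset({124, 125})

# Assumption: the executor returns (exit status, combined output).
Executor = Callable[[str], Tuple[int, str]]


def run_or_fail_on_timeout(
    execute: Executor,
    cmd: str,
    timeout_s: int,
    on_timeout: Optional[Callable[[], None]] = None,
) -> Tuple[int, str]:
    """Run `cmd` under `timeout` and raise if the wrapper reports 124 or 125.

    Hypothetical sketch: `on_timeout` is the place where a caller could kill
    a hung Cloud Hypervisor process before the test fails.
    """
    wrapped = f"timeout {timeout_s} sh -c {shlex.quote(cmd)}"
    status, out = execute(wrapped)
    if status in TIMEOUT_EXIT_CODES:
        if on_timeout is not None:
            on_timeout()  # clean up first so that teardown does not hang
        raise AssertionError(
            f"command timed out or could not be run: {cmd!r}\n"
            f"status={status}\noutput:\n{out}"
        )
    # Any other status, including ordinary failures, is left to the caller.
    return status, out
```

Returning the raw status instead of asserting on it keeps the timeout case distinguishable from an ordinary command failure, which is the reason given above for not using `machine.succeed`.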
I don't yet get why it is necessary to kill Cloud Hypervisor 🤔 Or is it a safety measure so that the next test run can succeed?
I’m not entirely sure at which point the teardown process is getting stuck, but I suspect it happens when systemctl restart virtchd is invoked while CHV is hanging.
This would be interesting to further investigate 👍
Please take a step back: this kill is necessary to prevent a stuck process teardown, aye? Are we on the same page?
6a0e60f to ab7a4f2
ab7a4f2 to 1d0c68c
    )


    def execute_or_fail_on_timeout(
nit. execute_or_kill_on_timeout_failure?
1d0c68c to 1fc5ee4
Replace the running-only domain assertion with a shared helper that checks arbitrary `virsh domstate` values, and update the save/restore and migration tests to use it. Additionally, we add a shared command wrapper that turns timeout exit codes into test failures. If CHV hangs, the test hits the timeout but the CHV process can still remain alive. Kill it before raising so teardown does not hang while restarting `virtchd`.
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
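A shared domain-state assertion of the kind this commit describes could look roughly like the sketch below. The function names and the `execute` callable are assumptions made for illustration and are not taken from the PR; the only external fact relied on is that `virsh domstate <domain>` prints the state (for example `running`, `paused`, or `shut off`) followed by blank lines.

```python
import shlex
from typing import Callable, Tuple

# Assumption: the executor returns (exit status, output), like `machine.execute`.
Executor = Callable[[str], Tuple[int, str]]


def assert_domstate(execute: Executor, domain: str, expected: str) -> None:
    """Fail unless `virsh domstate <domain>` reports exactly `expected`."""
    status, out = execute(f"virsh domstate {shlex.quote(domain)}")
    assert status == 0, f"virsh domstate failed with status {status}: {out!r}"
    actual = out.strip()  # virsh appends trailing blank lines
    assert actual == expected, (
        f"domain {domain!r} is {actual!r}, expected {expected!r}"
    )


def assert_running(execute: Executor, domain: str) -> None:
    """Convenience wrapper for the common 'running' check."""
    assert_domstate(execute, domain, "running")


def assert_paused(execute: Executor, domain: str) -> None:
    """Convenience wrapper for the 'paused' check."""
    assert_domstate(execute, domain, "paused")
```

Keeping one generic function and thin wrappers means a new state such as `shut off` needs no extra parsing code in the individual tests.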
Allow the shared Cloud Hypervisor domain XML generator to override memory size, vCPU count, and virtio queue counts. This makes it possible to build more targeted integration test fixtures without duplicating the base domain definition.
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
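To illustrate the idea of a configurable generator, here is a deliberately simplified sketch. The function `build_domain_xml`, its parameters, and its defaults are hypothetical, and the output is only a fragment of a libvirt domain definition rather than a bootable one. It relies on the libvirt domain XML format, where a `queues` attribute on a device's `<driver>` element sets the queue count; whether the real generator emits these exact elements is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_domain_xml(
    name: str = "testvm",
    memory_mib: int = 2048,
    vcpus: int = 2,
    net_queues: Optional[int] = None,
    blk_queues: Optional[int] = None,
) -> str:
    """Build a simplified domain XML fragment with overridable sizing."""
    domain = ET.Element("domain", type="kvm")  # assumed virtualization type
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)
    devices = ET.SubElement(domain, "devices")

    # virtio-net interface; the queue count is only emitted when overridden.
    iface = ET.SubElement(devices, "interface", type="ethernet")
    ET.SubElement(iface, "model", type="virtio")
    if net_queues is not None:
        ET.SubElement(iface, "driver", queues=str(net_queues))

    # virtio-blk disk; same override pattern as the interface.
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    if blk_queues is not None:
        ET.SubElement(disk, "driver", queues=str(blk_queues))

    return ET.tostring(domain, encoding="unicode")
```

A multi-queue fixture would then be requested with something like `build_domain_xml(vcpus=4, net_queues=4, blk_queues=4)`, while existing tests keep calling the generator without arguments.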
Add a regression test for suspend/resume while a guest is still booting with multi-queue virtio-net and virtio-blk devices. This covers the barrier sizing issue fixed in cloud-hypervisor where the pause path could hang if the guest activated fewer queues than were configured.
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
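The core of such a regression test is a suspend/resume cycle issued while the guest is still booting. The sketch below shows only that cycle, using the standard `virsh suspend`, `virsh resume`, and `virsh domstate` subcommands. The helper names are hypothetical, and the timeout handling that the actual test needs (a hang in the pause path is the failure mode) is omitted to keep the example short.

```python
import shlex
from typing import Callable, Tuple

# Assumption: the executor returns (exit status, output), like `machine.execute`.
Executor = Callable[[str], Tuple[int, str]]


def virsh(execute: Executor, *args: str) -> str:
    """Run one virsh subcommand and return its stripped output."""
    cmd = "virsh " + " ".join(shlex.quote(a) for a in args)
    status, out = execute(cmd)
    assert status == 0, f"{cmd!r} failed with status {status}: {out!r}"
    return out.strip()


def pause_resume_cycle(execute: Executor, domain: str, cycles: int = 1) -> None:
    """Suspend and resume `domain`, checking the reported state each time."""
    for _ in range(cycles):
        virsh(execute, "suspend", domain)
        assert virsh(execute, "domstate", domain) == "paused"
        virsh(execute, "resume", domain)
        assert virsh(execute, "domstate", domain) == "running"
```

In a test, this would be called right after the domain is started and before the guest has finished booting, so that the pause request arrives while not all configured queues are active yet.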
Replace the inline domstate parsing in test_suspend_resume with the shared paused/running assertion helpers. This keeps the test consistent with the new shared domain-state helpers and removes duplicated shell parsing.
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
Add a live migration test for a guest with virtio multi-queue devices. Pause and resume the guest before migration, then trigger a guest reboot while the migration is in progress. Verify that the guest is reachable on computeVM afterwards and that the source domain is shut off.
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
1fc5ee4 to d1f2408
    f"cloud_hypervisor_pids={killed_chv_pids or '<none>'}\n"
    f"retry_status={retry_status}\n"
    f"retry_output:\n{retry_out}"
    )
Hmm, this function assumes that leftover running CHV instances are acceptable and that we just have to kill them. Is that always the case, or is it rather a sign of some kind of error?
This PR adds a regression test for pause/resume during boot with multi-queue virtio devices. It covers the barrier handling issue (https://github.com/cobaltcore-dev/cobaltcore/issues/473) that was fixed in cyberus-technology/cloud-hypervisor#116.
It also cleans up the test code a bit by sharing domain-state assertions, adding a small timeout helper for commands that can hang, and making the domain XML generator configurable for memory, vCPU count, and queue counts.