Directnic nvidia stable 10.1 #7
tdavenvidia wants to merge 7 commits into NVIDIA:nvidia_stable-10.1
Conversation
Nvidia’s next-generation GB200 platform has the Blackwell GPU and CX8
directly connected through a PCIe Gen6 x16 link. Direct P2P PCIe traffic
between the GPU and the NIC is possible; however, it requires ATS at its
core, and the Grace CPU does not support PCIe ATS. The GPA=HPA solution
removes the need for GPA-to-HPA address translation by configuring the
PCIe BARs in the VM with HPAs. It also enables the ACPI PCI DSM by setting
‘preserve_config’ to true, to prevent the VM from reconfiguring the PCI
BARs during boot.
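As background for the approach, a minimal sketch (not the patch code; the
helper name is hypothetical): the HPAs that GPA=HPA mirrors into the guest
are available on the host via the standard Linux PCI sysfs 'resource'
attribute, which has one "0x<start> 0x<end> 0x<flags>" line per region.

#include <stdio.h>
#include <inttypes.h>

static int read_host_bars(const char *bdf, uint64_t starts[6])
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource", bdf);
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    for (int i = 0; i < 6; i++) {
        uint64_t start, end, flags;

        if (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                   &start, &end, &flags) != 3) {
            break;
        }
        starts[i] = start;  /* HPA of BAR i; 0 if the BAR is unimplemented */
    }
    fclose(f);
    return 0;
}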
Here is an example PCIe topology showing the GPU and CX8 behind the PCIe switch:
$ lspci -vt
-[0000:00]---00.0-[01-07]----00.0-[02-07]--+-00.0-[03]--+-00.0 Mellanox Technologies CX8 Family [ConnectX-8]
| \-00.1 Mellanox Technologies CX8 Family [ConnectX-8]
\-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]--
\-0c.0-[07]--
-[0002:00]---00.0-[01-07]----00.0-[02-07]--+-00.0-[03]--+-00.0 Mellanox Technologies CX8 Family [ConnectX-8]
| \-00.1 Mellanox Technologies CX8 Family [ConnectX-8]
\-01.0-[04-07]----00.0-[05-07]--+-08.0-[06]--
\-0c.0-[07]--
-[0005:00]---00.0-[01-0a]----00.0-[02-0a]--+-01.0-[03]--
+-02.0-[04]--
+-03.0-[05]--
+-04.0-[06-07]----00.0-[07]----00.0 ASPEED Technology, Inc. ASPEED Graphics Family
+-05.0-[08]----00.0 Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
+-06.0-[09]----00.0 Intel Corporation I210 Gigabit Network Connection
\-07.0-[0a]--
-[0006:00]---00.0-[01-09]----00.0-[02-09]--+-00.0-[03]--+-00.0 Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller
| +-00.1 Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller
| \-00.2 Mellanox Technologies MT43244 BlueField-3 SoC Management Interface
\-02.0-[04-09]----00.0-[05-09]--+-00.0-[06]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
+-04.0-[07]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
+-08.0-[08]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
\-0c.0-[09]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
-[0008:00]---00.0-[01-06]----00.0-[02-06]--+-00.0-[03]----00.0 Mellanox Technologies Device 2100
\-03.0-[04-06]----00.0-[05-06]----00.0-[06]----00.0 NVIDIA Corporation Device 2941
-[0009:00]---00.0-[01-06]----00.0-[02-06]--+-00.0-[03]----00.0 Mellanox Technologies Device 2100
\-01.0-[04-06]----00.0-[05-06]----00.0-[06]----00.0 NVIDIA Corporation Device 2941
-[0010:00]---00.0-[01-07]----00.0-[02-07]--+-00.0-[03]--+-00.0 Mellanox Technologies CX8 Family [ConnectX-8]
| \-00.1 Mellanox Technologies CX8 Family [ConnectX-8]
\-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]--
\-0c.0-[07]--
-[0012:00]---00.0-[01-07]----00.0-[02-07]--+-00.0-[03]--+-00.0 Mellanox Technologies CX8 Family [ConnectX-8]
| \-00.1 Mellanox Technologies CX8 Family [ConnectX-8]
\-01.0-[04-07]----00.0-[05-07]--+-08.0-[06]--
\-0c.0-[07]--
-[0015:00]---00.0-[01]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
-[0016:00]---00.0-[01-09]----00.0-[02-09]--+-00.0-[03]--+-00.0 Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller
| +-00.1 Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller
| \-00.2 Mellanox Technologies MT43244 BlueField-3 SoC Management Interface
\-02.0-[04-09]----00.0-[05-09]--+-00.0-[06]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
+-04.0-[07]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
+-08.0-[08]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
\-0c.0-[09]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
-[0018:00]---00.0-[01-06]----00.0-[02-06]--+-00.0-[03]----00.0 Mellanox Technologies Device 2100
\-03.0-[04-06]----00.0-[05-06]----00.0-[06]----00.0 NVIDIA Corporation Device 2941
-[0019:00]---00.0-[01-06]----00.0-[02-06]--+-00.0-[03]----00.0 Mellanox Technologies Device 2100
\-01.0-[04-06]----00.0-[05-06]----00.0-[06]----00.0 NVIDIA Corporation Device 2941
GPA=HPA is expected to work with a PCIe topology in the VM that resembles
bare metal. In other words, for P2P PCIe traffic (using GPA=HPA) over Gen6,
the CX8 NIC (the DMA-PF) and the GPU assigned to the VM should be under the
same PCIe switch.
Note: the PCIe switch needs a special, non-conventional ACS configuration
such that the minimal P2P routes needed for GPUDirect RDMA are allowed.
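For reference, here is this editor's decode (not part of the PR) of the
acs-caps values used in the testing command below, against the standard
PCI_ACS_* bit definitions from Linux's include/uapi/linux/pci_regs.h:

/* PCIe ACS capability/control bits */
#define PCI_ACS_SV  0x0001  /* Source Validation */
#define PCI_ACS_TB  0x0002  /* Translation Blocking */
#define PCI_ACS_RR  0x0004  /* P2P Request Redirect */
#define PCI_ACS_CR  0x0008  /* P2P Completion Redirect */
#define PCI_ACS_UF  0x0010  /* Upstream Forwarding */
#define PCI_ACS_EC  0x0020  /* P2P Egress Control */
#define PCI_ACS_DT  0x0040  /* Direct Translated P2P */

/* acs-caps=0x1C -> PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF  (root port)
 * acs-caps=0x19 -> PCI_ACS_SV | PCI_ACS_CR | PCI_ACS_UF  (port above the CX8 DMA-PF)
 * acs-caps=0x15 -> PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_UF  (port above the GPU)
 */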
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Tushar Dave <tdave@nvidia.com>
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
The Grace Blackwell GPU PCIe BAR1 is a real BAR exposed to the VM that can
be used for GPUDirect RDMA [1].
This patch assigns an HPA to BAR1 in the VM for the reason mentioned in
commit 54db2e4a632 ("hw/arm: GB200 DirectNIC GPA=HPA").
This patch also assigns an appropriate GPA to GPU BAR2 (exposed to the VM
with the same size as BAR1; it emulates the C2C cache-coherent address
space) to avoid region conflicts in PCI bus resource assignment.
[1]: https://lore.kernel.org/lkml/20241006102722.3991-1-ankita@nvidia.com/
Signed-off-by: Tushar Dave <tdave@nvidia.com>
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Similar to GB200, GB300 also requires a workaround to make GPU BAR1 GPA=HPA.
Signed-off-by: Tushar Dave <tdave@nvidia.com>
…e ports
When a PASID-capable device, for example an Nvidia GPU, is added behind a
PCIe downstream port, the PCIe downstream ports must expose the ACS
capability, otherwise PASID won't get enabled. In addition, the other use
case is GPUDirect RDMA using Data Direct, which requires special ACS
controls at the PCIe downstream ports for P2P communication.
Signed-off-by: Tushar Dave <tdave@nvidia.com>
To support P2P in the guest, we must expose to the guest OS the actual PCIe
topology and configuration as set by the HYP. Otherwise, the behavior is
considered undefined and might fail in SW or HW. Extend both the root port
and the downstream port to take ACS caps that should match the HYP, and use
them in the guest.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Tushar Dave <tdave@nvidia.com>
GPUDirect RDMA using Data Direct requires a specific ACS configuration on
PCIe Root Ports and Downstream Ports. While ACS can be configured via
QEMU's 'acs-caps' property, the guest kernel may overwrite ACS during
standard programming. This change blocks all guest writes to the PCIe ACS
Control register and preserves QEMU-provided ACS settings across device
resets on PCIe Root Ports and Downstream Ports.
Signed-off-by: Tushar Dave <tdave@nvidia.com>
Testing command for GB200/GB300 GPUDirect RDMA using Nvidia GPU, CX8 and Data Direct Interface:
qemu-system-aarch64 \
-object iommufd,id=iommufd0 \
-machine hmat=on -machine virt,accel=kvm,gic-version=3,ras=on,grace-pcie-mmio-identity=on,highmem-mmio-size=4T \
-cpu host -smp cpus=16 -m size=16G,slots=2,maxmem=256G -nographic \
-object memory-backend-ram,size=8G,id=m0 \
-object memory-backend-ram,size=8G,id=m1 \
-numa node,memdev=m0,cpus=0-15,nodeid=0 -numa node,memdev=m1,nodeid=1 \
-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
-numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on,ats=on,ril=off,pasid=on,oas=48,cmdqv=on \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0,acs-caps=0x1C \
-device x3130-upstream,id=upstream1,bus=pcie.port1 \
-device xio3130-downstream,id=downstream1_1,bus=upstream1,chassis=1,slot=1,acs-caps=0x19 \
-device vfio-pci,host=0018:03:00.0,bus=downstream1_1,id=dmapf1,iommufd=iommufd0 \
-device xio3130-downstream,id=downstream1_2,bus=upstream1,chassis=1,slot=2,acs-caps=0x15 \
-device vfio-pci-nohotplug,host=0018:06:00.0,bus=downstream1_2,rombar=0,id=dev0,iommufd=iommufd0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
-bios /usr/share/AAVMF/AAVMF_CODE.fd \
-device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \
-drive file=/home/nvidia/tushar/tushar/ubuntu-24.04-server-cloudimg-arm64-grace-6.14.0-1007-nvidia-64k.qcow2,index=0,media=disk,format=qcow2,if=none,id=nvme0 \
-device e1000,netdev=net0,bus=pcie.0 \
-device pxb-pcie,id=pcie.9,bus_nr=9,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.9,id=smmuv3.2,accel=on,ats=on,ril=off,pasid=on,oas=48,cmdqv=on \
-device pcie-root-port,id=pcie.port9,bus=pcie.9,chassis=4,io-reserve=0 \
-device x3130-upstream,id=upstream9,bus=pcie.port9 \
-device xio3130-downstream,id=downstream9_1,bus=upstream9,chassis=4,slot=1 \
-device vfio-pci,host=0012:03:00.1,bus=downstream9_1,id=nic1,iommufd=iommufd0 \
-netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586
Signed-off-by: Tushar Dave <tdave@nvidia.com>
--
nvmochs
left a comment
I confirmed that these commits match what I had previously reviewed internally. I don't think it makes sense to keep the README since end-users will be relying on documentation rather than a commit message - I plan on dropping that when merging.
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
I understand this is a rebase of an existing solution onto this branch and that it is highly NVIDIA-specific. Given that, I haven’t done a thorough review at this point. I’m fine proceeding based on the test and verification results; if those look good, please go ahead.
MitchellAugustin
left a comment
Thanks all - I left a few questions that I'd like your input on, as well as some minor change requests.
Since this is a somewhat large delta not intended for upstream, I'd say my main focus is to ensure that any potentially user-visible differences are well documented in relevant parts of the source and in Nvidia's end-user documentation, if it is expected that the behavior for some common action might differ between our QEMU and mainline.
I only sparsely commented on areas that consist primarily of very hardware-specific changes (such as specific address mappings that I don't have the full context on), so lmk if you would like additional input from me on any of those areas.
    pbar[idx].end = pbar[idx].addr + dev->io_regions[idx].size - 1;
}

/* Make sure BAR1 gets GPA=HPA; adjust the other two BARs accordingly to avoid region conflict */
overlap = true;
while (overlap) {
I'm wondering if there's any opportunity / benefit in optimizing these nested loops - perhaps by tracking the highest address of the newly shifted BARs within an iteration, and then using that as the alignment input for pbar[j]? (If I understand correctly, this outermost while (overlap) is only needed because a given BAR could theoretically be adjusted multiple times, which I think that approach might alleviate.)
As-is, I don't think this would cause a performance issue since I don't see any expensive operations here (and since AFAIK we only have a single-digit number of BARs we're iterating through), but it seems like something that could require optimization if any expensive operations are ever added to these inner loops in the future.
WDYT, worthwhile optimization or not necessary here?
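A sketch of the reviewer's suggestion (hypothetical; pbar[], nbars, and the .size field are assumptions, not the PR's actual structures): keep a running high-water mark above the pinned BAR1 window and place each conflicting BAR above it, so one pass suffices and no BAR is moved twice.

uint64_t top = pbar[1].end;                 /* BAR1 stays pinned at its HPA */
for (int j = 0; j < nbars; j++) {
    if (j == 1 || pbar[j].size == 0) {
        continue;
    }
    /* conflict with the span [pbar[1].addr, top] covering all placed BARs */
    if (pbar[j].addr <= top && pbar[j].end >= pbar[1].addr) {
        pbar[j].addr = QEMU_ALIGN_UP(top + 1, pbar[j].size);
        pbar[j].end  = pbar[j].addr + pbar[j].size - 1;
        top = pbar[j].end;                  /* grow the high-water mark */
    }
}

Because every relocated BAR lands contiguously above the pinned window, checking later BARs against the single span [pbar[1].addr, top] catches any overlap with an already-placed BAR.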
These GPA=HPA changes will be updated with a more generic solution - I will address any optimizations then. Right now it looks good IMO.
    4);
}

static void nvidia_dev_vfio(PCIBus *bus, PCIDevice *dev, void *opaque)
Would you mind adding a comment to the code here to briefly describe the format of what is being read from the file descriptors later in this function, and how that content is being used in the context of this function?
That inconsistency is stylistic; there’s no semantic difference. I’ll unify the loop bounds for clarity when I make the generic solution. Would that be okay?
Assuming this reply was meant for my above comment - in this thread, I was curious about the syntax of the data read from {vdev->vbasedev.sysfsdev}/resource here on line 292.
My comment "That inconsistency is stylistic; ..." was about the inner and outer for-loop comment that you asked about.
> Would you mind adding a comment to the code here to briefly describe the format of what is being read from the file descriptors later in this function, and how that content is being used in the context of this function?
It’s the sysfs path string for the VFIO device. In QEMU’s VFIO stack, VFIODevice has a char *sysfsdev field that points to the kernel sysfs entry of the host device (e.g., /sys/bus/pci/devices/0009:06:00.0). I don't think we need a comment here.
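For readers with the same question: the per-device sysfs 'resource' attribute itself (standard Linux PCI sysfs, not something this PR defines) contains one line per BAR/region, with start, end, and flags printed as 64-bit hex values:

$ cat /sys/bus/pci/devices/<domain:bus:dev.fn>/resource
0x<start> 0x<end> 0x<flags>    (one line per region; all zeros if unused)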
/* Try with the extended parent window */
ncfg->rbase = QEMU_ALIGN_UP(ncfg->wlimit64 + 1, ncfg->rsize);
ncfg->wlimit64 = ncfg->rbase + ncfg->rsize - 1;
/* TODO: check conflicts with the extended window */
Is this TODO anything we should be concerned about within the scope of this PR, or for a future PR and expected to remain a TODO at this stage?
No, that TODO is not necessary.
tdavenvidia
left a comment
Thanks for the review. I replied to your comments; let me know.
} NVIDIACfg;

#define IORESOURCE_PREFETCH 0x00002000 /* No side effects */
#define IORESOURCE_MEM_64   0x00100000
Sure, I will remove IORESOURCE_MEM_64
cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
           PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT;

if (ctrl_bits & ~cap_bits) {
Thank you for your thorough review :)
pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);

if (is_downstream && p->acs_caps) {
    /* Block guest writes to ACS Control entirely to preserve QEMU ACS settings */
Mainly for my understanding: if this is only needed for GPUDirect RDMA, is it necessary to make this change for all PCI devices (as opposed to just NICs and GPUs)?
We are adding 'acs_caps' as a generic option, right? ACS is per device, primarily applicable to Root Ports and Downstream Ports, and so is 'acs_caps'.
> I'd also like to have a better understanding of whether you think this kind of change might cause any unexpected end-user-visible behavior when doing passthrough operations,
The admin who is launching QEMU and using ACS should know what they are doing. Like the kernel, which allows the 'config_acs' kernel parameter, acs_caps serves the same purpose - for example, in this case, to allow P2P between PCIe devices, e.g. GPU and NIC.
> Also, making the ACS read-only to the guest seems like something that might be useful to other hardware, and also might be nice to have as an explicit qemu configuration option rather than being implemented in this function - are there plans to make this a more general configurable in that future upstream PR?
The acs_caps is a generic option for all PCIe.
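For context, an illustrative sketch (not the patch itself) of the mechanism: QEMU exposes per-bit guest writability of config space through PCIDevice::wmask (a 1 bit means the guest may write it), so clearing the ACS Control word in wmask turns guest writes into no-ops and the QEMU-programmed 'acs-caps' value survives.

#include "qemu/osdep.h"
#include "hw/pci/pci.h"

/* make the ACS Control register read-only to the guest;
 * acs_offset is the offset of the ACS extended capability */
static void acs_ctrl_block_guest_writes(PCIDevice *dev, uint16_t acs_offset)
{
    pci_set_word(dev->wmask + acs_offset + PCI_ACS_CTRL, 0x0000);
}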
FWIW, here is the testing command for GB200/GB300 GPUDirect RDMA using Nvidia GPU, CX8 and Data Direct Interface:
qemu-system-aarch64 \
-object iommufd,id=iommufd0 \
-machine hmat=on -machine virt,accel=kvm,gic-version=3,ras=on,grace-pcie-mmio-identity=on,highmem-mmio-size=4T \
-cpu host -smp cpus=16 -m size=16G,slots=2,maxmem=256G -nographic \
-object memory-backend-ram,size=8G,id=m0 \
-object memory-backend-ram,size=8G,id=m1 \
-numa node,memdev=m0,cpus=0-15,nodeid=0 -numa node,memdev=m1,nodeid=1 \
-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
-numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on,ats=on,ril=off,pasid=on,oas=48,cmdqv=on \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0,acs-caps=0x1C \
-device x3130-upstream,id=upstream1,bus=pcie.port1 \
-device xio3130-downstream,id=downstream1_1,bus=upstream1,chassis=1,slot=1,acs-caps=0x19 \
-device vfio-pci,host=0018:03:00.0,bus=downstream1_1,id=dmapf1,iommufd=iommufd0 \
-device xio3130-downstream,id=downstream1_2,bus=upstream1,chassis=1,slot=2,acs-caps=0x15 \
-device vfio-pci-nohotplug,host=0018:06:00.0,bus=downstream1_2,rombar=0,id=dev0,iommufd=iommufd0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
-bios /usr/share/AAVMF/AAVMF_CODE.fd \
-device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \
-drive file=<IMG.qcow2>,index=0,media=disk,format=qcow2,if=none,id=nvme0 \
-device e1000,netdev=net0,bus=pcie.0 \
-device pxb-pcie,id=pcie.9,bus_nr=9,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.9,id=smmuv3.2,accel=on,ats=on,ril=off,pasid=on,oas=48,cmdqv=on \
-device pcie-root-port,id=pcie.port9,bus=pcie.9,chassis=4,io-reserve=0 \
-device x3130-upstream,id=upstream9,bus=pcie.port9 \
-device xio3130-downstream,id=downstream9_1,bus=upstream9,chassis=4,slot=1 \
-device vfio-pci,host=0012:03:00.1,bus=downstream9_1,id=nic1,iommufd=iommufd0 \
-netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586
}

-void pcie_acs_reset(PCIDevice *dev)
+void pcie_acs_reset(PCIDevice *dev, uint16_t val)
"val" refers to as a PCIe ACS Register value, so IMO it is fine.
MitchellAugustin
left a comment
Thanks for the quick responses to my review comments/questions. This all looks good to me now.
Acked-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Merged and will be present in 1:10.1.0+nvidia5-1 (and later). Closing PR.
Commit e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory
region") made the name member of MemoryRegion unset, causing a NULL
pointer dereference[1]:
> Thread 2 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0 0x00007ffff56565e2 in __strcmp_evex () at /lib64/libc.so.6
> #1 0x0000555555841bdb in find_fd (head=0x5555572337d0 <cpr_state>,
> name=0x0, id=0) at ../migration/cpr.c:68
> #2 cpr_delete_fd (name=name@entry=0x0, id=id@entry=0) at
> ../migration/cpr.c:77
> #3 0x000055555582290a in qemu_ram_free (block=0x7ff7e93aa7f0) at
> ../system/physmem.c:2615
> #4 0x000055555581ae02 in memory_region_finalize (obj=<optimized out>)
> at ../system/memory.c:1816
> #5 0x0000555555a70ab9 in object_deinit (obj=<optimized out>,
> type=<optimized out>) at ../qom/object.c:715
> #6 object_finalize (data=0x7ff7e936eff0) at ../qom/object.c:729
> #7 object_unref (objptr=0x7ff7e936eff0) at ../qom/object.c:1232
> #8 0x0000555555814fae in memory_region_unref (mr=<optimized out>) at
> ../system/memory.c:1848
> #9 flatview_destroy (view=0x555559ed6c40) at ../system/memory.c:301
> #10 0x0000555555bfc122 in call_rcu_thread (opaque=<optimized out>) at
> ../util/rcu.c:324
> #11 0x0000555555bf17a7 in qemu_thread_start (args=0x555557b99520) at
> ../util/qemu-thread-posix.c:393
> #12 0x00007ffff556f464 in start_thread () at /lib64/libc.so.6
> #13 0x00007ffff55f25ac in __clone3 () at /lib64/libc.so.6
The intention of the aforementioned commit is to prevent a MemoryRegion
from parenting itself while its reference count is maintained independently
of the device. To achieve the same goal, add a type of QOM object that
counts references and parents MemoryRegions.
[1] https://lore.kernel.org/qemu-devel/4eb93d7a-1fa9-4b3c-8ad7-a2eb64f025a0@collabora.com/
Cc: qemu-stable@nongnu.org
Fixes: e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory region")
Fixes: be88ad4 ("virtio-gpu-virgl: correct parent for blob memory region") for 10.2.x
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Tested-by: Joelle van Dyne <j@getutm.app>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20260214-region-v1-1-229f00ae1f38@rsg.ci.i.u-tokyo.ac.jp>
(cherry picked from commit b2a279094c3b86667969cc645f7fb1087e08dd19)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The test case in the ppe42 functional test triggers a TCG debug assertion,
which causes the test to fail in an --enable-debug build or when the
sanitizers are enabled:

#6  0x00007ffff4a3b517 in __assert_fail (assertion=0x5555562e7589 "!temp_readonly(ots)", file=0x5555562e5b23 "../../tcg/tcg.c", line=4928, function=0x5555562e8900 <__PRETTY_FUNCTION__.23> "tcg_reg_alloc_mov") at ./assert/assert.c:105
#7  0x0000555555cc2189 in tcg_reg_alloc_mov (s=0x7fff60000b70, op=0x7fff600126f8) at ../../tcg/tcg.c:4928
#8  0x0000555555cc74e0 in tcg_gen_code (s=0x7fff60000b70, tb=0x7fffa802f540, pc_start=4294446080) at ../../tcg/tcg.c:6667
#9  0x0000555555d02abe in setjmp_gen_code (env=0x555556cbe610, tb=0x7fffa802f540, pc=4294446080, host_pc=0x7fffeea00c00, max_insns=0x7fffee9f9d74, ti=0x7fffee9f9d90) at ../../accel/tcg/translate-all.c:257
#10 0x0000555555d02d75 in tb_gen_code (cpu=0x555556cba590, s=...) at ../../accel/tcg/translate-all.c:325
#11 0x0000555555cf5922 in cpu_exec_loop (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:970
#12 0x0000555555cf5aae in cpu_exec_setjmp (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:1016
#13 0x0000555555cf5b4b in cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/cpu-exec.c:1042
#14 0x0000555555d1e7ab in tcg_cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/tcg-accel-ops.c:82
#15 0x0000555555d1ff97 in rr_cpu_thread_fn (arg=0x555556cba590) at ../../accel/tcg/tcg-accel-ops-rr.c:285
#16 0x00005555561586c9 in qemu_thread_start (args=0x555556ee3c90) at ../../util/qemu-thread-posix.c:393
#17 0x00007ffff4a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#18 0x00007ffff4b29c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This can be reproduced "by hand":

./build/clang/qemu-system-ppc -display none -vga none \
  -machine ppe42_machine -serial stdio \
  -device loader,file=$HOME/.cache/qemu/download/03c1ac0fb7f6c025102a02776a93b35101dae7c14b75e4eab36a337e39042ea8 \
  -device loader,addr=0xfff80040,cpu-num=0

(assuming you have the image file from the functional test in your local
cache). This happens for this input:

IN:
0xfff80c00:  07436004    .byte 0x07, 0x43, 0x60, 0x04

which generates (among other things):

not_i32 $0x80000,$0x80000

which the TCG optimization pass turns into:

mov_i32 $0x80000,$0xfff7ffff dead: 1 pref=0xffff

and where we then assert because we tried to write to a constant.

This happens for the CLRBWIBC instruction, which ends up in
do_mask_branch() with rb_is_gpr false and invert true. In this case we
will generate code that sets mask to a tcg_constant_tl() but then uses it
as the LHS in tcg_gen_not_tl().

Fix the assertion by doing the invert in the translate-time C code for
the "mask is constant" case.

Cc: qemu-stable@nongnu.org
Fixes: f7ec91c ("target/ppc: Add IBM PPE42 special instructions")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260212150753.1749448-1-peter.maydell@linaro.org
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
(cherry picked from commit 78c6b6010ce7cfa54874dda514e694640b76f1e4)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
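For illustration, a sketch of the shape of the fix (not the actual patch; the locals and the example mask value are illustrative):

/*
 * In do_mask_branch(), when the mask is known at translate time, invert
 * it in C and wrap the result in a constant, instead of emitting
 * tcg_gen_not_tl() with a tcg_constant_tl() destination -- constant
 * temps are read-only, which is what !temp_readonly(ots) enforces.
 */
target_ulong mval = 0x80000;        /* hypothetical translate-time mask */
if (invert) {
    mval = ~mval;                   /* invert at translate time, in C */
}
TCGv mask = tcg_constant_tl(mval);  /* never used as a TCG destination */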
Details: https://gitlab.com/qemu-project/qemu/-/issues/3144

The function schedule_next_request is called with tg->lock held, and it
may call throttle_group_co_restart_queue, which takes
tgm->throttled_reqs_lock; qemu_co_mutex_lock may leave the current
coroutine if another iothread has taken the lock. If the next coroutine
calls throttle_group_co_io_limits_intercept, it will try to take the
mutex tg->lock, which will never be released.

Here is the backtrace of the iothread:

Thread 30 (Thread 0x7f8aad1fd6c0 (LWP 24240) "IO iothread2"):
#0  futex_wait (futex_word=0x5611adb7d828, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x5611adb7d828, private=0) at lowlevellock.c:49
#2  0x00007f8ab5a97501 in lll_mutex_lock_optimized (mutex=0x5611adb7d828) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x5611adb7d828) at pthread_mutex_lock.c:93
#4  0x00005611823f5482 in qemu_mutex_lock_impl (mutex=0x5611adb7d828, file=0x56118289daca "../block/throttle-groups.c", line=372) at ../util/qemu-thread-posix.c:94
#5  0x00005611822b0b39 in throttle_group_co_io_limits_intercept (tgm=0x5611af1bb4d8, bytes=4096, direction=THROTTLE_READ) at ../block/throttle-groups.c:372
#6  0x00005611822473b1 in blk_co_do_preadv_part (blk=0x5611af1bb490, offset=15972311040, bytes=4096, qiov=0x7f8aa4000f98, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1354
#7  0x0000561182247fa0 in blk_aio_read_entry (opaque=0x7f8aa4005910) at ../block/block-backend.c:1619
#8  0x000056118241952e in coroutine_trampoline (i0=-1543497424, i1=32650) at ../util/coroutine-ucontext.c:175
#9  0x00007f8ab5a56f70 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66 from target:/lib64/libc.so.6
#10 0x00007f8aad1ef190 in ?? ()
#11 0x0000000000000000 in ?? ()

The lock is taken in line 386:

(gdb) p tg.lock
$1 = {lock = {__data = {__lock = 2, __count = 0, __owner = 24240, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\260^\000\000\001", '\000' <repeats 26 times>, __align = 2}, file = 0x56118289daca "../block/throttle-groups.c", line = 386, initialized = true}

The solution is to use tg->lock to protect both ThrottleGroup fields and
ThrottleGroupMember.throttled_reqs. It doesn't seem to be possible to use
separate locks because we need to first manipulate ThrottleGroup fields,
then schedule the next coroutine using throttled_reqs, and after that
update the token field from ThrottleGroup depending on the throttled_reqs
state.

Signed-off-by: Dmitry Guryanov <dmitry.guryanov@gmail.com>
Message-ID: <20251208085528.890098-1-dmitry.guryanov@gmail.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(cherry picked from commit d4816177654d59e26ce212c436513f01842eb410)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
The current implementation of qdev_get_printable_name() sometimes
returns a string that must not be freed (vdev->id or the fixed
fallback string "<unknown device>") and sometimes returns a string
that must be freed (the return value of qdev_get_dev_path()). This
forces callers to leak the string in the "must be freed" case.
Make the function consistent so that it always returns a string that
the caller must free, and make the three callsites free it.
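A caller-side sketch of the resulting pattern (the context is illustrative, not one of the actual callsites):

char *name = qdev_get_printable_name(DEVICE(vdev));
error_report("%s: unable to map virtqueue", name);
g_free(name);   /* the caller now frees unconditionally */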
This fixes leaks like this that show up when running "make check"
with the address sanitizer enabled:
Direct leak of 13 byte(s) in 1 object(s) allocated from:
#0 0x5561de21f293 in malloc (/home/pm215/qemu/build/san/qemu-system-i386+0x1a2d293) (BuildId: 6d6fad7130fd5c8dbbc03401df554f68b8034936)
#1 0x767ad7a82ac9 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x62ac9) (BuildId: 116e142b9b52c8a4dfd403e759e71ab8f95d8bb3)
#2 0x5561deaf34f2 in pcibus_get_dev_path /home/pm215/qemu/build/san/../../hw/pci/pci.c:2792:12
#3 0x5561df9d8830 in qdev_get_printable_name /home/pm215/qemu/build/san/../../hw/core/qdev.c:431:24
#4 0x5561deebdca2 in virtio_init_region_cache /home/pm215/qemu/build/san/../../hw/virtio/virtio.c:298:17
#5 0x5561df05f842 in memory_region_write_accessor /home/pm215/qemu/build/san/../../system/memory.c:491:5
#6 0x5561df05ed1b in access_with_adjusted_size /home/pm215/qemu/build/san/../../system/memory.c:567:18
#7 0x5561df05e3fa in memory_region_dispatch_write /home/pm215/qemu/build/san/../../system/memory.c
#8 0x5561df0aa805 in address_space_stm_internal /home/pm215/qemu/build/san/../../system/memory_ldst.c.inc:85:13
#9 0x5561df0bcad3 in qtest_process_command /home/pm215/qemu/build/san/../../system/qtest.c:480:13
Cc: qemu-stable@nongnu.org
Fixes: e209d4d ("virtio: improve virtqueue mapping error messages")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20260307155046.3940197-3-peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
(cherry picked from commit 1e3e1d51e20e8b38efa089bf54b5ee2cbbcca221)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
This PR adds DirectNIC QEMU support to nvidia_stable-10.1.