NULL pointer dereference in uvm_hmm_unregister_gpu during CUDA UVM VA space teardown on process exit #1082

@0x010A13D7

Description

@0x010A13D7

NVIDIA Open GPU Kernel Modules Version

595.58.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Note: The system runs exclusively with the open kernel driver (USE=kernel-open on Gentoo/Pentoo). The proprietary driver has not been tested with this specific crash scenario. However, related HMM issues (see #901) have been confirmed open-driver-specific, and the crash occurs at an offset inside nvidia_uvm where HMM state is accessed.

Operating System and Version

Pentoo Linux (Gentoo-based), OpenRC

Kernel Release

6.19.9-pentoo (custom build, 2026-03-26, PREEMPT(voluntary))

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3080 Mobile (GA104, PCI ID 10de:24a0, PCI slot 0000:01:00.0)
System: System76 Oryx Pro, Intel Core i7-12700H + RTX 3080 Mobile hybrid graphics (Optimus/PRIME), BIOS 2022-07-20_ae6aa72

Describe the bug

A kernel NULL pointer dereference occurs at uvm_hmm_unregister_gpu+0x40 in nvidia_uvm during UVM VA space teardown when a CUDA worker thread exits. The fault address is 0x00000000000000a0 (NULL + 0xa0): RAX is 0x0 at the point of the fault, and the faulting instruction reads 8 bytes from [rax+0xa0].
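To illustrate why the fault address equals the member offset: when a struct field is read through a NULL base pointer, the CPU faults at virtual address `0 + offsetof(field)`. The struct below is purely hypothetical (the real uvm_gpu_t layout is internal to nvidia_uvm); it only shows how a member that happens to live at offset 0xa0 reproduces the CR2 value 0x00000000000000a0 seen in the Oops.

```python
# Hypothetical layout for illustration only -- NOT the real nvidia_uvm
# uvm_gpu_t. ctypes computes field offsets the same way the C compiler
# does, so a member placed after 0xa0 bytes of padding sits at offset
# 0xa0, and a read through a NULL base would fault at that address.
import ctypes

class FakeGpu(ctypes.Structure):
    _fields_ = [
        ("pad", ctypes.c_char * 0xA0),   # padding up to the faulting offset
        ("some_field", ctypes.c_uint64), # member at offset 0xa0
    ]

offset = FakeGpu.some_field.offset
base = 0x0                               # NULL GPU pointer (RAX in the Oops)
fault_address = base + offset

print(hex(fault_address))                # matches CR2 = 0xa0
```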

The crashing thread (cuda00001800007, PID 180310, UID 1000) was blocked in sys_poll (userspace ORIG_RAX=7) when it received a signal and was killed, triggering do_exit → UVM VA space cleanup → the crash.

The driver version (595.58.03, built 2026-03-26 11:47) had been running without issue for approximately 85 minutes before the crash. The CUDA workload was active during that entire period.

Adding options nvidia_uvm uvm_disable_hmm=1 to the modprobe configuration has prevented any recurrence so far, which points to the fault being in the HMM (Heterogeneous Memory Management) code path of nvidia_uvm.

Crash dump (recovered from EFI pstore)

[ 5152.818503][T180310] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 5152.818509][T180310] #PF: supervisor read access in kernel mode
[ 5152.818510][T180310] #PF: error_code(0x0000) - not-present page
[ 5152.818512][T180310] PGD 0 P4D 0 
[ 5152.818514][T180310] Oops: Oops: 0000 [#1] SMP NOPTI
[ 5152.818517][T180310] CPU: 4 UID: 1000 PID: 180310 Comm: cuda00001800007 Tainted: G S         OE       6.19.9-pentoo #1 PREEMPT(voluntary)
[ 5152.818519][T180310] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 5152.818520][T180310] Hardware name: System76 Oryx Pro/Oryx Pro, BIOS 2022-07-20_ae6aa72 07/20/2022
[ 5152.818521][T180310] RIP: 0010:uvm_hmm_unregister_gpu+0x40/0x360 [nvidia_uvm]
[ 5152.818540][T180310] Code: ec 30 48 89 54 24 08 e8 7e fb ff ff 88 44 24 1f 84 c0 0f 84 be 01 00 00 48 8b 45 00 4c 8b a5 98 00 00 00 48 8b 80 00 06 00 00 <4c> 03 a0 a0 00 00 00 4c 89 e0 4c 03 a5 90 00 00 00 48 c1 e8 0c 49
[ 5152.818542][T180310] RSP: 0018:ffffcbcaae95bb00 EFLAGS: 00010202
[ 5152.818543][T180310] RAX: 0000000000000000 RBX: ffffcbca8b8fd008 RCX: ffffcbcaae95bbc0
[ 5152.818544][T180310] RDX: 0000000000000000 RSI: ffff88f954a36000 RDI: 0000000000000000
[ 5152.818545][T180310] RBP: ffff88f954a36000 R08: 0000000000000000 R09: 0000000000000000
[ 5152.818546][T180310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5152.818547][T180310] R13: ffffcbca8b8fd078 R14: 0000000000000000 R15: ffffcbca8b8fd0a8
[ 5152.818548][T180310] FS:  0000000000000000(0000) GS:ffff890114cb1000(0000) knlGS:0000000000000000
[ 5152.818549][T180310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5152.818550][T180310] CR2: 00000000000000a0 CR3: 0000000434220000 CR4: 0000000000f52ef0
[ 5152.818553][T180310] PKRU: 55555554
[ 5152.818553][T180310] Call Trace:
[ 5152.818555][T180310]  <TASK>
[ 5152.818557][T180310]  uvm_va_space_single_gpu_in_parent+0x689/0x9d0 [nvidia_uvm]
[ 5152.818570][T180310]  uvm_va_space_destroy+0x1eb/0x500 [nvidia_uvm]
[ 5152.818582][T180310]  nv_kthread_q_run_self_test+0x395/0x39c0 [nvidia_uvm]
[ 5152.818589][T180310]  nv_kthread_q_run_self_test+0x50c/0x39c0 [nvidia_uvm]
[ 5152.818596][T180310]  nv_kthread_q_run_self_test+0x578/0x39c0 [nvidia_uvm]
[ 5152.818603][T180310]  __fput+0xe1/0x2a0
[ 5152.818605][T180310]  task_work_run+0x57/0x90
[ 5152.818608][T180310]  do_exit+0x29a/0xa20
[ 5152.818610][T180310]  do_group_exit+0x2b/0x80
[ 5152.818612][T180310]  get_signal+0x7ad/0x820
[ 5152.818615][T180310]  arch_do_signal_or_restart+0x39/0x240
[ 5152.818618][T180310]  exit_to_user_mode_loop+0x68/0x3c0
[ 5152.818620][T180310]  do_syscall_64+0x2d6/0x550
[ 5152.818626][T180310]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5152.818628][T180310] RIP: 0033:0x7cc5126af692
[ 5152.818630][T180310] Code: Unable to access opcode bytes at 0x7cc5126af668.
[ 5152.818631][T180310] RSP: 002b:00007cc50adfecc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
[ 5152.818632][T180310] RAX: fffffffffffffdfc RBX: 00007cc50adff6c0 RCX: 00007cc5126af692
[ 5152.818633][T180310] RDX: ffffffffffffffff RSI: 0000000000000003 RDI: 00005ce98338e360
[ 5152.818634][T180310] RBP: 00007cc50adfedd0 R08: 0000000000000000 R09: 0000000000000000
[ 5152.818701][T180310] CR2: 00000000000000a0
[ 5152.818702][T180310] ---[ end trace 0000000000000000 ]---

Decoding the faulting instruction: in the Code: line, the byte sequence starting at the fault marker (<4c>) disassembles to add r12, QWORD PTR [rax+0xa0], an 8-byte read from rax+0xa0 with rax=0. In other words, a field at offset 0xa0 is being read through a NULL pointer to what should be a GPU struct. RDI is also 0x0, consistent with a NULL GPU pointer having been passed in.
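The decode above can be checked by hand from the x86-64 encoding rules. The sketch below walks the exact bytes from the Code: line (REX prefix 0x4c, opcode 0x03 = ADD r64, r/m64, ModRM with mod=10 selecting [base + disp32]); it handles only this one encoding form and is not a general disassembler.

```python
# Hand-decode of the faulting bytes from the Oops "Code:" line:
#   4c 03 a0 a0 00 00 00
# Covers only the REX + 0x03 (ADD r64, r/m64) mod=10/disp32 form.
import struct

code = bytes.fromhex("4c03a0a0000000")

rex = code[0]
rex_w = (rex >> 3) & 1                   # 1 -> 64-bit operand size
rex_r = (rex >> 2) & 1                   # extends ModRM.reg
opcode = code[1]                         # 0x03 = ADD r, r/m
modrm = code[2]
mod = modrm >> 6                         # 0b10 -> [base + disp32]
reg = ((modrm >> 3) & 7) | (rex_r << 3)  # 0b100 | 0b1000 -> r12
rm = modrm & 7                           # 0b000 -> rax
disp32 = struct.unpack("<i", code[3:7])[0]

regs64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

assert opcode == 0x03 and mod == 0b10 and rex_w == 1
print(f"add {regs64[reg]}, qword ptr [{regs64[rm]} + {disp32:#x}]")

rax = 0x0                                # from the Oops register dump
print(f"effective address = {rax + disp32:#x}")   # matches CR2
```

Plugging in RAX=0 from the register dump gives an effective address of 0xa0, matching CR2.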

To Reproduce

Cannot reproduce on demand. The crash occurred once during normal system use while a CUDA workload was running (most likely background Steam shader pre-compilation via fossilize/nv-fossilize, which runs as UID 1000 and uses CUDA). The CUDA worker thread was signaled while blocked in sys_poll and crashed during exit cleanup.

The crash has not recurred since adding options nvidia_uvm uvm_disable_hmm=1 to /etc/modprobe.d/nvidia.conf, which is consistent with the fault being in the HMM path.
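For reference, the workaround as applied on this system (file path as stated above). The sysfs read-back assumes uvm_disable_hmm is exposed under /sys/module, which is standard for nvidia_uvm module parameters; this is a sketch, not an authoritative procedure.

```shell
# Append the HMM-disable option to the existing modprobe config
# (path from this report; adjust if your distro uses a different file).
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee -a /etc/modprobe.d/nvidia.conf

# After reloading nvidia_uvm (or rebooting), confirm HMM is disabled.
# Assumes the parameter is exported via sysfs; expected value: 1
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm
```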

Bug Incidence

Happened once (single occurrence recovered from EFI pstore). Workaround (uvm_disable_hmm=1) applied since then.

nvidia-bug-report.log.gz

Not available — the system was not running when the crash was analyzed; all crash data was recovered from EFI pstore after reboot.

More Info

  • Related: #901 — HMM causes CUDA initialization failures on EL9 systems (different symptom, same subsystem, same uvm_disable_hmm=1 workaround)
  • The kernel was also built with CONFIG_RANDSTRUCT_PERFORMANCE=y (Pentoo default). This is unrelated to this crash but is noted for completeness (it affects nvidia_drm/nvidia_modeset, not nvidia_uvm).
  • [S]=CPU_OUT_OF_SPEC taint flag is present because the BIOS has disabled eist (Intel SpeedStep) on this system, which the kernel detects. It is not related to this crash.
  • Driver was freshly rebuilt on the same day as the crash (built 11:47, crash at ~13:17, i.e. roughly 90 minutes from build to crash; the kernel log timestamp of ~5152 s matches the ~85 minutes of driver runtime noted above, CUDA active throughout).
