NULL pointer dereference in uvm_hmm_unregister_gpu during CUDA UVM VA space teardown on process exit #1082

@0x010A13D7

Description

@0x010A13D7

NVIDIA Open GPU Kernel Modules Version

595.58.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Note: The system runs exclusively with the open kernel driver (USE=kernel-open on Gentoo/Pentoo). The proprietary driver has not been tested with this specific crash scenario. However, related HMM issues (see #901) have been confirmed open-driver-specific, and the crash occurs at an offset inside nvidia_uvm where HMM state is accessed.

Operating System and Version

Pentoo Linux (Gentoo-based), OpenRC

Kernel Release

6.19.9-pentoo (custom build, 2026-03-26, PREEMPT(voluntary))

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3080 Mobile (GA104, PCI ID 10de:24a0, PCI slot 0000:01:00.0)
System: System76 Oryx Pro, Intel Core i7-12700H + RTX 3080 Mobile hybrid graphics (Optimus/PRIME), BIOS 2022-07-20_ae6aa72

Describe the bug

A kernel NULL pointer dereference occurs at uvm_hmm_unregister_gpu+0x40 in nvidia_uvm during UVM VA space teardown when a CUDA worker thread exits. The fault address is 0x00000000000000a0 (NULL + 0xa0): RAX is 0x0 at the point of the fault, and the faulting instruction reads 8 bytes from [rax+0xa0].
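To illustrate why the fault address equals the member offset: when a struct field is read through a NULL base pointer, the CPU faults at virtual address `0 + offsetof(field)`. The struct below is purely hypothetical (the real uvm_gpu_t layout is internal to nvidia_uvm); it only shows how a member that happens to live at offset 0xa0 reproduces the CR2 value 0x00000000000000a0 seen in the Oops.

```python
# Hypothetical layout for illustration only -- NOT the real nvidia_uvm
# uvm_gpu_t. ctypes computes field offsets the same way the C compiler
# does, so a member placed after 0xa0 bytes of padding sits at offset
# 0xa0, and a read through a NULL base would fault at that address.
import ctypes

class FakeGpu(ctypes.Structure):
    _fields_ = [
        ("pad", ctypes.c_char * 0xA0),   # padding up to the faulting offset
        ("some_field", ctypes.c_uint64), # member at offset 0xa0
    ]

offset = FakeGpu.some_field.offset
base = 0x0                               # NULL GPU pointer (RAX in the Oops)
fault_address = base + offset

print(hex(fault_address))                # matches CR2 = 0xa0
```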

The crashing thread (cuda00001800007, PID 180310, UID 1000) was blocked in sys_poll (userspace ORIG_RAX=7) when it received a signal and was killed, triggering do_exit → UVM VA space cleanup → the crash.

The driver version (595.58.03, built 2026-03-26 11:47) had been running without issue for approximately 85 minutes before the crash. The CUDA workload was active during that entire period.

Adding options nvidia_uvm uvm_disable_hmm=1 to the modprobe configuration has prevented any recurrence so far, which points to the fault being in the HMM (Heterogeneous Memory Management) code path of nvidia_uvm.

Crash dump (recovered from EFI pstore)

[ 5152.818503][T180310] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 5152.818509][T180310] #PF: supervisor read access in kernel mode
[ 5152.818510][T180310] #PF: error_code(0x0000) - not-present page
[ 5152.818512][T180310] PGD 0 P4D 0 
[ 5152.818514][T180310] Oops: Oops: 0000 [#1] SMP NOPTI
[ 5152.818517][T180310] CPU: 4 UID: 1000 PID: 180310 Comm: cuda00001800007 Tainted: G S         OE       6.19.9-pentoo #1 PREEMPT(voluntary)
[ 5152.818519][T180310] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 5152.818520][T180310] Hardware name: System76 Oryx Pro/Oryx Pro, BIOS 2022-07-20_ae6aa72 07/20/2022
[ 5152.818521][T180310] RIP: 0010:uvm_hmm_unregister_gpu+0x40/0x360 [nvidia_uvm]
[ 5152.818540][T180310] Code: ec 30 48 89 54 24 08 e8 7e fb ff ff 88 44 24 1f 84 c0 0f 84 be 01 00 00 48 8b 45 00 4c 8b a5 98 00 00 00 48 8b 80 00 06 00 00 <4c> 03 a0 a0 00 00 00 4c 89 e0 4c 03 a5 90 00 00 00 48 c1 e8 0c 49
[ 5152.818542][T180310] RSP: 0018:ffffcbcaae95bb00 EFLAGS: 00010202
[ 5152.818543][T180310] RAX: 0000000000000000 RBX: ffffcbca8b8fd008 RCX: ffffcbcaae95bbc0
[ 5152.818544][T180310] RDX: 0000000000000000 RSI: ffff88f954a36000 RDI: 0000000000000000
[ 5152.818545][T180310] RBP: ffff88f954a36000 R08: 0000000000000000 R09: 0000000000000000
[ 5152.818546][T180310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5152.818547][T180310] R13: ffffcbca8b8fd078 R14: 0000000000000000 R15: ffffcbca8b8fd0a8
[ 5152.818548][T180310] FS:  0000000000000000(0000) GS:ffff890114cb1000(0000) knlGS:0000000000000000
[ 5152.818549][T180310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5152.818550][T180310] CR2: 00000000000000a0 CR3: 0000000434220000 CR4: 0000000000f52ef0
[ 5152.818553][T180310] PKRU: 55555554
[ 5152.818553][T180310] Call Trace:
[ 5152.818555][T180310]  <TASK>
[ 5152.818557][T180310]  uvm_va_space_single_gpu_in_parent+0x689/0x9d0 [nvidia_uvm]
[ 5152.818570][T180310]  uvm_va_space_destroy+0x1eb/0x500 [nvidia_uvm]
[ 5152.818582][T180310]  nv_kthread_q_run_self_test+0x395/0x39c0 [nvidia_uvm]
[ 5152.818589][T180310]  nv_kthread_q_run_self_test+0x50c/0x39c0 [nvidia_uvm]
[ 5152.818596][T180310]  nv_kthread_q_run_self_test+0x578/0x39c0 [nvidia_uvm]
[ 5152.818603][T180310]  __fput+0xe1/0x2a0
[ 5152.818605][T180310]  task_work_run+0x57/0x90
[ 5152.818608][T180310]  do_exit+0x29a/0xa20
[ 5152.818610][T180310]  do_group_exit+0x2b/0x80
[ 5152.818612][T180310]  get_signal+0x7ad/0x820
[ 5152.818615][T180310]  arch_do_signal_or_restart+0x39/0x240
[ 5152.818618][T180310]  exit_to_user_mode_loop+0x68/0x3c0
[ 5152.818620][T180310]  do_syscall_64+0x2d6/0x550
[ 5152.818626][T180310]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5152.818628][T180310] RIP: 0033:0x7cc5126af692
[ 5152.818630][T180310] Code: Unable to access opcode bytes at 0x7cc5126af668.
[ 5152.818631][T180310] RSP: 002b:00007cc50adfecc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
[ 5152.818632][T180310] RAX: fffffffffffffdfc RBX: 00007cc50adff6c0 RCX: 00007cc5126af692
[ 5152.818633][T180310] RDX: ffffffffffffffff RSI: 0000000000000003 RDI: 00005ce98338e360
[ 5152.818634][T180310] RBP: 00007cc50adfedd0 R08: 0000000000000000 R09: 0000000000000000
[ 5152.818701][T180310] CR2: 00000000000000a0
[ 5152.818702][T180310] ---[ end trace 0000000000000000 ]---

Decoding the faulting instruction: in the Code: line, the byte sequence starting at the fault marker (<4c>) disassembles to add r12, QWORD PTR [rax+0xa0], an 8-byte read from rax+0xa0 with rax=0. In other words, a field at offset 0xa0 is being read through a NULL pointer to what should be a GPU struct. RDI is also 0x0, consistent with a NULL GPU pointer having been passed in.
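The decode above can be checked by hand from the x86-64 encoding rules. The sketch below walks the exact bytes from the Code: line (REX prefix 0x4c, opcode 0x03 = ADD r64, r/m64, ModRM with mod=10 selecting [base + disp32]); it handles only this one encoding form and is not a general disassembler.

```python
# Hand-decode of the faulting bytes from the Oops "Code:" line:
#   4c 03 a0 a0 00 00 00
# Covers only the REX + 0x03 (ADD r64, r/m64) mod=10/disp32 form.
import struct

code = bytes.fromhex("4c03a0a0000000")

rex = code[0]
rex_w = (rex >> 3) & 1                   # 1 -> 64-bit operand size
rex_r = (rex >> 2) & 1                   # extends ModRM.reg
opcode = code[1]                         # 0x03 = ADD r, r/m
modrm = code[2]
mod = modrm >> 6                         # 0b10 -> [base + disp32]
reg = ((modrm >> 3) & 7) | (rex_r << 3)  # 0b100 | 0b1000 -> r12
rm = modrm & 7                           # 0b000 -> rax
disp32 = struct.unpack("<i", code[3:7])[0]

regs64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

assert opcode == 0x03 and mod == 0b10 and rex_w == 1
print(f"add {regs64[reg]}, qword ptr [{regs64[rm]} + {disp32:#x}]")

rax = 0x0                                # from the Oops register dump
print(f"effective address = {rax + disp32:#x}")   # matches CR2
```

Plugging in RAX=0 from the register dump gives an effective address of 0xa0, matching CR2.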

To Reproduce

Cannot reproduce on demand. The crash occurred once during normal system use while a CUDA workload was running (most likely background Steam shader pre-compilation via fossilize/nv-fossilize, which runs as UID 1000 and uses CUDA). The CUDA worker thread was signaled while blocked in sys_poll and crashed during exit cleanup.

The crash has not recurred since adding options nvidia_uvm uvm_disable_hmm=1 to /etc/modprobe.d/nvidia.conf, which is consistent with the fault being in the HMM path.
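For reference, the workaround as applied on this system (file path as stated above). The sysfs read-back assumes uvm_disable_hmm is exposed under /sys/module, which is standard for nvidia_uvm module parameters; this is a sketch, not an authoritative procedure.

```shell
# Append the HMM-disable option to the existing modprobe config
# (path from this report; adjust if your distro uses a different file).
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee -a /etc/modprobe.d/nvidia.conf

# After reloading nvidia_uvm (or rebooting), confirm HMM is disabled.
# Assumes the parameter is exported via sysfs; expected value: 1
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm
```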

Bug Incidence

Happened once (single occurrence recovered from EFI pstore). Workaround (uvm_disable_hmm=1) applied since then.

nvidia-bug-report.log.gz

Not available — the system was not running when the crash was analyzed; all crash data was recovered from EFI pstore after reboot.

More Info

  • Related: #901 — HMM causes CUDA initialization failures on EL9 systems (different symptom, same subsystem, same uvm_disable_hmm=1 workaround)
  • The kernel was also built with CONFIG_RANDSTRUCT_PERFORMANCE=y (Pentoo default). This is unrelated to this crash but is noted for completeness (it affects nvidia_drm/nvidia_modeset, not nvidia_uvm).
  • [S]=CPU_OUT_OF_SPEC taint flag is present because the BIOS has disabled eist (Intel SpeedStep) on this system, which the kernel detects. It is not related to this crash.
  • Driver was freshly rebuilt on the same day as the crash (built 11:47, crash at ~13:17, i.e. roughly 90 minutes from build to crash; the kernel log timestamp of ~5152 s matches the ~85 minutes of driver runtime noted above, CUDA active throughout).
