Describe the bug
Multi-GPU distributed training (torch.distributed.run --distributed) with the Newton physics backend fails due to multiple CUDA device management issues. All internal Warp/PyTorch allocations default to cuda:0, causing cross-device memory access errors when training on cuda:1.
Steps to reproduce
```bash
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    scripts/reinforcement_learning/rsl_rl/train.py \
    --task <any-newton-based-task> --headless --distributed
```
System Info
- IsaacLab branch: dev/newton
- GPUs: 2x NVIDIA GeForce RTX 5090 (sm_120, Blackwell, no P2P support)
- CUDA Toolkit: 12.9, Driver: 13.0
- Warp: 1.11.1
- OS: Ubuntu (Linux kernel 6.17)
Root Causes
1. NewtonManager does not set CUDA/Warp default device before allocations
File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py
Both start_simulation() and initialize_solver() create Warp/Newton objects without first calling torch.cuda.set_device() / wp.set_device(). Internal allocations (e.g., inside ModelBuilder.finalize(), SolverMuJoCo.__init__(), mujoco_warp collision buffers) that use wp.empty() without an explicit device parameter default to cuda:0.
Suggested fix — add at the start of start_simulation() and initialize_solver():

```python
device = PhysicsManager._device
if device and device.startswith("cuda"):
    import torch

    torch.cuda.set_device(device)
    wp.set_device(device)
```
2. wp.ScopedCapture() called without device parameter
File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py, line ~379
```python
# Current:
with wp.ScopedCapture() as capture:
    ...

# Should be:
with wp.ScopedCapture(device=device) as capture:
    ...
```
3. Missing device in Articulation.set_joint_position_target()
File: source/isaaclab_newton/isaaclab_newton/assets/articulation/articulation.py, line ~1604
This is the only call to make_complete_data_from_torch_dual_index in the file that doesn't pass device=self.device. All other 19 calls pass it correctly.
```python
# Current (defaults to "cuda:0"):
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32
)

# Should be:
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32, device=self.device
)
```
Related: make_complete_data_from_torch_dual_index() in isaaclab/utils/warp/utils.py has device: str = "cuda:0" as a hardcoded default. Consider changing this to dynamically resolve the current device.
4. AppLauncher does not clean up Kit-style sys.argv in standalone mode
File: source/isaaclab/isaaclab/app/app_launcher.py
In distributed mode, AppLauncher injects --/plugins/carb.tasking.plugin/threadCount=N into sys.argv (line 897). In Omniverse mode this is cleaned up inside _create_app() (lines 1011-1013), but in standalone mode (Newton/Rerun) _create_app() is never called, so the argument stays in sys.argv and Hydra fails:
```
train.py: error: unrecognized arguments: --/plugins/carb.tasking.plugin/threadCount=12
```
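A possible standalone-mode fix is to strip Kit-style settings overrides (arguments beginning with `--/`) from sys.argv before Hydra parses it. A sketch under the assumption that all Kit overrides share that prefix (this is not the existing _create_app() cleanup logic):

```python
def strip_kit_args(argv: list[str]) -> list[str]:
    """Drop Kit-style settings overrides, e.g.
    --/plugins/carb.tasking.plugin/threadCount=12, which argparse/Hydra
    would otherwise reject as unrecognized arguments."""
    return [arg for arg in argv if not arg.startswith("--/")]
```

AppLauncher could apply `sys.argv = strip_kit_args(sys.argv)` in the standalone (Newton/Rerun) code path, mirroring the cleanup _create_app() already performs in Omniverse mode.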
Additional workaround needed for consumer GPUs
On GPUs without P2P/NVLink support (most consumer GPUs), NCCL's P2P transport causes CUDA error: an illegal memory access was encountered during broadcast_object_list(). This is likely a PyTorch/NCCL issue, not IsaacLab-specific, but it blocks all multi-GPU Newton training on consumer hardware.
Workaround:
```bash
NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 python -m torch.distributed.run ...
```
Consider documenting this in the multi-GPU training guide.
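Rather than requiring users to remember the env vars, the training scripts could apply them conditionally before torch.distributed initializes, guarded by a peer-access check. A sketch (torch.cuda.can_device_access_peer is a real PyTorch API; the helper itself is an assumption):

```python
def disable_nccl_p2p_if_unsupported(env: dict, p2p_supported: bool) -> dict:
    """Add the NCCL overrides only when the GPUs cannot access each
    other's memory directly (typical for consumer cards without
    P2P/NVLink). setdefault() preserves any user-provided values."""
    if not p2p_supported:
        env.setdefault("NCCL_P2P_DISABLE", "1")
        env.setdefault("NCCL_SHM_DISABLE", "1")
    return env
```

In practice the flag would come from something like `torch.cuda.can_device_access_peer(0, 1)`, and the result would be merged into os.environ before init_process_group() is called.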
Verification
After applying fixes 1-4 and the NCCL workaround, distributed training with RSL-RL PPO runs successfully across 2 GPUs (2x1024 environments).