[Bug Report] Multi-GPU distributed training fails with Newton physics backend #5132

@jkkim-irim

Description

Describe the bug

Multi-GPU distributed training (torch.distributed.run with the --distributed flag) fails with the Newton physics backend due to several CUDA device management issues. Internal Warp/PyTorch allocations default to cuda:0, so ranks assigned to cuda:1 hit cross-device memory access errors.

Steps to reproduce

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
  scripts/reinforcement_learning/rsl_rl/train.py \
  --task <any-newton-based-task> --headless --distributed

System Info

  • IsaacLab branch: dev/newton
  • GPUs: 2x NVIDIA GeForce RTX 5090 (sm_120, Blackwell, no P2P support)
  • CUDA Toolkit: 12.9, Driver: 13.0
  • Warp: 1.11.1
  • OS: Ubuntu (Linux kernel 6.17)

Root Causes

1. NewtonManager does not set CUDA/Warp default device before allocations

File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py

Both start_simulation() and initialize_solver() create Warp/Newton objects without first calling torch.cuda.set_device() / wp.set_device(). Internal allocations (e.g., inside ModelBuilder.finalize(), SolverMuJoCo.__init__(), mujoco_warp collision buffers) that use wp.empty() without an explicit device parameter default to cuda:0.

Suggested fix — add at the start of start_simulation() and initialize_solver():

# Pin both frameworks' default device before any allocations happen.
device = PhysicsManager._device
if device and device.startswith("cuda"):
    import torch

    torch.cuda.set_device(device)
    wp.set_device(device)
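Since the same pinning is needed in both start_simulation() and initialize_solver(), it could be factored into a small helper. A minimal sketch (pin_default_devices is a hypothetical name, not an existing IsaacLab API; the framework setters, e.g. torch.cuda.set_device and wp.set_device, are injected as callables so the logic is testable without CUDA):

```python
def pin_default_devices(device, setters=()):
    """Pin every framework's default device to `device` (hypothetical helper).

    `setters` are framework-specific callables such as
    (torch.cuda.set_device, wp.set_device), injected so this function
    has no hard framework imports.
    Returns True if pinning was applied, False for CPU or unset devices.
    """
    if not (device and device.startswith("cuda")):
        return False  # CPU or no device configured: nothing to pin
    for setter in setters:
        setter(device)
    return True
```

Calling this once at the top of each entry point keeps the two code paths from drifting apart.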

2. wp.ScopedCapture() called without device parameter

File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py, line ~379

# Current:
with wp.ScopedCapture() as capture:

# Should be:
with wp.ScopedCapture(device=device) as capture:

3. Missing device in Articulation.set_joint_position_target()

File: source/isaaclab_newton/isaaclab_newton/assets/articulation/articulation.py, line ~1604

This is the only call to make_complete_data_from_torch_dual_index in the file that doesn't pass device=self.device. All other 19 calls pass it correctly.

# Current (defaults to "cuda:0"):
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32
)

# Should be:
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32, device=self.device
)

Related: make_complete_data_from_torch_dual_index() in isaaclab/utils/warp/utils.py hardcodes device: str = "cuda:0" as its default. Consider defaulting to None and resolving the active device at call time instead.
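One way to remove the hardcoded default is the usual None-sentinel pattern: default the parameter to None and resolve the active device when the function runs. A self-contained sketch (get_current_device and make_data are illustrative stand-ins; in IsaacLab the resolution would use a real Warp query such as wp.get_device()):

```python
_current_device = "cuda:1"  # stand-in for the process-local device state


def get_current_device():
    # In IsaacLab this would query the framework, e.g. str(wp.get_device()).
    return _current_device


def make_data(n, device=None):
    # None sentinel: resolve the device at call time instead of baking
    # "cuda:0" into the function signature, so each rank gets its own GPU.
    if device is None:
        device = get_current_device()
    return n, device
```

Callers that already pass device=self.device are unaffected; callers that omit it now follow the rank's device instead of cuda:0.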

4. AppLauncher does not clean up Kit-style sys.argv in standalone mode

File: source/isaaclab/isaaclab/app/app_launcher.py

In distributed mode, AppLauncher injects --/plugins/carb.tasking.plugin/threadCount=N into sys.argv (line 897). In Omniverse mode this is cleaned up inside _create_app() (lines 1011-1013), but in standalone mode (Newton/Rerun) _create_app() is never called, so the argument stays in sys.argv and Hydra fails:

train.py: error: unrecognized arguments: --/plugins/carb.tasking.plugin/threadCount=12
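A possible fix is to strip Kit-style settings from sys.argv in the standalone path as well. A minimal sketch (strip_kit_args is a hypothetical helper, and it assumes all Kit settings share the "--/" prefix):

```python
import sys


def strip_kit_args(argv):
    # Drop Kit-style settings ("--/path/to/setting=value") so downstream
    # parsers such as Hydra never see them.
    return [a for a in argv if not a.startswith("--/")]


# Usage in the standalone launch path, e.g.:
# sys.argv[:] = strip_kit_args(sys.argv)
```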

Additional workaround needed for consumer GPUs

On GPUs without P2P/NVLink support (most consumer GPUs), NCCL's P2P transport causes CUDA error: an illegal memory access was encountered during broadcast_object_list(). This is likely a PyTorch/NCCL issue, not IsaacLab-specific, but it blocks all multi-GPU Newton training on consumer hardware.

Workaround:

NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 python -m torch.distributed.run ...

Consider documenting this in the multi-GPU training guide.
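For setups where prepending environment variables to the launch command is awkward, the same workaround can be applied programmatically at the top of the training script, as long as it runs before torch.distributed.init_process_group() (or any other NCCL communicator creation):

```python
import os

# Disable NCCL's P2P and shared-memory transports on GPUs without
# P2P/NVLink support. Must execute before any process-group init.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
```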

Verification

After applying fixes 1-4 and the NCCL workaround, distributed training with RSL-RL PPO runs successfully across 2 GPUs (2x1024 environments).
