Describe the bug
Multi-GPU distributed training (torch.distributed.run --distributed) with the Newton physics backend fails due to multiple CUDA device management issues. All internal Warp/PyTorch allocations default to cuda:0, causing cross-device memory access errors when training on cuda:1.
Steps to reproduce
```bash
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    scripts/reinforcement_learning/rsl_rl/train.py \
    --task <any-newton-based-task> --headless --distributed
```
System Info
- IsaacLab branch: dev/newton
- GPUs: 2x NVIDIA GeForce RTX 5090 (sm_120, Blackwell, no P2P support)
- CUDA Toolkit: 12.9, Driver: 13.0
- Warp: 1.11.1
- OS: Ubuntu (Linux kernel 6.17)
Root Causes
1. NewtonManager does not set CUDA/Warp default device before allocations
File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py
Both start_simulation() and initialize_solver() create Warp/Newton objects without first calling torch.cuda.set_device() / wp.set_device(). Internal allocations (e.g., inside ModelBuilder.finalize(), SolverMuJoCo.__init__(), mujoco_warp collision buffers) that use wp.empty() without an explicit device parameter default to cuda:0.
Suggested fix — add at the start of start_simulation() and initialize_solver():

```python
device = PhysicsManager._device
if device and device.startswith("cuda"):
    import torch

    torch.cuda.set_device(device)
    wp.set_device(device)
```
2. wp.ScopedCapture() called without device parameter
File: source/isaaclab_newton/isaaclab_newton/physics/newton_manager.py, line ~379
```python
# Current:
with wp.ScopedCapture() as capture:
    ...

# Should be:
with wp.ScopedCapture(device=device) as capture:
    ...
```
3. Missing device in Articulation.set_joint_position_target()
File: source/isaaclab_newton/isaaclab_newton/assets/articulation/articulation.py, line ~1604
This is the only call to make_complete_data_from_torch_dual_index in the file that doesn't pass device=self.device. All other 19 calls pass it correctly.
```python
# Current (defaults to "cuda:0"):
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32
)

# Should be:
make_complete_data_from_torch_dual_index(
    target, self.num_instances, self.num_joints, env_ids, joint_ids, dtype=wp.float32, device=self.device
)
```
Related: make_complete_data_from_torch_dual_index() in isaaclab/utils/warp/utils.py has device: str = "cuda:0" as a hardcoded default. Consider changing this to dynamically resolve the current device.
4. AppLauncher does not clean up Kit-style sys.argv in standalone mode
File: source/isaaclab/isaaclab/app/app_launcher.py
In distributed mode, AppLauncher injects --/plugins/carb.tasking.plugin/threadCount=N into sys.argv (line 897). In Omniverse mode this is cleaned up inside _create_app() (lines 1011-1013), but in standalone mode (Newton/Rerun) _create_app() is never called, so the argument stays in sys.argv and Hydra fails:
```
train.py: error: unrecognized arguments: --/plugins/carb.tasking.plugin/threadCount=12
```
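A possible standalone-mode fix is to strip Kit-style settings overrides (arguments beginning with `--/`) from sys.argv before Hydra parses it. A sketch under the assumption that all Kit overrides share that prefix (this is not the existing _create_app() cleanup logic):

```python
def strip_kit_args(argv: list[str]) -> list[str]:
    """Drop Kit-style settings overrides, e.g.
    --/plugins/carb.tasking.plugin/threadCount=12, which argparse/Hydra
    would otherwise reject as unrecognized arguments."""
    return [arg for arg in argv if not arg.startswith("--/")]
```

AppLauncher could apply `sys.argv = strip_kit_args(sys.argv)` in the standalone (Newton/Rerun) code path, mirroring the cleanup _create_app() already performs in Omniverse mode.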
Additional workaround needed for consumer GPUs
On GPUs without P2P/NVLink support (most consumer GPUs), NCCL's P2P transport causes CUDA error: an illegal memory access was encountered during broadcast_object_list(). This is likely a PyTorch/NCCL issue, not IsaacLab-specific, but it blocks all multi-GPU Newton training on consumer hardware.
Workaround:
```bash
NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 python -m torch.distributed.run ...
```
Consider documenting this in the multi-GPU training guide.
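Rather than requiring users to remember the env vars, the training scripts could apply them conditionally before torch.distributed initializes, guarded by a peer-access check. A sketch (torch.cuda.can_device_access_peer is a real PyTorch API; the helper itself is an assumption):

```python
def disable_nccl_p2p_if_unsupported(env: dict, p2p_supported: bool) -> dict:
    """Add the NCCL overrides only when the GPUs cannot access each
    other's memory directly (typical for consumer cards without
    P2P/NVLink). setdefault() preserves any user-provided values."""
    if not p2p_supported:
        env.setdefault("NCCL_P2P_DISABLE", "1")
        env.setdefault("NCCL_SHM_DISABLE", "1")
    return env
```

In practice the flag would come from something like `torch.cuda.can_device_access_peer(0, 1)`, and the result would be merged into os.environ before init_process_group() is called.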
Verification
After applying fixes 1-4 and the NCCL workaround, distributed training with RSL-RL PPO runs successfully across 2 GPUs (2x1024 environments).