[megatron] enable bucketed weight sync for non-colocated nccl weight sync in megatron #1324

Merged
SumanthRH merged 6 commits into NovaSky-AI:main from erictang000:enable_nccl_bucketing_megatron
Mar 14, 2026

erictang000 (Collaborator) commented Mar 14, 2026

Enable packed broadcast for non-colocated Megatron weight sync

Bucketing was previously enabled only for the CUDA IPC strategy (colocated mode). This PR extends it to the broadcast strategy (non-colocated mode): the sender packs all tensors in each bucket into a single contiguous buffer and broadcasts it in one NCCL operation, matching how CUDA IPC already works. This reduces both per-tensor NCCL overhead and the number of HTTP round-trips, which matters most for MoE models with many small expert parameters.
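To make the saving concrete, here is a minimal sketch of greedy size-based bucketing. It is not the code from this PR: `make_buckets`, the `(name, size_in_bytes)` pairs, and the 16 MiB limit are all hypothetical and chosen only for illustration. It shows how many broadcasts remain once many small expert weights are grouped.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical (name, size_in_bytes) pairs standing in for model parameters.
Param = Tuple[str, int]


def make_buckets(params: Iterable[Param], max_bucket_bytes: int) -> Iterator[List[Param]]:
    """Greedily group parameters so each bucket stays within max_bucket_bytes.

    A single parameter larger than the limit still gets a bucket of its own.
    """
    bucket: List[Param] = []
    bucket_bytes = 0
    for name, nbytes in params:
        # Close the current bucket if adding this parameter would overflow it.
        if bucket and bucket_bytes + nbytes > max_bucket_bytes:
            yield bucket
            bucket, bucket_bytes = [], 0
        bucket.append((name, nbytes))
        bucket_bytes += nbytes
    if bucket:
        yield bucket


MIB = 1 << 20
# 64 small "expert" weights of 1 MiB each, plus one 8 MiB dense weight (made-up sizes).
params = [(f"expert_{i}.w", MIB) for i in range(64)] + [("dense.w", 8 * MIB)]
buckets = list(make_buckets(params, max_bucket_bytes=16 * MIB))

print(f"{len(params)} per-tensor broadcasts -> {len(buckets)} bucketed broadcasts")
# prints "65 per-tensor broadcasts -> 5 bucketed broadcasts"
```

With one broadcast per bucket instead of one per tensor, the fixed per-operation cost is paid 5 times rather than 65 in this toy setup; the actual bucket size and grouping policy used by the worker may differ.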

Changes

  • broadcast_strategy.py: Add sizes field to BroadcastWeightUpdateRequest; sender packs tensors into one buffer and broadcasts once per bucket; receiver unpacks using sizes
  • megatron_worker.py: Always enable bucketing, removing the CudaIpcTransferStrategy-only guard
  • test_megatron_worker.py: Add non_colocated_moe test entry for Moonlight-16B-A3B-Instruct
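The pack and unpack steps described for broadcast_strategy.py can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the real code operates on GPU tensors and an NCCL broadcast, whereas this sketch uses the standard-library `array` type as a stand-in, and `pack_bucket` and `unpack_bucket` are hypothetical names. Only the idea of a per-tensor `sizes` list comes from the PR description; shape and dtype handling is omitted.

```python
from array import array
from typing import Dict, List, Tuple


def pack_bucket(tensors: Dict[str, array]) -> Tuple[array, List[str], List[int]]:
    """Sender side: concatenate every tensor in a bucket into one flat buffer.

    Returns the buffer plus the names and element counts needed to split it again.
    """
    packed = array("f")
    names: List[str] = []
    sizes: List[int] = []
    for name, values in tensors.items():
        names.append(name)
        sizes.append(len(values))
        packed.extend(values)
    return packed, names, sizes


def unpack_bucket(packed: array, names: List[str], sizes: List[int]) -> Dict[str, array]:
    """Receiver side: slice the flat buffer back into named tensors using sizes."""
    out: Dict[str, array] = {}
    offset = 0
    for name, size in zip(names, sizes):
        out[name] = packed[offset:offset + size]
        offset += size
    if offset != len(packed):
        raise ValueError("sizes do not add up to the length of the packed buffer")
    return out


# Hypothetical bucket with three small weights.
bucket = {
    "expert_0.w": array("f", [1.0, 2.0]),
    "expert_1.w": array("f", [0.5]),
    "expert_2.w": array("f", [3.0, 4.0, 5.0]),
}
packed, names, sizes = pack_bucket(bucket)
# In the real flow, `packed` would be sent in a single broadcast and
# `names`/`sizes` would travel in the update request.
restored = unpack_bucket(packed, names, sizes)
print(sizes, restored == bucket)
```

The receiver needs only the buffer and the metadata to rebuild every tensor, which is why a single broadcast per bucket is enough.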

Results

Partial weight sync on 4xL40S (4 layers only)

Before and after: [screenshots not preserved]

Full weight sync on 8xH100 (4 inference GPUs, 4 training GPUs) with Moonlight-16B-A3B

Before and after: [screenshots not preserved]

Test

uv run --isolated --extra dev --extra megatron -- pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_megatron_worker.py::test_megatron_policy_weight_sync -k non_colocated

@erictang000 erictang000 requested a review from SumanthRH March 14, 2026 17:53
SumanthRH (Member) left a comment

Nice

SumanthRH merged commit 84fea6f into NovaSky-AI:main on Mar 14, 2026
5 of 6 checks passed