[megatron] enable bucketed weight sync for non-colocated nccl weight sync in megatron#1324
Merged
SumanthRH merged 6 commits into NovaSky-AI:main on Mar 14, 2026
Conversation
Enable packed broadcast for non-colocated Megatron weight sync
Bucketing was previously only enabled for the CUDA IPC strategy (colocated mode). This extends it to the broadcast strategy (non-colocated mode), packing all tensors in each bucket into a single contiguous buffer and broadcasting it in one NCCL operation — matching how CUDA IPC already works. This reduces both per-tensor NCCL overhead and HTTP round-trips, which matters most for MoE models with many small expert parameters.
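The pack-and-unpack flow described above can be sketched as follows. This is a minimal illustration using NumPy arrays in place of CUDA tensors; the helper names (pack_bucket, unpack_bucket) are hypothetical, and the actual PR performs the transfer with a single NCCL broadcast per bucket rather than the in-process roundtrip shown here.

```python
import numpy as np

def pack_bucket(tensors):
    """Flatten each tensor in a bucket and concatenate into one
    contiguous buffer. Returns the buffer plus the per-tensor shapes
    and element counts the receiver needs to unpack (the role of the
    `sizes` field added to BroadcastWeightUpdateRequest)."""
    shapes = [t.shape for t in tensors]
    sizes = [t.size for t in tensors]
    buffer = np.concatenate([t.ravel() for t in tensors])
    return buffer, shapes, sizes

def unpack_bucket(buffer, shapes, sizes):
    """Split a received buffer back into individual tensors by
    walking the recorded sizes and restoring each original shape."""
    tensors, offset = [], 0
    for shape, size in zip(shapes, sizes):
        tensors.append(buffer[offset:offset + size].reshape(shape))
        offset += size
    return tensors

# Roundtrip: sender packs, one broadcast would ship `buf`, receiver unpacks.
originals = [np.arange(6, dtype=np.float32).reshape(2, 3),
             np.ones(4, dtype=np.float32)]
buf, shapes, sizes = pack_bucket(originals)
restored = unpack_bucket(buf, shapes, sizes)
assert all(np.array_equal(a, b) for a, b in zip(originals, restored))
```

The key property is that only one collective operation is issued per bucket, so per-tensor launch overhead is amortized across all the (often small) expert parameters in the bucket.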
Changes
- broadcast_strategy.py: Add a sizes field to BroadcastWeightUpdateRequest; the sender packs tensors into one buffer and broadcasts once per bucket; the receiver unpacks using sizes
- megatron_worker.py: Always enable bucketing, removing the CudaIpcTransferStrategy-only guard
- test_megatron_worker.py: Add a non_colocated_moe test entry for Moonlight-16B-A3B-Instruct

Results
Partial weight sync on 4xL40S (4 layers only)
Before: (timing screenshot)

After: (timing screenshot)
Full weight sync on 8xH100 (4 inf, 4 train) with Moonlight-16B-A3B
Before: (timing screenshot)

After: (timing screenshot)
Test