[Dev] Support delayed wgrad compute overlap with P2P backward#4268

Draft
Wohox wants to merge 4 commits into NVIDIA:dev from Wohox:pingtian/ep_overlapping_p2p_hiding

Conversation

Contributor

@Wohox Wohox commented Apr 13, 2026

What does this PR do?

Improve P2P Communication Overlap in Fine-Grained 1F1B Schedule

Problem

In the interleaved 1F1B pipeline schedule with fine-grained layer scheduling (model_chunk_schedule_plan.py), P2P communication latency is exposed due to two issues:

  1. Backward P2P runs on the computation stream. Unlike forward P2P (post_forward), which is placed on the communication stream, backward P2P (post_backward) runs on the computation stream. This prevents backward P2P from overlapping with the next VPP stage's forward compute.

  2. Insufficient wgrad compute to hide P2P latency. Only the last backward layer's attention wgrad is deferred until after P2P. When this single GEMM is not long enough to cover the P2P communication time, the remaining latency is exposed.

Solution

Three independently controllable flags are introduced; they are orthogonal and can be freely combined:

  • --overlap-p2p-backward-on-comm-stream (bool, default False): Places post_backward on the communication stream (mirroring post_forward), so backward P2P and computation can run in parallel on separate CUDA streams.

  • --overlap-p2p-wgrad-delayed-layer-number N (int, default 1): Controls how many of the last backward layers defer their weight gradient computation to after post_backward. Increasing N accumulates more wgrad GEMMs to overlap with P2P. Default of 1 preserves the original behavior.

  • --overlap-p2p-with-transformer-layer-wgrad (bool, default False): When enabled, deferred layers delay both attention and MLP weight gradients (instead of attention only), providing more compute to hide P2P latency.
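To make the interaction of the last two flags concrete, here is a small hypothetical sketch (not Megatron-LM's actual API; `backward_order` and the op labels are invented for illustration) that emits the op sequence of one chunk's backward pass, showing which wgrad GEMMs get deferred until after the post-backward P2P:

```python
# Hypothetical sketch of the delayed-wgrad ordering. "dgrad" = activation-
# gradient compute, "wgrad" = weight-gradient compute. Layers run in reverse
# during backward, so the "last backward layers" are the lowest-index ones.

def backward_order(num_layers, delayed_layers=1, delay_full_layer=False):
    """Return the op sequence for one chunk's backward pass.

    delayed_layers   -- models --overlap-p2p-wgrad-delayed-layer-number
    delay_full_layer -- models --overlap-p2p-with-transformer-layer-wgrad
    """
    order, deferred = [], []
    for layer in reversed(range(num_layers)):
        order.append(f"L{layer}.dgrad")
        if layer < delayed_layers:  # among the last-executed backward layers
            deferred.append(f"L{layer}.attn_wgrad")
            if delay_full_layer:
                deferred.append(f"L{layer}.mlp_wgrad")
            else:
                order.append(f"L{layer}.mlp_wgrad")
        else:
            order.append(f"L{layer}.attn_wgrad")
            order.append(f"L{layer}.mlp_wgrad")
    order.append("post_backward_p2p")  # P2P send/recv issued here
    order.extend(deferred)             # deferred wgrads overlap the P2P
    return order
```

With the defaults (`delayed_layers=1`, `delay_full_layer=False`) only the final backward layer's attention wgrad lands after `post_backward_p2p`, matching the original behavior; raising `delayed_layers` or enabling `delay_full_layer` stacks more GEMMs after the P2P launch, giving it more compute to hide behind.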

Usage

Flags can be combined freely. Examples:

# Move backward P2P to comm stream only
--overlap-p2p-backward-on-comm-stream

# Move backward P2P to comm stream + defer attn+mlp wgrad of last 2 layers
--overlap-p2p-backward-on-comm-stream \
--overlap-p2p-wgrad-delayed-layer-number 2 \
--overlap-p2p-with-transformer-layer-wgrad

# Defer attn wgrad of last 3 layers (without moving P2P to comm stream)
--overlap-p2p-wgrad-delayed-layer-number 3

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
