[Dev] Support delayed wgrad compute overlap with P2P backward#4268

Draft
Wohox wants to merge 4 commits into NVIDIA:dev from Wohox:pingtian/ep_overlapping_p2p_hiding

Conversation

Contributor

@Wohox Wohox commented Apr 13, 2026

What does this PR do?

Improve P2P Communication Overlap in Fine-Grained 1F1B Schedule

Problem

In the interleaved 1F1B pipeline schedule with fine-grained layer scheduling (model_chunk_schedule_plan.py), P2P communication latency is exposed due to two issues:

  1. Backward P2P runs on the computation stream. Unlike forward P2P (post_forward), which is placed on the communication stream, backward P2P (post_backward) runs on the computation stream. This prevents backward P2P from overlapping with the next VPP stage's forward compute.

  2. Insufficient wgrad compute to hide P2P latency. Only the last backward layer's attention wgrad is deferred until after P2P. When this single GEMM is not long enough to cover the P2P communication time, the remaining latency is exposed.

Solution

Three independently controllable flags are introduced; they are orthogonal and can be freely combined:

  • --overlap-p2p-backward-on-comm-stream (bool, default False): Places post_backward on the communication stream (mirroring post_forward), so backward P2P and computation can run in parallel on separate CUDA streams.

  • --overlap-p2p-wgrad-delayed-layer-number N (int, default 1): Controls how many of the last backward layers defer their weight gradient computation to after post_backward. Increasing N accumulates more wgrad GEMMs to overlap with P2P. Default of 1 preserves the original behavior.

  • --overlap-p2p-with-transformer-layer-wgrad (bool, default False): When enabled, deferred layers delay both attention and MLP weight gradients (instead of attention only), providing more compute to hide P2P latency.
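To make the interaction of the last two flags concrete, here is a small hypothetical sketch (not Megatron-LM's actual API; `backward_order` and the op labels are invented for illustration) that emits the op sequence of one chunk's backward pass, showing which wgrad GEMMs get deferred until after the post-backward P2P:

```python
# Hypothetical sketch of the delayed-wgrad ordering. "dgrad" = activation-
# gradient compute, "wgrad" = weight-gradient compute. Layers run in reverse
# during backward, so the "last backward layers" are the lowest-index ones.

def backward_order(num_layers, delayed_layers=1, delay_full_layer=False):
    """Return the op sequence for one chunk's backward pass.

    delayed_layers   -- models --overlap-p2p-wgrad-delayed-layer-number
    delay_full_layer -- models --overlap-p2p-with-transformer-layer-wgrad
    """
    order, deferred = [], []
    for layer in reversed(range(num_layers)):
        order.append(f"L{layer}.dgrad")
        if layer < delayed_layers:  # among the last-executed backward layers
            deferred.append(f"L{layer}.attn_wgrad")
            if delay_full_layer:
                deferred.append(f"L{layer}.mlp_wgrad")
            else:
                order.append(f"L{layer}.mlp_wgrad")
        else:
            order.append(f"L{layer}.attn_wgrad")
            order.append(f"L{layer}.mlp_wgrad")
    order.append("post_backward_p2p")  # P2P send/recv issued here
    order.extend(deferred)             # deferred wgrads overlap the P2P
    return order
```

With the defaults (`delayed_layers=1`, `delay_full_layer=False`) only the final backward layer's attention wgrad lands after `post_backward_p2p`, matching the original behavior; raising `delayed_layers` or enabling `delay_full_layer` stacks more GEMMs after the P2P launch, giving it more compute to hide behind.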

Usage

Flags can be combined freely. Examples:

# Move backward P2P to comm stream only
--overlap-p2p-backward-on-comm-stream

# Move backward P2P to comm stream + defer attn+mlp wgrad of last 2 layers
--overlap-p2p-backward-on-comm-stream \
--overlap-p2p-wgrad-delayed-layer-number 2 \
--overlap-p2p-with-transformer-layer-wgrad

# Defer attn wgrad of last 3 layers (without moving P2P to comm stream)
--overlap-p2p-wgrad-delayed-layer-number 3

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
