
Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining#4276

Open
joapolarbear wants to merge 2 commits into NVIDIA:main from joapolarbear:fix/interleaved-pp-overlap-moe-assert

Conversation

@joapolarbear

Description

Fixes an incorrect assertion in the `combined_1f1b_schedule_for_interleaved_pipelining` function at line 237.

Issue

The assertion at lines 237-238 checks:

if input_tensor is not None:
    assert input_tensor_grad is not None

However, this is logically incorrect: `input_tensor_grad` is the backward-pass output that corresponds to `b_input_tensor` (the backward input tensor), not to `input_tensor` (the forward input tensor).

Root Cause Analysis

  • input_tensor is set when f_model is not None (forward model exists) at lines 186-192
  • b_input_tensor is set when b_model is not None (backward model exists) at lines 198-202
  • input_tensor_grad is computed based on b_input_tensor at lines 436-447
  • These two paths are independent, so the assertion should validate the backward path
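The independence of the two paths can be illustrated with a hypothetical, heavily simplified sketch (this is not the Megatron-LM source; `backward_step` and the tensor arguments are stand-ins for the real communication and autograd calls):

```python
# Hypothetical sketch of the two independent paths described above.
# `f_model` and `b_model` may be None independently, so `input_tensor`
# and `b_input_tensor` are populated independently as well.

def backward_step(saved_activation):
    # Stub: the real code runs autograd and returns the grad w.r.t. the input.
    return f"grad_of({saved_activation})"

def combined_step(f_model, b_model, fwd_activation, saved_activation):
    input_tensor = None
    b_input_tensor = None

    # Forward path: set only when a forward model chunk exists.
    if f_model is not None:
        input_tensor = fwd_activation

    # Backward path: set only when a backward model chunk exists.
    if b_model is not None:
        b_input_tensor = saved_activation

    # input_tensor_grad is derived from the backward path only ...
    input_tensor_grad = None
    if b_input_tensor is not None:
        input_tensor_grad = backward_step(b_input_tensor)

    # ... so the assertion must be conditioned on b_input_tensor,
    # not on input_tensor from the independent forward path.
    if b_input_tensor is not None:
        assert input_tensor_grad is not None
    return input_tensor, input_tensor_grad
```

With forward present but backward absent (the failing scenario under the old assertion), `input_tensor` is set while `input_tensor_grad` is legitimately `None`, and the corrected condition does not fire.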

Fix

Change lines 237-238 from:

if input_tensor is not None:
    assert input_tensor_grad is not None

to:

if b_input_tensor is not None:
    assert input_tensor_grad is not None

This ensures the assertion correctly validates that input_tensor_grad is not None when the backward input tensor exists, matching the actual data flow.

Type of change

  • Bug fix

Testing

The fix corrects the logical condition to match the data flow in the combined forward-backward step computation; the original failure reproduces with PP=2 VP=2 EP=4 TP=1 and --overlap-moe-expert-parallel-comm.

@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@joapolarbear changed the title from "ADLR/megatron-lm - Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining" to "Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining" Apr 13, 2026
@joapolarbear joapolarbear marked this pull request as ready for review April 13, 2026 13:34
@joapolarbear joapolarbear requested review from a team as code owners April 13, 2026 13:34
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 13, 2026 13:34
…ning

The assert checked forward microbatch's `input_tensor` against backward
microbatch's `input_tensor_grad`. In interleaved PP with VP>1, backward
has chunk reversal (`model_chunk_id = num_chunks - id - 1`), so forward
and backward are always on different VP stages in steady-state. This
causes false assertion failures when forward is on a non-first stage
(input_tensor != None) but backward is on a first stage
(input_tensor_grad == None).

Fix: check `b_input_tensor is not None` instead, which directly
corresponds to whether the backward microbatch received activation from
upstream and thus should produce input_tensor_grad.

Reproduction: PP=2 VP=2 EP=4 TP=1 with --overlap-moe-expert-parallel-comm
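The chunk-reversal pattern cited in the commit message can be sketched in a few lines (a hypothetical illustration of the mirrored indexing, not the scheduler itself):

```python
# With VP > 1 the backward chunk index is mirrored
# (model_chunk_id = num_chunks - id - 1), so in steady state the forward
# and backward halves of a combined step sit on different virtual stages.

def backward_chunk_id(forward_chunk_id, num_chunks):
    # Mirrored indexing used for the backward pass.
    return num_chunks - forward_chunk_id - 1

num_chunks = 2  # VP=2, as in the reproduction config
for fwd_id in range(num_chunks):
    bwd_id = backward_chunk_id(fwd_id, num_chunks)
    # Forward chunk 1 (non-first stage: input_tensor is set) pairs with
    # backward chunk 0 (first stage: input_tensor_grad is None), which is
    # exactly the combination that tripped the old assertion.
    assert fwd_id != bwd_id
```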
@joapolarbear joapolarbear force-pushed the fix/interleaved-pp-overlap-moe-assert branch from a65f38a to 8c8df83 Compare April 13, 2026 13:36
@yaox12 yaox12 requested a review from Wohox April 15, 2026 02:04
Contributor

@Wohox Wohox left a comment


LGTM, nice catch, thanks!

@Phlip79
Member

Phlip79 commented Apr 15, 2026

/ok to test 4c8eb93

@Phlip79
Member

Phlip79 commented Apr 15, 2026

/claude review

Contributor

@claude claude bot left a comment


LGTM

