
Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining#4276

Open
joapolarbear wants to merge 2 commits into NVIDIA:main from joapolarbear:fix/interleaved-pp-overlap-moe-assert

Conversation

@joapolarbear

Description

Fixes an incorrect assertion in the `combined_1f1b_schedule_for_interleaved_pipelining` function at line 237.

Issue

The assertion at lines 237-238 checks:

if input_tensor is not None:
    assert input_tensor_grad is not None

However, this is logically incorrect: `input_tensor_grad` is the backward-pass output that corresponds to `b_input_tensor` (the backward input tensor), not to `input_tensor` (the forward input tensor).

Root Cause Analysis

  • input_tensor is set when f_model is not None (forward model exists) at lines 186-192
  • b_input_tensor is set when b_model is not None (backward model exists) at lines 198-202
  • input_tensor_grad is computed based on b_input_tensor at lines 436-447
  • These two paths are independent, so the assertion should validate the backward path
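The independence of the two paths can be illustrated with a hypothetical, heavily simplified sketch (this is not the Megatron-LM source; `backward_step` and the tensor arguments are stand-ins for the real communication and autograd calls):

```python
# Hypothetical sketch of the two independent paths described above.
# `f_model` and `b_model` may be None independently, so `input_tensor`
# and `b_input_tensor` are populated independently as well.

def backward_step(saved_activation):
    # Stub: the real code runs autograd and returns the grad w.r.t. the input.
    return f"grad_of({saved_activation})"

def combined_step(f_model, b_model, fwd_activation, saved_activation):
    input_tensor = None
    b_input_tensor = None

    # Forward path: set only when a forward model chunk exists.
    if f_model is not None:
        input_tensor = fwd_activation

    # Backward path: set only when a backward model chunk exists.
    if b_model is not None:
        b_input_tensor = saved_activation

    # input_tensor_grad is derived from the backward path only ...
    input_tensor_grad = None
    if b_input_tensor is not None:
        input_tensor_grad = backward_step(b_input_tensor)

    # ... so the assertion must be conditioned on b_input_tensor,
    # not on input_tensor from the independent forward path.
    if b_input_tensor is not None:
        assert input_tensor_grad is not None
    return input_tensor, input_tensor_grad
```

With forward present but backward absent (the failing scenario under the old assertion), `input_tensor` is set while `input_tensor_grad` is legitimately `None`, and the corrected condition does not fire.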

Fix

Change lines 237-238 from:

if input_tensor is not None:
    assert input_tensor_grad is not None

to:

if b_input_tensor is not None:
    assert input_tensor_grad is not None

This ensures the assertion correctly validates that input_tensor_grad is not None when the backward input tensor exists, matching the actual data flow.

Type of change

  • Bug fix

Testing

The fix corrects the logical condition to match the data flow in the combined forward-backward step computation; the original failure reproduces with PP=2 VP=2 EP=4 TP=1 and --overlap-moe-expert-parallel-comm.

@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@joapolarbear changed the title from "ADLR/megatron-lm - Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining" to "Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining" Apr 13, 2026
@joapolarbear joapolarbear marked this pull request as ready for review April 13, 2026 13:34
@joapolarbear joapolarbear requested review from a team as code owners April 13, 2026 13:34
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 13, 2026 13:34
…ning

The assert checked forward microbatch's `input_tensor` against backward
microbatch's `input_tensor_grad`. In interleaved PP with VP>1, backward
has chunk reversal (`model_chunk_id = num_chunks - id - 1`), so forward
and backward are always on different VP stages in steady-state. This
causes false assertion failures when forward is on a non-first stage
(input_tensor != None) but backward is on a first stage
(input_tensor_grad == None).

Fix: check `b_input_tensor is not None` instead, which directly
corresponds to whether the backward microbatch received activation from
upstream and thus should produce input_tensor_grad.

Reproduction: PP=2 VP=2 EP=4 TP=1 with --overlap-moe-expert-parallel-comm
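The chunk-reversal pattern cited in the commit message can be sketched in a few lines (a hypothetical illustration of the mirrored indexing, not the scheduler itself):

```python
# With VP > 1 the backward chunk index is mirrored
# (model_chunk_id = num_chunks - id - 1), so in steady state the forward
# and backward halves of a combined step sit on different virtual stages.

def backward_chunk_id(forward_chunk_id, num_chunks):
    # Mirrored indexing used for the backward pass.
    return num_chunks - forward_chunk_id - 1

num_chunks = 2  # VP=2, as in the reproduction config
for fwd_id in range(num_chunks):
    bwd_id = backward_chunk_id(fwd_id, num_chunks)
    # Forward chunk 1 (non-first stage: input_tensor is set) pairs with
    # backward chunk 0 (first stage: input_tensor_grad is None), which is
    # exactly the combination that tripped the old assertion.
    assert fwd_id != bwd_id
```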
@joapolarbear joapolarbear force-pushed the fix/interleaved-pp-overlap-moe-assert branch from a65f38a to 8c8df83 Compare April 13, 2026 13:36
@yaox12 yaox12 requested a review from Wohox April 15, 2026 02:04
Contributor

@Wohox Wohox left a comment


LGTM, nice catch, thanks!

@Phlip79
Member

Phlip79 commented Apr 15, 2026

/ok to test 4c8eb93

@Phlip79
Member

Phlip79 commented Apr 15, 2026

/claude review

Contributor

@claude claude bot left a comment


LGTM

