Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. See the contribution guide for more details.
/ok to test fe41baf
|
Same comment, more important for main branch PR: #4273 (comment)
megatron/training/checkpointing.py
Force-pushed from fe41baf to 55dccd6
/ok to test 55dccd6
Guard that unwrap_model correctly peels through both DDP and Megatron-FSDP wrapping hierarchies to reach the underlying GPTModel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from dcac924 to 3132d62
|
Hi @cspades. I've checked all the usages of … For a double check, here is Claude's review opinion.
|
/ok to test db7e4f1
Force-pushed from db7e4f1 to 3132d62
|
/ok to test 1a64947
|
@wplf cc @shjwudp What error or bug does this fix again? I'm still looking, but if there is any situation where the unwrapped, sharded model is computed, it could maybe dodge this wrapper-module logic (… that points the model to the un-sharded parameter data that it needs for a forward pass). Another example: we already address the … Could I get some visibility on what bug this fixes?
|
Currently Megatron-LM itself shouldn't hit this corner case; it appears in our latest code, which we're writing but haven't merged yet. Most importantly, we expect the unwrap_model function to return the original model. This holds when the model is wrapped with DDP, but it is wrong for Megatron-FSDP, which is very strange. If there are any calls to unwrap_model on a Megatron-FSDP model that expect it to return the …
What does this PR do?
dev PR: #4273
Problem: unwrap_model() in megatron/core/utils.py stops one wrapper layer early when unwrapping a model wrapped with Megatron-FSDP. The wrapping hierarchy is:
FullyShardedDataParallel (mcore adapter)
└── .module → MegatronFSDP (core FSDP impl)
└── .module → actual model (e.g., GPTModel)
The old code only knew how to peel through DDP, torch_FSDP, megatron_FSDP (the adapter), and Float16Module. It would unwrap the outer FullyShardedDataParallel but then hit the inner MegatronFSDP and stop — returning MegatronFSDP instead of the actual model.
Fix: a one-line change that adds MegatronFSDP (from megatron.core.distributed.fsdp.src.megatron_fsdp.megatron_fsdp) to the default module_instances tuple, so the while isinstance(...) loop can peel through both wrapper layers.