Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. See the contribution guide for more details.
/ok to test fe41baf
|
Same comment, more important for main branch PR: #4273 (comment)
megatron/training/checkpointing.py
Force-pushed from fe41baf to 55dccd6
/ok to test 55dccd6
Guard that unwrap_model correctly peels through both DDP and Megatron-FSDP wrapping hierarchies to reach the underlying GPTModel. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from dcac924 to 3132d62
|
Hi @cspades. I've checked all the usages of … For a double check, here is Claude's review opinion.
|
/ok to test db7e4f1
Force-pushed from db7e4f1 to 3132d62
|
/ok to test 1a64947
|
@wplf cc @shjwudp What error or bug does this fix again? I'm still looking, but if there is any situation where the unwrapped, sharded model is computed, it could maybe dodge this wrapper-module logic (… that points the model to the un-sharded parameter data that it needs for a forward pass). Another example: we already address the … Could I get some visibility on what bug this fixes?
|
Currently Megatron-LM itself shouldn't hit this corner case; it appears in our latest code, which we're writing but haven't merged yet. Most importantly, we expect the unwrap_model function to return the original model. This holds when the model is wrapped with DDP, but it is wrong for Megatron-FSDP, which is very strange. If there are any calls to unwrap_model on a Megatron-FSDP model that expect it to return the …
What does this PR do?
dev PR: #4273
Problem: unwrap_model() in megatron/core/utils.py stops one wrapper layer early when unwrapping a model wrapped with Megatron-FSDP. The wrapping hierarchy is:
FullyShardedDataParallel (mcore adapter)
└── .module → MegatronFSDP (core FSDP impl)
└── .module → actual model (e.g., GPTModel)
The old code only knew how to peel through DDP, torch_FSDP, megatron_FSDP (the adapter), and Float16Module. It would unwrap the outer FullyShardedDataParallel but then hit the inner MegatronFSDP and stop — returning MegatronFSDP instead of the actual model.
Fix: a one-line change that adds MegatronFSDP (from megatron.core.distributed.fsdp.src.megatron_fsdp.megatron_fsdp) to the default module_instances tuple, so the while isinstance(...) loop can peel through both wrapper layers.