
Refactor megatron to mcore_bridge #134

Open
tastelikefeet wants to merge 17 commits into modelscope:main from tastelikefeet:feat/mbridge

Conversation

@tastelikefeet
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Megatron-Core integration by offloading model configuration, creation, and weight loading logic to the mcore_bridge dependency, which allows for the removal of significant internal boilerplate code. However, the review highlights several critical issues introduced during the refactoring: the removal of the _BASE_LAYER_SUFFIXES constant and the self.hf_config attribute will lead to runtime errors since they are still referenced in the codebase. Furthermore, the send_weights method contains a NameError due to the use of an undefined args variable, which should be replaced with values from the strategy configuration.

```diff
 args = get_args()
-org_vocab_size = getattr(self.hf_config, 'vocab_size', args.padded_vocab_size)
+org_vocab_size = getattr(self.hf_config, 'vocab_size', self.strategy.config.padded_vocab_size)
 _padded_vocab_size = args.padded_vocab_size
```
Contributor


critical

The args variable is used here but is no longer defined in this scope after the refactoring. This will cause a NameError. The padded_vocab_size should be retrieved from self.strategy.config.

Suggested change

```diff
-_padded_vocab_size = args.padded_vocab_size
+_padded_vocab_size = self.strategy.config.padded_vocab_size
```
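The `getattr` fallback pattern the suggestion relies on can be illustrated with a minimal stand-in; the `SimpleNamespace` objects and the `151936` value below are placeholders for the real `hf_config` and strategy config, not the project's actual API:

```python
from types import SimpleNamespace

# Placeholder objects (assumption: after the refactor, strategy.config
# carries padded_vocab_size, as the review suggests).
hf_config = SimpleNamespace()  # deliberately missing 'vocab_size'
strategy = SimpleNamespace(config=SimpleNamespace(padded_vocab_size=151936))

# getattr returns the third argument when the attribute is absent,
# so the padded size is used as the fallback vocab size.
org_vocab_size = getattr(hf_config, 'vocab_size', strategy.config.padded_vocab_size)
print(org_vocab_size)  # 151936
```

Pulling the value from the strategy config keeps the method self-contained once the global `args` object is no longer available in scope.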

@tastelikefeet
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Megatron-Core integration by delegating model configuration and creation to a new MegatronStrategy class and the mcore_bridge library, while removing the internal args.py and the TorchSampler module. The review identifies a critical AttributeError in multi_lora.py where get_target_modules is called on the wrong class. Additionally, the review highlights that the refactored _add_base_layer_suffix is now too generic and may incorrectly modify non-LoRA layer names, and that MultiLoraMegatron initializes parameter configurations without an optimizer, which could prevent the setup of training-specific features like gradient reduction overlap.

Comment on lines +1401 to +1409

```python
def _add_base_layer_suffix(params):
    _BASE_LAYER_SUFFIXES = ['weight', 'bias']
    for name, param in params:
        for suffix in _BASE_LAYER_SUFFIXES:
            if name.endswith(suffix):
                attr = suffix.rsplit('.', 1)[-1]  # 'weight' or 'bias'
                name = f'{name[:-len(attr)]}base_layer.{attr}'
                break
        yield name, param
```
Contributor


high

The implementation of _add_base_layer_suffix has become overly generic. It now adds a .base_layer. prefix to any parameter name ending in weight or bias. The previous implementation correctly used a specific list of suffixes for layers that are typically targeted by LoRA. This change might incorrectly modify parameter names for non-LoRA layers (e.g., model.norm.weight), which could lead to weight loading failures in downstream systems like vLLM when LoRA is enabled. It is recommended to revert to a more specific list of target suffixes to ensure only LoRA-adapted layers are modified.
