Support DPO #130

Merged
tastelikefeet merged 31 commits into modelscope:main from tastelikefeet:feat/dpo
Mar 30, 2026

Conversation

Collaborator

@tastelikefeet tastelikefeet commented Mar 27, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

  1. Support DPO
  2. Fix the incorrect collect argument of the forward decorator
  3. Remove the unused extend_message key in Trajectory
  4. Fix incorrect do_sync behavior with gradient accumulation steps
  5. Reorganize optimizer_group
  6. Fix incorrect collect logic for tp/pp/cp
  7. Fix forward_only not working
  8. Fix incorrect sharding of micro_batch kwargs
  9. Fix incorrect gradient division
  10. Fix train/eval sharing the same outputs and loss fields
  11. Fix the megatron optimizer raising duplicate input arguments
  12. Fix the lazy dataset losing original dataset fields
  13. Support multi-threaded tokenization
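
For context, the DPO objective added in item 1 reduces, per preference pair, to a logistic loss on the difference between the policy's and the reference model's chosen-vs-rejected log-probability margins. A minimal numeric sketch in plain Python (illustrative only; the function name and beta default follow the DPO paper, not necessarily this repo's API):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (Rafailov et al., 2023) on summed sequence
    log-probabilities: -log sigmoid(beta * (policy margin - ref margin))."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    margin = beta * (pi_logratio - ref_logratio)
    # log1p(exp(-x)) is a numerically safer form of -log(sigmoid(x))
    return math.log1p(math.exp(-margin))
```

When policy and reference agree (equal margins), the loss sits at log 2; it falls as the policy ranks the chosen response further above the rejected one relative to the reference.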

Experiment results

Paste your experiment results here (if needed).

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Direct Preference Optimization (DPO) and its variants (SimPO, ORPO, CPO) to the Twinkle framework, adding new loss functions, specialized data preprocessors, and a Ray-based training recipe. The Trajectory data format was updated to include user_data, and the template encoding logic was enhanced with parallel processing. Feedback identifies critical issues such as a type mismatch in the template encoding return value and a hardcoded parameter that breaks reference-free loss modes. Additionally, logical errors in conversation parsing and multiple inconsistencies between documentation and implementation regarding default values, configurable keys, and supported loss types were noted, along with opportunities to improve the robustness of message role parsing and preprocessor outputs.
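
Of the variants listed, SimPO is the reference-free one (the mode affected by the hardcoded-parameter issue flagged above): it scores length-averaged log-probabilities against a fixed target margin instead of querying a reference model. A hedged sketch, with names and defaults taken from the SimPO paper rather than this codebase:

```python
import math

def simpo_loss(avg_chosen_logp, avg_rejected_logp, beta=2.0, gamma=0.5):
    """Reference-free SimPO loss (Meng et al., 2024): a logistic loss on
    the length-normalized log-prob margin, shifted by a target margin gamma.
    No reference log-probs appear anywhere in the computation."""
    margin = beta * (avg_chosen_logp - avg_rejected_logp) - gamma
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin)
```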

@tastelikefeet
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive support for Direct Preference Optimization (DPO) and its variants, including SimPO, ORPO, and CPO, across both Transformers and Megatron backends. Key changes include the addition of dedicated training scripts for full-parameter and LoRA-based DPO, new loss functions, and specialized metrics for preference alignment. The core architecture was refactored to unify optimizer state management through a new BaseOptimizerGroup class, and the Trajectory data format was updated to include user-defined data fields. Review feedback highlighted critical issues regarding hardcoded parameters in the DPO training script and incorrect logic in the gradient synchronization for accumulation steps, alongside a recommendation to reduce logging frequency in the FSDP2 cookbook.
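
On the gradient-accumulation point: the usual pattern is to divide each micro-batch loss by the window size (so accumulated gradients average rather than sum) and synchronize gradients only on the window's last micro-step. A generic sketch of that bookkeeping, not this repo's do_sync implementation:

```python
def accumulation_schedule(micro_losses, gradient_accumulation_steps):
    """Return (scaled_losses, sync_flags) for one accumulation window.

    Each loss is divided by the window size; the sync flag is True only
    on the final micro-step, where gradients should be all-reduced and
    the optimizer step applied.
    """
    scaled, sync = [], []
    for step, loss in enumerate(micro_losses):
        scaled.append(loss / gradient_accumulation_steps)
        sync.append((step + 1) % gradient_accumulation_steps == 0)
    return scaled, sync
```

With gradient_accumulation_steps=1 this degenerates to syncing on every step, which is the behavior a correctness fix here has to preserve.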

@tastelikefeet tastelikefeet merged commit a6ad6fe into modelscope:main Mar 30, 2026
1 of 3 checks passed
