Conversation
Code Review
This pull request introduces Direct Preference Optimization (DPO) and its variants (SimPO, ORPO, CPO) to the Twinkle framework, adding new loss functions, specialized data preprocessors, and a Ray-based training recipe. The Trajectory data format was updated to include user_data, and the template encoding logic was enhanced with parallel processing. Feedback identifies critical issues such as a type mismatch in the template encoding return value and a hardcoded parameter that breaks reference-free loss modes. Additionally, logical errors in conversation parsing and multiple inconsistencies between documentation and implementation regarding default values, configurable keys, and supported loss types were noted, along with opportunities to improve the robustness of message role parsing and preprocessor outputs.
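The summary above flags a hardcoded parameter that breaks reference-free loss modes. To make the issue concrete, here is a minimal sketch of a pairwise preference loss in the DPO family: with reference log-probabilities it is standard DPO, and with them absent it degrades to a reference-free variant (as in CPO/SimPO-style losses). The function name and signature are illustrative, not the Twinkle API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen=None, ref_logp_rejected=None, beta=0.1):
    """Pairwise preference loss (hypothetical sketch, not the actual code).

    When reference log-probs are supplied this is standard DPO:
        -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    When they are None, the reference terms vanish and the loss becomes
    reference-free, which is why hardcoding a reference-related flag
    silently changes which objective is optimized.
    """
    ref_c = ref_logp_chosen if ref_logp_chosen is not None else 0.0
    ref_r = ref_logp_rejected if ref_logp_rejected is not None else 0.0
    margin = (logp_chosen - ref_c) - (logp_rejected - ref_r)
    return -math.log(sigmoid(beta * margin))
```

With asymmetric reference log-probs the two modes produce different losses for the same policy log-probs, so a flag that always forces one mode corrupts the other.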
/gemini review
Code Review
This pull request introduces comprehensive support for Direct Preference Optimization (DPO) and its variants, including SimPO, ORPO, and CPO, across both Transformers and Megatron backends. Key changes include the addition of dedicated training scripts for full-parameter and LoRA-based DPO, new loss functions, and specialized metrics for preference alignment. The core architecture was refactored to unify optimizer state management through a new BaseOptimizerGroup class, and the Trajectory data format was updated to include user-defined data fields. Review feedback highlighted critical issues regarding hardcoded parameters in the DPO training script and incorrect logic in the gradient synchronization for accumulation steps, alongside a recommendation to reduce logging frequency in the FSDP2 cookbook.
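The second review mentions incorrect logic in gradient synchronization for accumulation steps. A common shape of this bug is an off-by-one in deciding which micro-batch triggers the cross-rank all-reduce. The helper below is a hypothetical sketch of the intended predicate, not the framework's actual code: gradients should be synchronized only on the last micro-step of each accumulation window.

```python
def should_sync(micro_step, accumulation_steps):
    """Return True only on the last micro-batch of an accumulation window.

    micro_step is zero-based. Comparing `micro_step % accumulation_steps == 0`
    instead (a typical off-by-one) would sync on the *first* micro-batch,
    all-reducing partially accumulated gradients one step too early.
    """
    return (micro_step + 1) % accumulation_steps == 0
```

In PyTorch DDP terms, micro-steps where this returns False would typically run under the `no_sync()` context manager to skip the all-reduce.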
PR type
PR information
Experiment results