Conversation
Code Review
This pull request introduces Direct Preference Optimization (DPO) and its variants (SimPO, ORPO, CPO) to the Twinkle framework, adding new loss functions, specialized data preprocessors, and a Ray-based training recipe. The Trajectory data format was updated to include user_data, and the template encoding logic was enhanced with parallel processing. Feedback identifies critical issues such as a type mismatch in the template encoding return value and a hardcoded parameter that breaks reference-free loss modes. Additionally, logical errors in conversation parsing and multiple inconsistencies between documentation and implementation regarding default values, configurable keys, and supported loss types were noted, along with opportunities to improve the robustness of message role parsing and preprocessor outputs.
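The summary above flags a hardcoded parameter that breaks reference-free loss modes. To make the issue concrete, here is a minimal sketch of a pairwise preference loss in the DPO family: with reference log-probabilities it is standard DPO, and with them absent it degrades to a reference-free variant (as in CPO/SimPO-style losses). The function name and signature are illustrative, not the Twinkle API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen=None, ref_logp_rejected=None, beta=0.1):
    """Pairwise preference loss (hypothetical sketch, not the actual code).

    When reference log-probs are supplied this is standard DPO:
        -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    When they are None, the reference terms vanish and the loss becomes
    reference-free, which is why hardcoding a reference-related flag
    silently changes which objective is optimized.
    """
    ref_c = ref_logp_chosen if ref_logp_chosen is not None else 0.0
    ref_r = ref_logp_rejected if ref_logp_rejected is not None else 0.0
    margin = (logp_chosen - ref_c) - (logp_rejected - ref_r)
    return -math.log(sigmoid(beta * margin))
```

With asymmetric reference log-probs the two modes produce different losses for the same policy log-probs, so a flag that always forces one mode corrupts the other.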
/gemini review
Code Review
This pull request introduces comprehensive support for Direct Preference Optimization (DPO) and its variants, including SimPO, ORPO, and CPO, across both Transformers and Megatron backends. Key changes include the addition of dedicated training scripts for full-parameter and LoRA-based DPO, new loss functions, and specialized metrics for preference alignment. The core architecture was refactored to unify optimizer state management through a new BaseOptimizerGroup class, and the Trajectory data format was updated to include user-defined data fields. Review feedback highlighted critical issues regarding hardcoded parameters in the DPO training script and incorrect logic in the gradient synchronization for accumulation steps, alongside a recommendation to reduce logging frequency in the FSDP2 cookbook.
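The second review mentions incorrect logic in gradient synchronization for accumulation steps. A common shape of this bug is an off-by-one in deciding which micro-batch triggers the cross-rank all-reduce. The helper below is a hypothetical sketch of the intended predicate, not the framework's actual code: gradients should be synchronized only on the last micro-step of each accumulation window.

```python
def should_sync(micro_step, accumulation_steps):
    """Return True only on the last micro-batch of an accumulation window.

    micro_step is zero-based. Comparing `micro_step % accumulation_steps == 0`
    instead (a typical off-by-one) would sync on the *first* micro-batch,
    all-reducing partially accumulated gradients one step too early.
    """
    return (micro_step + 1) % accumulation_steps == 0
```

In PyTorch DDP terms, micro-steps where this returns False would typically run under the `no_sync()` context manager to skip the all-reduce.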
PR type
PR information
Experiment results