fix: normalize rewards per-group when sample counts are unequal by dubin555 · Pull Request #1655 · THUDM/slime

dubin555 · 2026-03-02T13:57:41Z


Summary

Fix group-level reward normalization in _post_process_rewards that silently degrades to batch-level normalization when training samples have unequal group sizes (e.g., due to early termination or aborted generations).

Root cause: In the else branch (unequal group sizes), rewards.view(-1, rewards.shape[-1]) is applied to a 1D tensor of size N. Because shape[-1] is N, the result has shape (1, N), so mean(dim=-1) computes a single global mean across all samples rather than per-group means. Subtracting that global mean leaves each group's offset intact: groups with higher raw rewards remain biased high, and groups with lower raw rewards remain biased low.
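The reshape behaviour described above can be checked in isolation. This is a minimal sketch assuming PyTorch is installed; the tensor values are the ones from the example in this PR, and the variable names are illustrative rather than taken from _post_process_rewards.

```python
import torch

# Nine rewards from three groups of unequal size (4, 2, 3).
rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0])

# On a 1D tensor, shape[-1] is the total element count N, so the
# reshape yields a single row of length N rather than one row per group.
reshaped = rewards.view(-1, rewards.shape[-1])
print(tuple(reshaped.shape))

# mean(dim=-1) therefore returns one value: the batch-wide mean.
row_means = reshaped.mean(dim=-1, keepdim=True)
print(tuple(row_means.shape))
```

With one row there is only one mean to subtract, which is why the else branch behaves like batch-level normalization.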

Fix: Use sample.group_index to compute per-group mean (and optionally std) via masked operations. The equal-group-size path (reshape-based) is unchanged.
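One way masked per-group normalization could look is sketched below. The helper normalize_per_group, its eps parameter, and the choice to only center single-sample groups are assumptions made for illustration; this is not the code from the PR, and the actual change in slime/ray/rollout.py may differ.

```python
import torch


def normalize_per_group(rewards, group_index, use_std=False, eps=1e-6):
    """Center (and optionally scale) rewards within each group.

    Hypothetical helper for illustration; not the code from this PR.
    """
    normalized = torch.empty_like(rewards)
    for g in torch.unique(group_index):
        mask = group_index == g  # boolean mask selecting one group
        group = rewards[mask]
        centered = group - group.mean()
        if use_std and group.numel() > 1:
            # Unbiased std is undefined for a single sample, so such
            # groups are only centered (an assumption of this sketch).
            centered = centered / (group.std() + eps)
        normalized[mask] = centered
    return normalized


rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0])
group_index = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2, 2])
normalized = normalize_per_group(rewards, group_index)
group_means = [normalized[group_index == g].mean().item() for g in range(3)]
print(group_means)
```

Looping over unique group ids keeps the sketch easy to read; a scatter-based version would avoid the Python loop when there are many groups.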

Before (buggy):

Raw rewards:     [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0]
Group indices:   [0, 0, 0, 0, 1, 1, 2, 2, 2]

Normalized group means: [-3.056, 5.444, 0.444]  (should all be ≈0.0)

After (fixed):

Normalized group means: [0.0, 0.0, 0.0]  ✓
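The figures above can be reproduced with plain Python arithmetic. The sketch below assumes mean-only normalization (no division by std), which is what the reported numbers are consistent with; the helper names are illustrative.

```python
rewards = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0]
group_indices = [0, 0, 0, 0, 1, 1, 2, 2, 2]


def mean(values):
    return sum(values) / len(values)


def group_means(values):
    """Mean of `values` within each group, ordered by group index."""
    groups = sorted(set(group_indices))
    return [mean([v for v, g in zip(values, group_indices) if g == k]) for k in groups]


# Before: every sample is centered on the single batch-wide mean (50 / 9).
global_mean = mean(rewards)
before = [r - global_mean for r in rewards]

# After: every sample is centered on the mean of its own group.
per_group = dict(zip(sorted(set(group_indices)), group_means(rewards)))
after = [r - per_group[g] for r, g in zip(rewards, group_indices)]

print([round(m, 3) for m in group_means(before)])  # [-3.056, 5.444, 0.444]
print([round(m, 3) for m in group_means(after)])   # [0.0, 0.0, 0.0]
```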

Closes #1414

Changes

slime/ray/rollout.py: Replace view(-1, shape[-1]) reshape with per-group masked normalization using group_index
tests/test_group_norm_unequal.py: Add regression test suite (5 cases: mean normalization, std normalization, single-sample groups, equal-group backward compatibility, and explicit verification that old logic was buggy)

Test plan

New regression tests: 5/5 passed
Existing tests unaffected (test_chunked_gae 9/9, test_fsdp_import 1/1)
Training run with actual unequal group sizes (requires GPU)

…M#1414)

When training samples have unequal group sizes, `_post_process_rewards` fell into the else branch which did `rewards.view(-1, rewards.shape[-1])` on a 1D tensor. Since `shape[-1]` equals the total sample count, this reshaped to (1, N), causing `mean(dim=-1)` to compute a single global mean instead of per-group means, making group normalization incorrect.

Fix: use `sample.group_index` to identify groups and compute per-group mean (and optionally std) normalization via masked operations. The equal-group-size path (if branch) is unchanged.

Closes THUDM#1414

dubin555 wants to merge 1 commit into THUDM:main from dubin555:oss-scout/verify-fix-group-norm-unequal-samples
