
fix: normalize rewards per-group when sample counts are unequal#1655

Open
dubin555 wants to merge 1 commit into THUDM:main from dubin555:oss-scout/verify-fix-group-norm-unequal-samples

Conversation

@dubin555 dubin555 commented Mar 2, 2026

Summary

Fix group-level reward normalization in _post_process_rewards that silently degrades to batch-level normalization when training samples have unequal group sizes (e.g., due to early termination or aborted generations).

Root cause: In the else branch (unequal group sizes), rewards.view(-1, rewards.shape[-1]) on a 1D tensor of size N produces shape (1, N), so mean(dim=-1) computes a single global mean across all samples rather than per-group means. This makes the normalization incorrect — groups with higher raw rewards remain biased high, and groups with lower raw rewards remain biased low.
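The shape bug is easy to reproduce in isolation (a minimal sketch with illustrative values; the variable names are not taken from the actual code):

```python
import torch

# Nine rewards from three groups of unequal size (4, 2, 3).
rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0])

# On a 1D tensor, shape[-1] is the total sample count N = 9,
# so view(-1, N) yields a single row of shape (1, 9) ...
reshaped = rewards.view(-1, rewards.shape[-1])
print(reshaped.shape)         # torch.Size([1, 9])

# ... and mean(dim=-1) collapses it to ONE global mean (50 / 9 ≈ 5.556)
# instead of one mean per group.
print(reshaped.mean(dim=-1))  # tensor([5.5556])
```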

Fix: Use sample.group_index to compute per-group mean (and optionally std) via masked operations. The equal-group-size path (reshape-based) is unchanged.

Before (buggy):

Raw rewards:     [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0]
Group indices:   [0, 0, 0, 0, 1, 1, 2, 2, 2]

Normalized group means: [-3.056, 5.444, 0.444]  (should all be ≈0.0)

After (fixed):

Normalized group means: [0.0, 0.0, 0.0]  ✓
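The numbers above can be reproduced with a mean-only sketch of the masked approach (illustrative only; the actual change in slime/ray/rollout.py reads group membership from sample.group_index and may additionally divide by the per-group std):

```python
import torch

rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0])
group_index = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2, 2])

# Buggy path: subtract one global mean; group means stay biased.
buggy = rewards - rewards.mean()
print([round(buggy[group_index == g].mean().item(), 3) for g in range(3)])
# [-3.056, 5.444, 0.444]

# Fixed path: subtract each group's own mean via a boolean mask per group.
fixed = rewards.clone()
for g in group_index.unique():
    mask = group_index == g
    fixed[mask] = rewards[mask] - rewards[mask].mean()
print([round(fixed[group_index == g].mean().item(), 3) for g in range(3)])
# [0.0, 0.0, 0.0]
```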

Closes #1414

Changes

  • slime/ray/rollout.py: Replace view(-1, shape[-1]) reshape with per-group masked normalization using group_index
  • tests/test_group_norm_unequal.py: Add regression test suite (5 cases: mean normalization, std normalization, single-sample groups, equal-group backward compatibility, and explicit verification that old logic was buggy)

Test plan

  • New regression tests: 5/5 passed
  • Existing tests unaffected (test_chunked_gae 9/9, test_fsdp_import 1/1)
  • Training run with actual unequal group sizes (requires GPU)
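A regression test for the unequal-size case could look roughly like this (a sketch with hypothetical names; the PR's actual suite lives in tests/test_group_norm_unequal.py and exercises the real code path):

```python
import torch

def normalize_per_group(rewards: torch.Tensor, group_index: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the fixed per-group (mean-only) normalization."""
    out = rewards.clone()
    for g in group_index.unique():
        mask = group_index == g
        out[mask] = rewards[mask] - rewards[mask].mean()
    return out

def test_unequal_group_sizes_have_zero_mean():
    rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 6.0, 7.0])
    group_index = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2, 2])
    out = normalize_per_group(rewards, group_index)
    for g in group_index.unique():
        assert out[group_index == g].mean().abs().item() < 1e-5

def test_single_sample_group_is_zeroed():
    # A group of size 1 normalizes to exactly zero: its reward equals its mean.
    rewards = torch.tensor([3.0, 1.0, 2.0])
    group_index = torch.tensor([0, 1, 1])
    out = normalize_per_group(rewards, group_index)
    assert out[0].item() == 0.0

test_unequal_group_sizes_have_zero_mean()
test_single_sample_group_is_zeroed()
```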

…M#1414)

When training samples have unequal group sizes, `_post_process_rewards`
fell into the else branch which did `rewards.view(-1, rewards.shape[-1])`
on a 1D tensor. Since `shape[-1]` equals the total sample count, this
reshaped to (1, N), causing `mean(dim=-1)` to compute a single global
mean instead of per-group means — making group normalization incorrect.

Fix: use `sample.group_index` to identify groups and compute per-group
mean (and optionally std) normalization via masked operations. The
equal-group-size path (if branch) is unchanged.

Closes THUDM#1414

Successfully merging this pull request may close these issues.

Bug: Group normalization in _post_process_rewards normalizes across entire batch when sample counts are unequal

1 participant