
Add MMLU benchmark evaluation to evals#1183

Open
CarlG0123 wants to merge 3 commits into TransformerLensOrg:dev from CarlG0123:add-mmlu-eval

Conversation

@CarlG0123

Summary

  • Add mmlu_eval() function for evaluating models on the MMLU (Massive Multitask Language Understanding) benchmark
  • Add make_mmlu_data_loader() for loading MMLU data from HuggingFace (cais/mmlu dataset)
  • Add MMLU_SUBJECTS (all 57 subjects) and MMLU_ANSWER_LETTERS module-level constants
  • Uses the standard MMLU zero-shot letter-prediction format: show all choices (A–D) in the prompt, then compare log probabilities for each answer-letter token
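The letter-prediction scheme described above can be sketched without TransformerLens. The function names here (`format_mmlu_prompt`, `pick_answer`) are illustrative stand-ins, not the PR's actual API; only `MMLU_ANSWER_LETTERS` matches a name the PR introduces:

```python
# Sketch of the zero-shot MMLU letter-prediction format described above.
# format_mmlu_prompt and pick_answer are hypothetical helpers, not the PR's API.

MMLU_ANSWER_LETTERS = ["A", "B", "C", "D"]


def format_mmlu_prompt(question: str, choices: list) -> str:
    """Show the question and all four choices, then prompt for a letter."""
    lines = [question]
    for letter, choice in zip(MMLU_ANSWER_LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def pick_answer(letter_log_probs: dict) -> str:
    """Pick the letter whose token the model assigns the highest log probability."""
    return max(MMLU_ANSWER_LETTERS, key=lambda letter: letter_log_probs[letter])
```

In the real evaluation the log probabilities would come from a forward pass over the formatted prompt; accuracy is then the fraction of questions where the picked letter matches the labelled answer.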

Changes

  • transformer_lens/evals.py: Added MMLU_SUBJECTS, MMLU_ANSWER_LETTERS, make_mmlu_data_loader(), and mmlu_eval()
  • tests/acceptance/test_evals.py: Added 5 tests covering data loading (single/multiple/invalid subjects) and evaluation

Testing

  • Verified with multiple models: Qwen1.5-7B (~40%), Mistral-7B (~45%), GPT-2, Pythia-1.4b
  • Results align with published zero-shot MMLU benchmarks for these model sizes
  • All existing tests continue to pass

Test plan

  • pytest tests/acceptance/test_evals.py -v
  • Verify MMLU data loading for single and multiple subjects
  • Verify evaluation produces reasonable accuracy scores
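One of the checks in the test plan (rejecting invalid subjects) can be sketched as a plain unit of logic. `validate_subjects` is a hypothetical stand-in for the validation inside `make_mmlu_data_loader`, and only 3 of the 57 subjects are listed here for brevity:

```python
# Hypothetical sketch of the invalid-subject check exercised by the tests.
# validate_subjects stands in for logic inside make_mmlu_data_loader.

MMLU_SUBJECTS = {"abstract_algebra", "anatomy", "astronomy"}  # 3 of the 57 subjects


def validate_subjects(subjects):
    """Accept a single subject or a list; raise ValueError on unknown names."""
    if isinstance(subjects, str):
        subjects = [subjects]
    for subject in subjects:
        if subject not in MMLU_SUBJECTS:
            raise ValueError(f"Unknown MMLU subject: {subject!r}")
    return subjects
```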

🤖 Generated with Claude Code

Carl Gross and others added 2 commits February 12, 2026 16:00
Skip MMLU docstring examples in doctest runs since they require
network access (HuggingFace dataset download) and may require GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CarlG0123
Author

@jlarson4 @bryce13950 Hi, could one of you take a look at this PR to add MMLU benchmark evaluation? Thanks!

subjects: Optional[Union[str, List[str]]] = None,
split: str = "test",
num_samples: Optional[int] = None,
device: str = "cuda",
Collaborator

If a user loads a model on CPU but forgets to pass device="cpu" here (or vice versa), they'll get a device mismatch error. Since the model already knows its device, you can just use model.cfg.device internally and drop this parameter. That's also consistent with how ioi_eval works (it doesn't take a device arg).
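The suggested fix can be sketched in isolation. `DummyModel` below is a stand-in for a HookedTransformer; the attribute path `model.cfg.device` matches how TransformerLens configs expose the device, but the helper name is illustrative:

```python
# Sketch of the review suggestion: read the device from the model's own
# config instead of accepting a device parameter. DummyModel/DummyCfg are
# stand-ins for a HookedTransformer and its config.

class DummyCfg:
    def __init__(self, device: str):
        self.device = device


class DummyModel:
    def __init__(self, device: str):
        self.cfg = DummyCfg(device)


def eval_device(model) -> str:
    # Move inputs to this device so they always match the model's weights,
    # with no extra parameter for the caller to get wrong.
    return model.cfg.device
```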

Collaborator
Outside of this, it looks good to my eyes! Make that tweak and I'll plan on including it in the next 2.x release.

Author
Thanks for the speedy review; it should be updated now!

Author

@CarlG0123, Mar 7, 2026

@jlarson4 I also recently made another PR (#1195) which adds support for OpenAI's
open-weight MoE model, gpt-oss-20B. If you have time, it would be great if I could get that
approved as well. Thanks in advance!

Address PR review feedback: use model.cfg.device internally instead of
accepting a device parameter, consistent with ioi_eval. This prevents
device mismatch errors when users forget to pass the correct device.
