
Add MMLU benchmark evaluation to evals#1183

Open
CarlG0123 wants to merge 3 commits into TransformerLensOrg:dev from CarlG0123:add-mmlu-eval

Conversation

@CarlG0123

Summary

  • Add mmlu_eval() function for evaluating models on the MMLU (Massive Multitask Language Understanding) benchmark
  • Add make_mmlu_data_loader() for loading MMLU data from HuggingFace (cais/mmlu dataset)
  • Add MMLU_SUBJECTS (all 57 subjects) and MMLU_ANSWER_LETTERS module-level constants
  • Uses the standard MMLU zero-shot letter-prediction format: show all choices (A–D) in the prompt, then compare log probabilities for each answer-letter token
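The letter-prediction scheme described above can be sketched without TransformerLens. The function names here (`format_mmlu_prompt`, `pick_answer`) are illustrative stand-ins, not the PR's actual API; only `MMLU_ANSWER_LETTERS` matches a name the PR introduces:

```python
# Sketch of the zero-shot MMLU letter-prediction format described above.
# format_mmlu_prompt and pick_answer are hypothetical helpers, not the PR's API.

MMLU_ANSWER_LETTERS = ["A", "B", "C", "D"]


def format_mmlu_prompt(question: str, choices: list) -> str:
    """Show the question and all four choices, then prompt for a letter."""
    lines = [question]
    for letter, choice in zip(MMLU_ANSWER_LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def pick_answer(letter_log_probs: dict) -> str:
    """Pick the letter whose token the model assigns the highest log probability."""
    return max(MMLU_ANSWER_LETTERS, key=lambda letter: letter_log_probs[letter])
```

In the real evaluation the log probabilities would come from a forward pass over the formatted prompt; accuracy is then the fraction of questions where the picked letter matches the labelled answer.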

Changes

  • transformer_lens/evals.py: Added MMLU_SUBJECTS, MMLU_ANSWER_LETTERS, make_mmlu_data_loader(), and mmlu_eval()
  • tests/acceptance/test_evals.py: Added 5 tests covering data loading (single/multiple/invalid subjects) and evaluation

Testing

  • Verified with multiple models: Qwen1.5-7B (~40%), Mistral-7B (~45%), GPT-2, Pythia-1.4b
  • Results align with published zero-shot MMLU benchmarks for these model sizes
  • All existing tests continue to pass

Test plan

  • pytest tests/acceptance/test_evals.py -v
  • Verify MMLU data loading for single and multiple subjects
  • Verify evaluation produces reasonable accuracy scores
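One of the checks in the test plan (rejecting invalid subjects) can be sketched as a plain unit of logic. `validate_subjects` is a hypothetical stand-in for the validation inside `make_mmlu_data_loader`, and only 3 of the 57 subjects are listed here for brevity:

```python
# Hypothetical sketch of the invalid-subject check exercised by the tests.
# validate_subjects stands in for logic inside make_mmlu_data_loader.

MMLU_SUBJECTS = {"abstract_algebra", "anatomy", "astronomy"}  # 3 of the 57 subjects


def validate_subjects(subjects):
    """Accept a single subject or a list; raise ValueError on unknown names."""
    if isinstance(subjects, str):
        subjects = [subjects]
    for subject in subjects:
        if subject not in MMLU_SUBJECTS:
            raise ValueError(f"Unknown MMLU subject: {subject!r}")
    return subjects
```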

🤖 Generated with Claude Code

Carl Gross and others added 2 commits February 12, 2026 16:00
Skip MMLU docstring examples in doctest runs since they require
network access (HuggingFace dataset download) and may require GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CarlG0123
Author

@jlarson4 @bryce13950 Hi, could one of you take a look at this PR to add MMLU benchmark evaluation? Thanks!

subjects: Optional[Union[str, List[str]]] = None,
split: str = "test",
num_samples: Optional[int] = None,
device: str = "cuda",
Collaborator

If a user loads a model on CPU but forgets to pass device="cpu" here (or vice versa), they'll get a device mismatch error. Since the model already knows its device, you can just use model.cfg.device internally and drop this parameter. That's also consistent with how ioi_eval works (it doesn't take a device arg).
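The suggested fix can be sketched in isolation. `DummyModel` below is a stand-in for a HookedTransformer; the attribute path `model.cfg.device` matches how TransformerLens configs expose the device, but the helper name is illustrative:

```python
# Sketch of the review suggestion: read the device from the model's own
# config instead of accepting a device parameter. DummyModel/DummyCfg are
# stand-ins for a HookedTransformer and its config.

class DummyCfg:
    def __init__(self, device: str):
        self.device = device


class DummyModel:
    def __init__(self, device: str):
        self.cfg = DummyCfg(device)


def eval_device(model) -> str:
    # Move inputs to this device so they always match the model's weights,
    # with no extra parameter for the caller to get wrong.
    return model.cfg.device
```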

Collaborator
Outside of this, it looks good to my eyes! Make that tweak and I'll plan on including it in the next 2.x release.

Author
Thanks for the speedy review; it should be updated now!

Author

@CarlG0123, Mar 7, 2026

@jlarson4 I also recently made another PR (#1195) which adds support for OpenAI's
open-weight MoE model, gpt-oss-20B. If you have time, it would be great if I could get that
approved as well. Thanks in advance!

Address PR review feedback: use model.cfg.device internally instead of
accepting a device parameter, consistent with ioi_eval. This prevents
device mismatch errors when users forget to pass the correct device.
