
feat: token merging for image classification #537

Open
rensortino wants to merge 9 commits into PrunaAI:main from rensortino:feat/token-merging

Conversation

@rensortino
Contributor

Description

This PR introduces the Token Merging (ToMe) algorithm for HuggingFace Vision Transformer models. Token Merging progressively merges similar tokens between the attention and MLP stages of each transformer block, significantly reducing the number of tokens and speeding up inference with minimal quality loss.

With google/vit-base-patch16-224, the speedup is over 2x at r=8.
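For intuition, here is a back-of-the-envelope sketch (assuming the standard ViT-Base setup of 12 layers and 197 tokens for a 224x224 image with 16x16 patches) of how a constant schedule shrinks the sequence:

```python
def token_counts(n_tokens: int = 197, n_layers: int = 12, r: int = 8) -> list[int]:
    """Number of tokens entering each transformer layer when r tokens
    are merged away per layer (constant schedule)."""
    counts = []
    for _ in range(n_layers):
        counts.append(n_tokens)
        n_tokens = max(n_tokens - r, 1)  # never merge below a single token
    return counts

print(token_counts())
# -> [197, 189, 181, 173, 165, 157, 149, 141, 133, 125, 117, 109]
```

The final layer still merges its r tokens, so only 101 of the original 197 tokens reach the classification head; since self-attention cost grows quadratically with sequence length, roughly halving the token count is consistent with the reported 2x speedup.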

Key Changes:

Token Merging Algorithm:

  • Implements the ToMe algorithm, adapted from the facebook/ToMe reference implementation
  • Custom ViT module classes (ToMeViTLayer, ToMeViTSelfAttention) that extend HuggingFace transformers
  • Supports proportional attention weighting based on merged token sizes
  • Bipartite soft matching for intelligent token pair selection
  • Configurable token reduction schedule with per-layer control
  • Model wrapper for state management across forward passes
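A minimal sketch of the bipartite soft matching step listed above (simplified: the PR's implementation additionally protects the CLS token, tracks merged-token sizes, and returns merge/unmerge functions rather than merging in place):

```python
import torch

def bipartite_merge(x: torch.Tensor, metric: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs of x.

    x: (B, N, C) token features; metric: (B, N, D) similarity features
    (ToMe uses the mean of the attention keys). Merged pairs are averaged.
    """
    metric = metric / metric.norm(dim=-1, keepdim=True)  # cosine similarity
    a, b = metric[:, ::2], metric[:, 1::2]               # alternating bipartite split
    scores = a @ b.transpose(-1, -2)                     # a-to-b similarities

    node_max, node_idx = scores.max(dim=-1)              # best b-match per a-token
    order = node_max.argsort(dim=-1, descending=True)    # most similar pairs first
    src_idx, unm_idx = order[:, :r], order[:, r:]        # merge r, keep the rest
    dst_idx = node_idx.gather(-1, src_idx)               # merge targets in b

    B, _, C = x.shape
    xa, xb = x[:, ::2], x[:, 1::2].clone()
    src = xa.gather(1, src_idx[..., None].expand(B, r, C))
    unm = xa.gather(1, unm_idx[..., None].expand(B, unm_idx.shape[1], C))
    xb.scatter_reduce_(1, dst_idx[..., None].expand(B, r, C), src,
                       reduce="mean", include_self=True)  # average into targets
    return torch.cat([unm, xb], dim=1)                    # (B, N - r, C)
```

The alternating split guarantees no chains of merges within one layer, which is what keeps the matching cheap enough to run between attention and MLP at every block.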

Testing Infrastructure:

  • Added ViT model fixtures for comprehensive testing
  • Token Merging test class with validation scenarios

Related Issue

Fixes #399

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Token Merging algorithm tested with HuggingFace ViT models
  • Test fixtures added for google/vit-base-patch16-224 model family
  • Integration tests verify proper token reduction and attention output handling
  • Validated compatibility with existing Pruna pipeline

Implementation Details

Token Merging Core Features:

  1. Bipartite Soft Matching: Intelligently selects which token pairs to merge based on key similarity
  2. Proportional Attention: Adjusts attention weights by the log of merged token sizes
  3. Configurable Reduction Schedule:
    • Constant r across all layers
    • Per-layer list specification
    • Inflection-based schedules (increasing/decreasing/constant)
  4. Class Swapping Pattern: Dynamically replaces HF module classes at runtime to inject ToMe behavior
  5. Metric Storage: Uses key layer mean as similarity metric for matching
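The proportional attention of item 2 can be sketched as follows (a simplified single-head version; `size` holds the number of original tokens behind each key and starts as all ones):

```python
import torch

def proportional_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           size: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where each merged key counts
    proportionally to the number of original tokens it represents.

    q, k, v: (B, N, D); size: (B, N, 1).
    """
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, N, N)
    logits = logits + size.log().transpose(-2, -1)         # bias before softmax
    return logits.softmax(dim=-1) @ v
```

This also motivates the eager-attention requirement: fused kernels give no hook between the QK matmul and the softmax, so the log(size) bias must either be injected in an eager implementation, as here, or folded into an additive attention mask.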

Hyperparameters:

  • r (int, 0-128): Number of tokens to merge per layer (default: 16)
  • trace_source (bool): Track merge provenance for visualization
  • prop_attn (bool): Enable proportional attention weighting (default: True)
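One plausible way to expand these hyperparameters into the per-layer schedule (illustrative only; the `parse_r` name and the exact inflection formula here are assumptions, not the PR's code):

```python
def parse_r(num_layers: int, r, inflection: float = 0.0) -> list[int]:
    """Expand r into a list of per-layer merge counts.

    r: an int for a constant schedule, or an explicit per-layer list.
    inflection: > 0 merges more in later layers, < 0 in earlier ones.
    """
    if isinstance(r, (list, tuple)):
        return list(r) + [0] * (num_layers - len(r))  # pad short lists
    if inflection == 0.0:
        return [int(r)] * num_layers
    # linear ramp around the mean, keeping the total budget ~ r * num_layers
    weights = [1.0 + inflection * (2 * i / (num_layers - 1) - 1)
               for i in range(num_layers)]
    scale = int(r) * num_layers / sum(weights)
    return [round(w * scale) for w in weights]
```

For example, `parse_r(12, 8)` yields a constant `[8] * 12`, while `parse_r(12, 8, 1.0)` ramps from 0 up to 16 so that later, more redundant layers absorb most of the merging.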

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

Design Decisions:

  1. Module-level class definitions: ToMeViTLayer and related classes are defined at module level (not inside methods) to ensure they are picklable for distributed training and model serialization.

  2. Eager attention enforcement: The ToMeViTSelfAttention class uses eager attention computation to inject the proportional attention bias between QK matmul and softmax operations.

  3. Shared mutable state: All ToMe modules share a single tome_info dict for efficient state management across layers.
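Decisions 1 and 3 can be illustrated together with a stripped-down sketch (class and attribute names mirror the description above but are illustrative, not the PR's exact API):

```python
class ViTLayer:                          # stand-in for the HF transformers class
    def forward(self, x):
        return x

class ToMeViTLayer(ViTLayer):            # module-level definition => picklable
    """Consumes this layer's merge budget from state shared by all layers."""
    def forward(self, x):
        r = self._tome_info["r"].pop(0)  # per-layer budget, set by the wrapper
        # ... attention, merge r token pairs, MLP ...
        return x

def apply_tome(layers, tome_info: dict) -> None:
    """Swap each layer's class in place and attach one shared state dict."""
    for layer in layers:
        layer.__class__ = ToMeViTLayer   # runtime class swap, no re-init
        layer._tome_info = tome_info     # every layer sees the same dict
```

Because `__class__` is reassigned in place, the layers keep their trained weights, and because `ToMeViTLayer` lives at module level, pickle can resolve it by qualified name when the model is serialized.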

Future Enhancements:

  • Extension to other transformer architectures (Flux, SAM, etc.)
  • Support for custom attention mechanisms

@sdiazlor sdiazlor requested a review from llcnt February 17, 2026 14:38
@llcnt (Collaborator) left a comment:

Thanks a lot for the nice contribution :)
Could you provide a working example of this new integration, please (e.g. a script or a notebook)? I tried to run it with a ViTForImageClassification but it fails (see comment below). I also tried to run it with a pipeline from transformers: the smashing works, but the inference fails. The base pipeline can accept str (image URLs) or raw images, but the smashed pipeline cannot; it would be nice to fix this so that the base model and the smashed one behave similarly. I tried to follow what you did in the new test by preprocessing the image before feeding it into the smashed pipeline, but I still get the error: TypeError: ViTAttention.forward() got an unexpected keyword argument 'output_attentions'.
Also, could you fix the cursor[bot] comments (some of them are quite relevant ;) )?
Thanks in advance!

@github-actions

github-actions bot commented Mar 6, 2026

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Mar 6, 2026
@rensortino
Contributor Author

Hi, thanks a lot for the feedback! I am working on the issues raised by Bugbot and will shortly provide a notebook with a basic example of how to test the algorithm.

@rensortino rensortino force-pushed the feat/token-merging branch from a6a79ec to be3a6ed Compare March 6, 2026 17:23
@rensortino
Contributor Author

Here you can find a notebook to test the algorithm on HF models and pipelines.

@github-actions github-actions bot removed the stale label Mar 7, 2026
@github-actions

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Mar 17, 2026
@llcnt
Collaborator

llcnt commented Mar 18, 2026

bugbot run

@llcnt llcnt removed the stale label Mar 18, 2026
@cursor (bot) left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment @cursor review or bugbot run to trigger another review on this PR.

    def forward(self, *args: Any, **kwargs: Any) -> Any:
        """Initialise ToMe state and forward through the wrapped model."""
        self._tome_info["r"] = self.parsed_r

Forward pass mutates parsed_r, breaking subsequent inferences

High Severity

ToMeModelWrapper.forward assigns self.parsed_r directly to self._tome_info["r"], making them the same list object. Each ToMeViTLayer.forward then calls .pop(0) on that list, which destructively mutates self.parsed_r. After the first forward pass, self.parsed_r is empty. Any subsequent forward call assigns the empty list, and the first layer's .pop(0) raises an IndexError. The assignment needs to be a copy, e.g. list(self.parsed_r).
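The aliasing is easy to reproduce in isolation, and a shallow copy at assignment time fixes it (variable names mirror the report):

```python
parsed_r = [8] * 12               # schedule computed once at smash time

tome_info = {"r": parsed_r}       # BUG: both names alias one list object
for _ in range(12):
    tome_info["r"].pop(0)         # each ToMeViTLayer pops its budget
assert parsed_r == []             # the stored schedule was destroyed

parsed_r = [8] * 12
tome_info["r"] = list(parsed_r)   # FIX: copy, so pops drain only the copy
for _ in range(12):
    tome_info["r"].pop(0)
assert parsed_r == [8] * 12       # schedule intact for the next forward pass
```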

Additional Locations (1)

@llcnt (Collaborator) left a comment:

Sorry for the delay, I was stuck with other tasks :(
Thanks for the updates, and for the notebook!
I was not able to run the whole notebook, though, because:

  • is_vit is not implemented. I guess you added it recently but did not re-run the notebook;
  • TypeError: ViTAttention.forward() got an unexpected keyword argument 'output_attentions' pops up when I run inference with the smashed model, I guess because you are using an old version of transformers. Can you print your version? And make sure the code is compatible with newer versions (4.56.0 is a good starting point, I would say) ;)

Thank you again for your contribution :)



Development

Successfully merging this pull request may close these issues.

[FEATURE] Implement Token Merging

2 participants