[ACL 2026] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
2Department of Computing, Hong Kong Polytechnic University
3School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
† Corresponding author
Official Repository: This is an open-source implementation of the paper "TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval".
TEMA (Text-oriented Entity Mapping Architecture) is the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios while seamlessly accommodating simple modifications. Prevailing CIR setups rely on simple modification texts, which typically cover only a limited range of salient changes. This induces two critical limitations highly relevant to practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.
To bring CIR closer to real-world use, we introduce two instruction-rich multi-modification datasets: M-FashionIQ and M-CIRR. Through MMT parsing and entity mapping, TEMA actively perceives and structurally models these complex modifications to achieve precise retrieval.
- [2026.04.07] TEMA was accepted by ACL 2026!
- [2026.04.06] Released all training and evaluation code.
- New Benchmarks (M-FashionIQ & M-CIRR): We construct two instruction-intensive datasets that replace short, simplistic texts with Multi-Modification Texts (MMT). These are generated by an MLLM and verified by human annotators, and explicitly present constraint structures with multiple entities and clauses.
- MMT Parsing Assistant (PA): Designed to address "Insufficient Entity Coverage". It uses an LLM-based text summarizer and a Consistency Detector during training to enhance the exposure and coverage of modified entities through summarization and consistency checks.
- MMT-oriented Entity Mapping (EM): Tackles the "Clause-Entity Misalignment" issue. It introduces learnable queries that consolidate multiple clauses referring to the same entity on the text side and align them with the corresponding visual entities on the image side, stabilizing "one-to-many" relationship modeling.
- Superior Performance: Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios.
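To make the Entity Mapping idea above concrete, here is a minimal PyTorch sketch, not the paper's actual implementation: a small set of learnable queries cross-attends over text clause tokens and over image patch tokens, so each query slot can gather all clauses describing one entity and ground them in the matching visual region. All dimensions, module choices, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntityMapper(nn.Module):
    """Illustrative sketch (not the official TEMA implementation):
    learnable queries consolidate clause tokens on the text side and
    pool the corresponding visual entities on the image side."""

    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        # One query slot per (potential) modified entity.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, L_t, dim); image_tokens: (B, L_i, dim)
        q = self.queries.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        # Each query pools the clauses that refer to "its" entity.
        text_entities, _ = self.text_attn(q, text_tokens, text_tokens)
        # The same query slots pool the corresponding visual regions.
        image_entities, _ = self.image_attn(q, image_tokens, image_tokens)
        return text_entities, image_entities  # both (B, num_queries, dim)
```

Because the same query slots are used on both modalities, a "one-to-many" mapping (one entity, several clauses) collapses into a stable one-to-one comparison between query outputs.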
Figure 3. Performance comparison on M-FashionIQ and M-CIRR in terms of R@K (%). The overall best results are in bold, and the best results among baselines are underlined. The Avg metric for M-CIRR denotes (R@5 + Rsubset@1) / 2.
- Introduction
- Key Contributions
- Architecture
- Experimental Results
- Installation
- Data Preparation
- Quick Start
- Acknowledgement
- Citation
1. Clone the repository
git clone https://github.com/lee-zixu/ACL26-TEMA
cd TEMA

2. Setup Python Environment
The code is tested with Python 3.10.8 and PyTorch 2.5.1 on an NVIDIA A40 GPU (48 GB). We recommend using Anaconda to create an isolated virtual environment:
conda create -n tema python=3.10.8
conda activate tema

# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install transformers==4.25.0

We evaluated our framework on our newly proposed M-FashionIQ and M-CIRR datasets. Please prepare the data by following the steps below:
First, download the FashionIQ dataset by following the instructions in its official repository. Then, to obtain our proposed M-FashionIQ dataset, replace the captions folder with our provided mmt_captions folder.
Ensure the folder structure matches the following:
M-FashionIQ
├── mmt_captions
│   ├── cap.dress.[train | val].mmt.json
│   ├── cap.toptee.[train | val].mmt.json
│   └── cap.shirt.[train | val].mmt.json
├── image_splits
│   ├── split.dress.[train | val | test].json
│   ├── split.toptee.[train | val | test].json
│   └── split.shirt.[train | val | test].json
├── dress
│   └── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
├── shirt
│   └── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
└── toptee
    └── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
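Before training, it can help to verify the layout programmatically. The helper below is a hypothetical convenience script, not part of the released codebase; it checks the M-FashionIQ structure above using only the Python standard library. Adjust the expected paths if your copy differs.

```python
from pathlib import Path

def check_mfashioniq(root):
    """Hypothetical helper (not in the official codebase): return a list
    of paths that are missing from the expected M-FashionIQ layout."""
    root = Path(root)
    missing = []
    for cat in ("dress", "toptee", "shirt"):
        # Caption files exist for train/val only.
        for split in ("train", "val"):
            cap = root / "mmt_captions" / f"cap.{cat}.{split}.mmt.json"
            if not cap.is_file():
                missing.append(str(cap))
        # Split files also cover the test split.
        for split in ("train", "val", "test"):
            spl = root / "image_splits" / f"split.{cat}.{split}.json"
            if not spl.is_file():
                missing.append(str(spl))
        # Per-category image folders.
        if not (root / cat).is_dir():
            missing.append(str(root / cat))
    return missing
```

An empty return value means every expected caption file, split file, and image folder is in place.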
First, download the CIRR dataset by following the instructions in its official repository. Then, to obtain our proposed M-CIRR dataset, replace the captions folder with our provided mmt_captions folder.
Ensure the folder structure matches the following:
M-CIRR
├── train
│   └── [0 | 1 | 2 | ...]
│       └── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
├── dev
│   └── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
├── test1
│   └── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
└── mcirr
    ├── mmt_captions
    │   └── cap.rc2.[train | val | test1].mmt.json
    └── image_splits
        └── split.rc2.[train | val | test1].json
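The M-CIRR layout can be verified the same way. Again, this is a hypothetical standard-library helper, not part of the released codebase; image folders are only checked for existence since their contents are numerous.

```python
from pathlib import Path

def check_mcirr(root):
    """Hypothetical helper (not in the official codebase): return a list
    of paths that are missing from the expected M-CIRR layout."""
    root = Path(root)
    missing = []
    # Top-level image directories.
    for sub in ("train", "dev", "test1"):
        if not (root / sub).is_dir():
            missing.append(str(root / sub))
    # Caption and split files under mcirr/.
    for split in ("train", "val", "test1"):
        cap = root / "mcirr" / "mmt_captions" / f"cap.rc2.{split}.mmt.json"
        spl = root / "mcirr" / "image_splits" / f"split.rc2.{split}.json"
        for p in (cap, spl):
            if not p.is_file():
                missing.append(str(p))
    return missing
```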
To start training TEMA on your prepared datasets, execute the following command:
python3 train.py

Our implementation is based on the LAVIS framework. We express our sincere gratitude for their open-source contributions!
Ecosystem & Other Works from our Team
- ConeSep (CVPR'26): Web | Code
- Air-Know (CVPR'26): Web | Code
- ReTrack (AAAI'26): Web | Code | Paper
- INTENT (AAAI'26): Web | Code | Paper
- HUD (ACM MM'25): Web | Code | Paper
- OFFSET (ACM MM'25): Web | Code | Paper
- ENCODER (AAAI'25): Web | Code | Paper
- HABIT (AAAI'26): Web | Code | Paper
If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider citing our work:
@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}

We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:
- Open an Issue for discussions or bug reports.
- Submit a Pull Request to improve the codebase.
This project is released under the terms of the LICENSE file included in this repository.