[ACL 2026] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
2Department of Computing, Hong Kong Polytechnic University
3School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
† Corresponding author
Official Repository: This is an open-source implementation of the paper "TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval".
TEMA (Text-oriented Entity Mapping Architecture) is the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios while seamlessly accommodating simple modifications. Prevailing CIR setups rely on simple modification texts, which typically cover only a limited range of salient changes. This induces two critical limitations highly relevant to practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.
To bring CIR closer to real-world use, we introduce two instruction-rich multi-modification datasets: M-FashionIQ and M-CIRR. Through MMT parsing and entity mapping, TEMA actively perceives and structurally models these complex modifications to achieve precise retrieval.
- [2026.04.07] TEMA was accepted by ACL 2026!
- [2026.04.06] Released all training and evaluation code.
- New Benchmarks (M-FashionIQ & M-CIRR): We construct two instruction-intensive datasets that replace short, simplistic texts with Multi-Modification Texts (MMT). These are generated by an MLLM and verified by human annotators, and explicitly present constraint structures with multiple entities and clauses.
- MMT Parsing Assistant (PA): Designed to address "Insufficient Entity Coverage". It uses an LLM-based text summarizer and a Consistency Detector during training to enhance the exposure and coverage of modified entities through summarization and consistency checks.
- MMT-oriented Entity Mapping (EM): Tackles the "Clause-Entity Misalignment" issue. It introduces learnable queries that consolidate multiple clauses referring to the same entity on the text side and align them with the corresponding visual entities on the image side, stabilizing "one-to-many" relationship modeling.
- Superior Performance: Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios.
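To make the Entity Mapping idea above concrete, here is a minimal PyTorch sketch, not the paper's actual implementation: a small set of learnable queries cross-attends over text clause tokens and over image patch tokens, so each query slot can gather all clauses describing one entity and ground them in the matching visual region. All dimensions, module choices, and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntityMapper(nn.Module):
    """Illustrative sketch (not the official TEMA implementation):
    learnable queries consolidate clause tokens on the text side and
    pool the corresponding visual entities on the image side."""

    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        # One query slot per (potential) modified entity.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, L_t, dim); image_tokens: (B, L_i, dim)
        q = self.queries.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        # Each query pools the clauses that refer to "its" entity.
        text_entities, _ = self.text_attn(q, text_tokens, text_tokens)
        # The same query slots pool the corresponding visual regions.
        image_entities, _ = self.image_attn(q, image_tokens, image_tokens)
        return text_entities, image_entities  # both (B, num_queries, dim)
```

Because the same query slots are used on both modalities, a "one-to-many" mapping (one entity, several clauses) collapses into a stable one-to-one comparison between query outputs.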
Figure 3. Performance comparison on M-FashionIQ and M-CIRR in terms of R@K (%). The overall best results are in bold, and the best results among baselines are underlined. The Avg metric for M-CIRR denotes (R@5 + Rsubset@1) / 2.
- Introduction
- Key Contributions
- Architecture
- Experimental Results
- Installation
- Data Preparation
- Quick Start
- Acknowledgement
- Citation
1. Clone the repository
git clone https://github.com/lee-zixu/ACL26-TEMA
cd TEMA

2. Setup Python Environment
The code is tested with Python 3.10.8 and PyTorch 2.5.1 on an NVIDIA A40 GPU (48 GB). We recommend using Anaconda to create an isolated virtual environment:
conda create -n tema python=3.10.8
conda activate tema

# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install transformers==4.25.0

We evaluated our framework on our newly proposed M-FashionIQ and M-CIRR datasets. Please prepare the data by following the steps below:
First, download the FashionIQ dataset by following the instructions in its official repository. Then, to obtain our proposed M-FashionIQ dataset, replace the captions folder with our provided mmt_captions folder.
Ensure the folder structure matches the following:
M-FashionIQ
├── mmt_captions
│   ├── cap.dress.[train | val].mmt.json
│   ├── cap.toptee.[train | val].mmt.json
│   └── cap.shirt.[train | val].mmt.json
├── image_splits
│   ├── split.dress.[train | val | test].json
│   ├── split.toptee.[train | val | test].json
│   └── split.shirt.[train | val | test].json
├── dress
│   └── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
├── shirt
│   └── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
└── toptee
    └── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
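Before training, it can help to verify the layout programmatically. The helper below is a hypothetical convenience script, not part of the released codebase; it checks the M-FashionIQ structure above using only the Python standard library. Adjust the expected paths if your copy differs.

```python
from pathlib import Path

def check_mfashioniq(root):
    """Hypothetical helper (not in the official codebase): return a list
    of paths that are missing from the expected M-FashionIQ layout."""
    root = Path(root)
    missing = []
    for cat in ("dress", "toptee", "shirt"):
        # Caption files exist for train/val only.
        for split in ("train", "val"):
            cap = root / "mmt_captions" / f"cap.{cat}.{split}.mmt.json"
            if not cap.is_file():
                missing.append(str(cap))
        # Split files also cover the test split.
        for split in ("train", "val", "test"):
            spl = root / "image_splits" / f"split.{cat}.{split}.json"
            if not spl.is_file():
                missing.append(str(spl))
        # Per-category image folders.
        if not (root / cat).is_dir():
            missing.append(str(root / cat))
    return missing
```

An empty return value means every expected caption file, split file, and image folder is in place.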
First, download the CIRR dataset by following the instructions in its official repository. Then, to obtain our proposed M-CIRR dataset, replace the captions folder with our provided mmt_captions folder.
Ensure the folder structure matches the following:
M-CIRR
├── train
│   └── [0 | 1 | 2 | ...]
│       └── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
├── dev
│   └── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
├── test1
│   └── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
└── mcirr
    ├── mmt_captions
    │   └── cap.rc2.[train | val | test1].mmt.json
    └── image_splits
        └── split.rc2.[train | val | test1].json
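The M-CIRR layout can be verified the same way. Again, this is a hypothetical standard-library helper, not part of the released codebase; image folders are only checked for existence since their contents are numerous.

```python
from pathlib import Path

def check_mcirr(root):
    """Hypothetical helper (not in the official codebase): return a list
    of paths that are missing from the expected M-CIRR layout."""
    root = Path(root)
    missing = []
    # Top-level image directories.
    for sub in ("train", "dev", "test1"):
        if not (root / sub).is_dir():
            missing.append(str(root / sub))
    # Caption and split files under mcirr/.
    for split in ("train", "val", "test1"):
        cap = root / "mcirr" / "mmt_captions" / f"cap.rc2.{split}.mmt.json"
        spl = root / "mcirr" / "image_splits" / f"split.rc2.{split}.json"
        for p in (cap, spl):
            if not p.is_file():
                missing.append(str(p))
    return missing
```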
To start training TEMA on your prepared datasets, execute the following command:
python3 train.py

Our implementation is based on the LAVIS framework. We express our sincere gratitude for their open-source contributions!
Ecosystem & Other Works from our Team
- ConeSep (CVPR'26): Web | Code
- Air-Know (CVPR'26): Web | Code
- ReTrack (AAAI'26): Web | Code | Paper
- INTENT (AAAI'26): Web | Code | Paper
- HUD (ACM MM'25): Web | Code | Paper
- OFFSET (ACM MM'25): Web | Code | Paper
- ENCODER (AAAI'25): Web | Code | Paper
- HABIT (AAAI'26): Web | Code | Paper
If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider citing our work:
@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}

We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:
- Open an Issue for discussions or bug reports.
- Submit a Pull Request to improve the codebase.
This project is released under the terms of the LICENSE file included in this repository.