TEMA logo

[ACL 2026] ⚓ TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

¹ School of Software, Shandong University
² Department of Computing, Hong Kong Polytechnic University
³ School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
✉ Corresponding author


Official Repository: This is the open-source implementation of the paper "TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval".

📌 Introduction

TEMA (Text-oriented Entity Mapping Architecture) is the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios while seamlessly accommodating simple modifications. Prevailing CIR setups rely on simple modification texts, which typically cover only a limited range of salient changes. This induces two critical limitations highly relevant to practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.

To bring CIR closer to real-world use, we introduce two instruction-rich multi-modification datasets: M-FashionIQ and M-CIRR. Through MMT parsing and entity mapping, TEMA actively perceives and structurally models these complex modifications to achieve precise retrieval.

⬆ Back to top

📢 News

  • [2026.04.07] 🔥 TEMA was accepted by ACL 2026!
  • [2026.04.06] 🚀 Released all training and evaluation code.

⬆ Back to top

✨ Key Contributions

  • 📊 New Benchmarks (M-FashionIQ & M-CIRR): We construct two instruction-intensive datasets that replace short, simplistic texts with Multi-Modification Texts (MMT). These are generated by an MLLM and verified by human annotators, and explicitly present constraint structures with multiple entities and clauses.
  • 🧠 MMT Parsing Assistant (PA): Addresses "Insufficient Entity Coverage". During training, it uses an LLM-based text summarizer and a Consistency Detector to improve the exposure and coverage of modified entities through summarization and consistency checks.
  • 🔗 MMT-oriented Entity Mapping (EM): Tackles the "Clause-Entity Misalignment" issue. It introduces learnable queries to consolidate multiple clauses of the same entity on the text side and align them with the corresponding visual entities on the image side, stabilizing "one-to-many" relationship modeling.
  • 🏆 Superior Performance: Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both the original and multi-modification scenarios.

⬆ Back to top

πŸ—οΈ Architecture

1. Data Generation Pipeline


Figure 1. Pipeline for constructing our proposed multi-modification CIR datasets.

2. TEMA Framework

TEMA architecture

Figure 2. The overall framework of TEMA, comprising the MMT Parsing Assistant (PA) utilized only during training, and the MMT-oriented Entity Mapping (EM) module for multimodal alignment.

⬆ Back to top


🚀 Experimental Results


Figure 3. Performance comparison on M-FashionIQ and M-CIRR in terms of R@K (%). The overall best results are in bold, and the best results among baselines are underlined. The Avg metric on M-CIRR denotes (R@5 + Rsubset@1) / 2.

⬆ Back to top



📦 Installation

1. Clone the repository

git clone https://github.com/lee-zixu/ACL26-TEMA
cd ACL26-TEMA

2. Setup Python Environment

The code has been tested with Python 3.10.8 and PyTorch 2.5.1 on an NVIDIA A40 48 GB GPU. We recommend using Anaconda to create an isolated virtual environment:

conda create -n tema python=3.10.8
conda activate tema

# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install transformers==4.25.0
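To confirm the environment matches the versions above, a quick version check can help. The `check_env` helper below is hypothetical (not part of this repository); it matches by version prefix because local CUDA builds may report a suffixed version such as `2.5.1+cu121`.

```python
import importlib.metadata


def check_env(requirements):
    """Return {package: (installed_version_or_None, version_ok)}."""
    report = {}
    for pkg, wanted in requirements.items():
        try:
            ver = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            ver = None
        # Prefix match: CUDA wheels may report e.g. "2.5.1+cu121".
        report[pkg] = (ver, ver is not None and ver.startswith(wanted))
    return report


if __name__ == "__main__":
    # Versions taken from the setup instructions above.
    for pkg, (ver, ok) in check_env(
        {"torch": "2.5.1", "transformers": "4.25.0"}
    ).items():
        status = "OK" if ok else ("missing" if ver is None else f"mismatch ({ver})")
        print(f"{pkg}: {status}")
```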

⬆ Back to top


📂 Data Preparation

We evaluated our framework on our newly proposed M-FashionIQ and M-CIRR datasets. Please prepare the data by following the steps below:

1. Fashion-domain Dataset: M-FashionIQ

First, download the FashionIQ dataset following the instructions in the official repository. After downloading, to obtain our proposed M-FashionIQ dataset, replace the captions folder with our provided mmt_captions.

Ensure the folder structure matches the following:
├── M-FashionIQ
│   ├── mmt_captions
│   │   ├── cap.dress.[train | val].mmt.json
│   │   ├── cap.toptee.[train | val].mmt.json
│   │   ├── cap.shirt.[train | val].mmt.json
│   ├── image_splits
│   │   ├── split.dress.[train | val | test].json
│   │   ├── split.toptee.[train | val | test].json
│   │   ├── split.shirt.[train | val | test].json
│   ├── dress
│   │   ├── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
│   ├── shirt
│   │   ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
│   ├── toptee
│   │   ├── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
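A small script can sanity-check the layout before training. The `missing_fashioniq_files` helper below is hypothetical (not shipped with this repository) and only verifies the caption and split JSON files listed in the structure above, not the images.

```python
from pathlib import Path

CATEGORIES = ("dress", "toptee", "shirt")


def missing_fashioniq_files(root):
    """Return paths of expected M-FashionIQ metadata files absent under root."""
    root = Path(root)
    expected = []
    for cat in CATEGORIES:
        for split in ("train", "val"):
            expected.append(root / "mmt_captions" / f"cap.{cat}.{split}.mmt.json")
        for split in ("train", "val", "test"):
            expected.append(root / "image_splits" / f"split.{cat}.{split}.json")
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_fashioniq_files("M-FashionIQ")
    print("all metadata files present" if not missing else "\n".join(missing))
```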

2. Open-domain Dataset: M-CIRR

First, download the CIRR dataset following the instructions in the official repository. After downloading, to obtain our proposed M-CIRR dataset, replace the captions folder with our provided mmt_captions.

Ensure the folder structure matches the following:
├── M-CIRR
│   ├── train
│   │   ├── [0 | 1 | 2 | ...]
│   │   │   ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
│   ├── dev
│   │   ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
│   ├── test1
│   │   ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
│   ├── mcirr
│   │   ├── mmt_captions
│   │   │   ├── cap.rc2.[train | val | test1].mmt.json
│   │   ├── image_splits
│   │   │   ├── split.rc2.[train | val | test1].json
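The M-CIRR layout can be verified the same way. The `missing_mcirr_files` helper is again hypothetical; it checks the three image directories plus the caption and split JSON files from the structure above.

```python
from pathlib import Path


def missing_mcirr_files(root):
    """Return paths of expected M-CIRR directories/files absent under root."""
    root = Path(root)
    expected = [root / d for d in ("train", "dev", "test1")]
    for split in ("train", "val", "test1"):
        expected.append(root / "mcirr" / "mmt_captions" / f"cap.rc2.{split}.mmt.json")
        expected.append(root / "mcirr" / "image_splits" / f"split.rc2.{split}.json")
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_mcirr_files("M-CIRR")
    print("dataset layout OK" if not missing else "\n".join(missing))
```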

⬆ Back to top


🚀 Quick Start

Training Phase

To start training TEMA on your prepared datasets, execute the following command:

python3 train.py

⬆ Back to top


🤝 Acknowledgement

Our implementation is built on the LAVIS framework. We sincerely thank the authors for their open-source contributions!

⬆ Back to top


🔗 Related Projects

Ecosystem & Other Works from our Team

  • ConeSep (CVPR'26): Web | Code
  • Air-Know (CVPR'26): Web | Code
  • ReTrack (AAAI'26): Web | Code | Paper
  • INTENT (AAAI'26): Web | Code | Paper
  • HUD (ACM MM'25): Web | Code | Paper
  • OFFSET (ACM MM'25): Web | Code | Paper
  • ENCODER (AAAI'25): Web | Code | Paper
  • HABIT (AAAI'26): Web | Code | Paper

πŸ“ Citation

If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider citing our work:

@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}

⬆ Back to top

🫑 Support & Contributing

We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:

  • Open an Issue for discussions or bug reports.
  • Submit a Pull Request to improve the codebase.

⬆ Back to top

📄 License

This project is released under the terms of the LICENSE file included in this repository.
