1School of Software, Shandong University    2School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)    3School of Data Science, City University of Hong Kong
†Corresponding author
Accepted by AAAI 2025: A novel network designed to mine visual entities and modification actions, and bind implicit modification relations in Composed Image Retrieval (CIR).
Welcome to the official repository for ENCODER (Entity miNing and modifiCation relation binDing nEtwoRk).
Existing CIR approaches often struggle to model the modification relation between visual entities and modification actions due to three main challenges: irrelevant factor perturbation, vague semantic boundaries, and implicit modification relations. ENCODER tackles these by explicitly mining entities and actions and binding them through modality-shared queries, achieving state-of-the-art (SOTA) performance across multiple datasets.
- [Mar 2026] All code has been transferred to this GitHub repository.
- [Oct 2025] Based on feedback from researchers, we found that different versions of open_clip can affect model performance. To ensure consistent results, we have further clarified the environment dependencies (requirements.txt).
- [Sep 2025] We have updated the evaluation code and released "state_dict"-format ENCODER checkpoints for stable evaluation.
- [Apr 2025] We have released the full ENCODER code and checkpoints.
- [Dec 2024] ENCODER has been accepted by AAAI 2025.
Our framework introduces three innovative modules to achieve precise multimodal semantic alignment:
- Latent Factor Filter (LFF): Filters out irrelevant visual and textual factors with a dynamic threshold gating mechanism, keeping only the latent factors highly related to the modification semantics.
- Entity-Action Binding (EAB): Employs modality-shared Learnable Relation Queries (LRQ) to probe semantic boundaries, dynamically mining visual entities and modification actions and learning their implicit relations to bind them effectively.
- Multi-scale Composition (MSC): Guided by the entity-action binding, performs multi-scale feature composition to precisely push the composed query feature closer to the target image.
- SOTA Performance: Demonstrates strong generalization and substantial improvements (e.g., +19.8% on FashionIQ-Avg R@10) across both fashion-domain and open-domain datasets.
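To make the two core ideas concrete, here is a minimal NumPy sketch of threshold-gated filtering of latent factors (as in LFF) and modality-shared queries attending over the surviving tokens (as in EAB). All names, shapes, and the mean-based threshold are illustrative simplifications, not the repository's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_factor_filter(tokens, relevance):
    """Zero out tokens whose relevance falls below a dynamic threshold.

    tokens: (num_tokens, dim) latent factors from one modality.
    relevance: (num_tokens,) modification-relevance scores.
    """
    threshold = relevance.mean()                    # simple stand-in for a learned gate
    gate = (relevance > threshold).astype(tokens.dtype)
    return tokens * gate[:, None]

def shared_query_attention(queries, tokens):
    """Modality-shared queries attend over (filtered) tokens to pool
    entity/action features. queries: (num_queries, dim)."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores) @ tokens                 # (num_queries, dim)
```

Because the same queries are applied to both image and text tokens, the pooled features of the two modalities land in comparable slots, which is the intuition behind binding entities to their modification actions.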
ENCODER consistently outperforms existing baselines on four widely-used datasets.
(FashionIQ, Shoes, and Fashion200K are evaluated using Recall@K.)
(CIRR is evaluated using R@K and R_subset@K.)
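Recall@K counts a query as correct if the target image appears among its K highest-ranked gallery images. A self-contained sketch of the metric (variable names are illustrative; this is not the repository's evaluation code):

```python
import numpy as np

def recall_at_k(sim, target_idx, k):
    """sim: (num_queries, num_gallery) similarity matrix.
    target_idx: (num_queries,) index of each query's target image.
    Returns the fraction of queries whose target is in the top-k."""
    topk = np.argsort(-sim, axis=1)[:, :k]           # indices of the k best matches
    hits = (topk == target_idx[:, None]).any(axis=1)
    return hits.mean()

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.5]])
targets = np.array([2, 0])
print(recall_at_k(sim, targets, 1))  # 0.0 (neither target is ranked first)
print(recall_at_k(sim, targets, 2))  # 0.5
```

CIRR's R_subset@K follows the same computation, with the similarity matrix restricted to each query's small candidate subset.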
- Introduction
- News
- Key Features
- Architecture
- Experiment Results
- Installation
- Data Preparation
- Quick Start
- Project Structure
- Citation
- Acknowledgements
- Contact
1. Clone the repository

```shell
git clone https://github.com/YourUsername/ENCODER.git
cd ENCODER
```

2. Set up the environment

We recommend using Conda to manage your environment:

```shell
conda create -n encoder_env python=3.9
conda activate encoder_env

# Install PyTorch (ensure it matches your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install required packages
pip install -r requirements.txt
```

ENCODER is evaluated on FashionIQ, Shoes, Fashion200K, and CIRR. Please download the datasets from their official sources and arrange them as shown below. (You can modify the paths in datasets.py if needed.)
Download the Shoes dataset following the instructions in the official repository.
After downloading the dataset, ensure that the folder structure matches the following:
```
Shoes/
├── captions_shoes.json
├── eval_im_names.txt
├── relative_captions_shoes.json
├── train_im_names.txt
└── [womens_athletic_shoes | womens_boots | ...]
    └── [0 | 1]
        └── [img_womens_athletic_shoes_375.jpg | descr_womens_athletic_shoes_734.txt | ...]
```
Download the FashionIQ dataset following the instructions in the official repository.
After downloading the dataset, ensure that the folder structure matches the following:
```
FashionIQ/
├── captions
│   ├── cap.dress.[train | val | test].json
│   ├── cap.toptee.[train | val | test].json
│   └── cap.shirt.[train | val | test].json
├── image_splits
│   ├── split.dress.[train | val | test].json
│   ├── split.toptee.[train | val | test].json
│   └── split.shirt.[train | val | test].json
├── dress
│   └── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
├── shirt
│   └── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
└── toptee
    └── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]
```
Download the Fashion200K dataset following the instructions in the official repository.
After downloading the dataset, ensure that the folder structure matches the following:
```
Fashion200K/
├── test_queries.txt
├── labels
│   ├── dress_[train | test]_detect_all.txt
│   ├── jacket_[train | test]_detect_all.txt
│   ├── pants_[train | test]_detect_all.txt
│   ├── skirt_[train | test]_detect_all.txt
│   └── top_[train | test]_detect_all.txt
└── women
    └── [dresses | jackets | pants | skirts | tops]
```
Download the CIRR dataset following the instructions in the official repository.
After downloading the dataset, ensure that the folder structure matches the following:
```
CIRR/
├── train
│   └── [0 | 1 | 2 | ...]
│       └── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
├── dev
│   └── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
├── test1
│   └── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
└── cirr
    ├── captions
    │   └── cap.rc2.[train | val | test1].json
    └── image_splits
        └── split.rc2.[train | val | test1].json
```
To train ENCODER from scratch, use the train.py script. By default, it uses the AdamW optimizer with a learning rate of 5e-5 for the main network and 1e-6 for the CLIP backbone.
```shell
python train.py \
    --dataset cirr \
    --data_path ./data/cirr \
    --batch_size 128 \
    --epochs 10 \
    --lr 5e-5 \
    --output_dir ./checkpoints/encoder_cirr
```
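The two learning rates mentioned above (5e-5 for the new modules, 1e-6 for the CLIP backbone) imply separate optimizer parameter groups. A minimal PyTorch sketch of such a split (the `clip` and `head` module names are illustrative placeholders, not the repository's actual attributes):

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.clip = torch.nn.Linear(4, 4)   # stand-in for the pretrained CLIP backbone
        self.head = torch.nn.Linear(4, 4)   # stand-in for the new ENCODER modules

model = TinyModel()
optimizer = torch.optim.AdamW([
    {"params": model.clip.parameters(), "lr": 1e-6},  # backbone: small lr, gentle fine-tuning
    {"params": model.head.parameters(), "lr": 5e-5},  # new modules: larger lr, trained from scratch
])
```

Keeping the backbone learning rate two orders of magnitude smaller is a common way to fine-tune CLIP without destroying its pretrained representations.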
Our checkpoints are released on Google Drive. To test your trained model on the validation set, use the evaluate_model.py or test.py script:

```shell
python3 evaluate_model.py \
    --model_dir checkpoints/ENCODER_{Shoes,FashionIQ,Fashion200K,CIRR}.pth \
    --dataset {shoes, fashioniq, fashion200k, cirr} \
    --cirr_path "" \
    --fashioniq_path "" \
    --shoes_path "" \
    --fashion200k_path ""
```
To generate the predictions file for upload to the CIRR Evaluation Server using our model, please execute the following command:

```shell
python src/cirr_test_submission.py <model_path>
```

- `model_path <str>`: path to the ENCODER checkpoint trained on CIRR, e.g. "checkpoints/ENCODER_CIRR.pt"
```
ENCODER/
├── train.py                  # Main training loop and optimization [Eq. 16]
├── test.py                   # General testing & inference logic
├── evaluate_model.py         # Script to calculate R@K metrics
├── cirr_test_submission.py   # Generates JSON files for CIRR server evaluation
├── datasets.py               # Dataloaders for FashionIQ, CIRR, etc.
├── utils.py                  # Helper functions, logging, and metric tracking
├── token_wise_matching.py    # Implementation of Entity-Action Binding & LRQ
├── model_try2.py             # Core ENCODER network architecture (LFF, EAB, MSC)
└── requirements.txt          # Project dependencies
```
If you find this code or our paper useful for your research, please consider citing it:
```bibtex
@inproceedings{ENCODER,
  title={Encoder: Entity mining and modification relation binding for composed image retrieval},
  author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={5},
  pages={5101--5109},
  year={2025}
}
```
The implementation of this project references the CLIP4Cir framework. We express our sincere gratitude for this open-source contribution!
If you have any questions, feel free to open an issue or contact us at:
- Zixu Li: lzx@mail.sdu.edu.cn
Ecosystem & Other Works from our Team
- TEMA (ACL'26): Web | Code
- ConeSep (CVPR'26): Web | Code
- Air-Know (CVPR'26): Web | Code
- HABIT (AAAI'26): Web | Code | Paper
- ReTrack (AAAI'26): Web | Code | Paper
- INTENT (AAAI'26): Web | Code | Paper
- HUD (ACM MM'25): Web | Code | Paper
- OFFSET (ACM MM'25): Web | Code | Paper
We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:
- Open an Issue for discussions or bug reports.
- Submit a Pull Request to improve the codebase.
This project is released under the terms of the LICENSE file included in this repository.