iLearn-Lab/AAAI25-ENCODER

(AAAI 2025) ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

1 School of Software, Shandong University
2 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
3 School of Data Science, City University of Hong Kong
✉ Corresponding author


Accepted by AAAI 2025: A novel network designed to mine visual entities and modification actions, and bind implicit modification relations in Composed Image Retrieval (CIR).

📌 Introduction

Welcome to the official repository for ENCODER (Entity miNing and modifiCation relatiOn binDing nEtwoRk).

Existing CIR approaches often struggle to model the modification relation between visual entities and modification actions due to three main challenges: irrelevant factor perturbation, vague semantic boundaries, and implicit modification relations. ENCODER tackles these by explicitly mining entities and actions and binding them through modality-shared queries, achieving state-of-the-art (SOTA) performance across multiple datasets.

⬆ Back to top

📢 News

  • [Mar 2026] 🚀 All code has been transferred to the GitHub repo.
  • [Oct 2025] 🛠️ Based on feedback from researchers, we found that different versions of open_clip can affect model performance. To ensure consistent results, we have clarified the environment dependencies (requirements.txt).
  • [Sep 2025] 🛠️ We have updated the evaluation code and released ENCODER checkpoints in "state_dict" format for stable evaluation.
  • [Apr 2025] 🚀 We have released the full ENCODER code and checkpoints.
  • [Dec 2024] 🔥 ENCODER has been accepted by AAAI 2025.

⬆ Back to top

✨ Key Features

Our framework introduces three innovative modules to achieve precise multimodal semantic alignment:

  • 🔍 Latent Factor Filter (LFF): Filters out irrelevant visual and textual factors using a dynamic threshold gating mechanism, keeping only the latent factors highly related to the modification semantics.
  • 🔗 Entity-Action Binding (EAB): Employs modality-shared Learnable Relation Queries (LRQ) to probe semantic boundaries. It dynamically mines visual entities and modification actions, learning their implicit relations to bind them effectively.
  • 🧩 Multi-scale Composition (MSC): Guided by the entity-action binding, this module performs multi-scale feature composition to precisely push the composed feature closer to the target image.
  • 🏆 SOTA Performance: Demonstrates superior generalization and achieves remarkable improvements (e.g., +19.8% on FashionIQ-Avg R@10) across both fashion-domain and open-domain datasets.

⬆ Back to top

πŸ—οΈ Architecture

ENCODER architecture

Figure 1. The overall architecture of ENCODER. It processes the reference image and modification text through LFF, binds entities and actions via EAB, and finally aggregates features in the MSC module.

⬆ Back to top

📊 Experiment Results

ENCODER consistently outperforms existing baselines on four widely-used datasets.

1. FashionIQ & Shoes Datasets

(Evaluated using Recall@K)

FashionIQ and Shoes Results
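Recall@K here is the standard retrieval metric: the fraction of queries whose ground-truth target appears among the top-K ranked candidates. A reference implementation (our own sketch, not the repo's evaluate_model.py) is:

```python
import numpy as np

def recall_at_k(sim, target_idx, k):
    """Recall@K over a query-to-gallery similarity matrix.

    sim:        (n_queries, n_gallery) similarity scores
    target_idx: (n_queries,) index of each query's ground-truth target
    k:          cutoff rank
    """
    topk = np.argsort(-sim, axis=1)[:, :k]        # indices of the top-k candidates
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())
```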

2. CIRR Dataset

(Evaluated using R@K and R_subset@K)

CIRR Results

⬆ Back to top




🚀 Installation

1. Clone the repository

git clone https://github.com/iLearn-Lab/AAAI25-ENCODER.git
cd AAAI25-ENCODER

2. Setup Environment

We recommend using Conda to manage your environment:

conda create -n encoder_env python=3.9
conda activate encoder_env

# Install PyTorch (Ensure it matches your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install required packages
pip install -r requirements.txt

📂 Data Preparation

ENCODER is evaluated on FashionIQ, Shoes, Fashion200K, and CIRR. Please download the datasets from their official sources and arrange them as follows. (You can modify the paths in datasets.py if needed).

Shoes

Download the Shoes dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── Shoes
│   ├── captions_shoes.json
│   ├── eval_im_names.txt
│   ├── relative_captions_shoes.json
│   ├── train_im_names.txt
│   ├── [womens_athletic_shoes | womens_boots | ...]
│   │   ├── [0 | 1]
│   │   ├── [img_womens_athletic_shoes_375.jpg | descr_womens_athletic_shoes_734.txt | ...]

FashionIQ

Download the FashionIQ dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── FashionIQ
│   ├── captions
│   │   ├── cap.dress.[train | val | test].json
│   │   ├── cap.toptee.[train | val | test].json
│   │   ├── cap.shirt.[train | val | test].json
│   ├── image_splits
│   │   ├── split.dress.[train | val | test].json
│   │   ├── split.toptee.[train | val | test].json
│   │   ├── split.shirt.[train | val | test].json
│   ├── dress
│   │   ├── [B000ALGQSY.jpg | B000AY2892.jpg | B000AYI3L4.jpg | ...]
│   ├── shirt
│   │   ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]
│   ├── toptee
│   │   ├── [B0000DZQD6.jpg | B000A33FTU.jpg | B000AS2OVA.jpg | ...]

Fashion200K

Download the Fashion200K dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── Fashion200K
│   ├── test_queries.txt
│   ├── labels
│   │   ├── dress_[train | test]_detect_all.txt
│   │   ├── jacket_[train | test]_detect_all.txt
│   │   ├── pants_[train | test]_detect_all.txt
│   │   ├── skirt_[train | test]_detect_all.txt
│   │   ├── top_[train | test]_detect_all.txt
│   ├── women
│   │   ├── [dresses | jackets | pants | skirts | tops]

CIRR

Download the CIRR dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── CIRR
│   ├── train
│   │   ├── [0 | 1 | 2 | ...]
│   │   │   ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]
│   ├── dev
│   │   ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]
│   ├── test1
│   │   ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]
│   ├── cirr
│   │   ├── captions
│   │   │   ├── cap.rc2.[train | val | test1].json
│   │   ├── image_splits
│   │   │   ├── split.rc2.[train | val | test1].json

πŸƒβ€β™‚οΈ Quick Start

1. Training the Model

To train ENCODER from scratch, use the train.py script. By default, it uses the AdamW optimizer with a learning rate of 5e-5 for the main network and 1e-6 for the CLIP backbone.

python train.py \
    --dataset cirr \
    --data_path ./data/cirr \
    --batch_size 128 \
    --epochs 10 \
    --lr 5e-5 \
    --output_dir ./checkpoints/encoder_cirr
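The two learning rates correspond to two optimizer parameter groups: a small rate for the pretrained CLIP backbone and a larger one for the newly initialized modules. A minimal sketch of the grouping logic (the `clip` name prefix is our assumption, not necessarily the repo's actual parameter naming):

```python
def build_param_groups(named_params, clip_lr=1e-6, base_lr=5e-5):
    """Split parameters into CLIP-backbone vs. main-network groups so the
    pretrained backbone is fine-tuned gently while new modules train at
    the full learning rate."""
    clip_params, main_params = [], []
    for name, param in named_params:
        (clip_params if name.startswith("clip") else main_params).append(param)
    return [
        {"params": clip_params, "lr": clip_lr},   # pretrained backbone: tiny lr
        {"params": main_params, "lr": base_lr},   # new modules: larger lr
    ]
```

The resulting list can be passed directly as the first argument to `torch.optim.AdamW`.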

2. Evaluating the Model

Our checkpoints are released on Google Drive. To evaluate a trained model on the validation set, use the evaluate_model.py or test.py script:

python evaluate_model.py \
    --model_dir checkpoints/ENCODER_{Shoes,FashionIQ,Fashion200K,CIRR}.pth \
    --dataset {shoes, fashioniq, fashion200k, cirr} \
    --cirr_path "" \
    --fashioniq_path "" \
    --shoes_path "" \
    --fashion200k_path ""

3. Test for CIRR

To generate the predictions file for upload to the CIRR Evaluation Server using our model, run:

python src/cirr_test_submission.py model_path

model_path <str>: path of the ENCODER checkpoint trained on CIRR, e.g. "checkpoints/ENCODER_CIRR.pt"

🧩 Project Structure

ENCODER/
├── train.py                 # 🚂 Main training loop and optimization [Eq. 16]
├── test.py                  # 🧪 General testing & inference logic
├── evaluate_model.py        # 📊 Script to calculate R@K metrics
├── cirr_test_submission.py  # 📤 Generates JSON files for CIRR server evaluation
├── datasets.py              # 🗂️ Dataloaders for FashionIQ, CIRR, etc.
├── utils.py                 # 🛠️ Helper functions, logging, and metric tracking
├── token_wise_matching.py   # 🔗 Implementation of Entity-Action Binding & LRQ
├── model_try2.py            # 🧠 Core ENCODER network architecture (LFF, EAB, MSC)
└── requirements.txt         # 📦 Project dependencies

πŸ“ Citation

If you find this code or our paper useful for your research, please consider citing it πŸ₯°:

@inproceedings{ENCODER,
  title={Encoder: Entity mining and modification relation binding for composed image retrieval},
  author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={5},
  pages={5101--5109},
  year={2025}
}

🤝 Acknowledgements

The implementation of this project references the CLIP4Cir framework. We express our sincere gratitude for this open-source contribution!

✉️ Contact

If you have any questions, feel free to open an issue or contact us at:

🔗 Related Projects

Ecosystem & Other Works from our Team

  • TEMA (ACL'26): Web | Code
  • ConeSep (CVPR'26): Web | Code
  • Air-Know (CVPR'26): Web | Code
  • HABIT (AAAI'26): Web | Code | Paper
  • ReTrack (AAAI'26): Web | Code | Paper
  • INTENT (AAAI'26): Web | Code | Paper
  • HUD (ACM MM'25): Web | Code | Paper
  • OFFSET (ACM MM'25): Web | Code | Paper

🫑 Support & Contributing

We welcome all forms of contributions! If you have any questions, ideas, or find a bug, please feel free to:

  • Open an Issue for discussions or bug reports.
  • Submit a Pull Request to improve the codebase.

⬆ Back to top

📄 License

This project is released under the terms of the LICENSE file included in this repository.

