1School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
2School of Computer Science, University of Sydney, Sydney, Australia
3Security Department, Alibaba Group, Hangzhou, China
†Corresponding authors
Multimodal Large Language Models (MLLMs) have recently attracted substantial interest, demonstrating their emerging potential as general-purpose models for various vision-language tasks. MLLMs encode significant external knowledge within their parameters; however, continually updating these models with the latest knowledge is challenging, as retraining involves huge computational costs and offers poor interpretability. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs. We first leverage the question to instruct the extraction of visual information through interactions with a set of learnable queries, minimizing irrelevant interference and redundancy during retrieval and generation. In addition, we introduce a pre-trained multimodal adaptive fusion module that achieves question-text-to-multimodal retrieval and integration of multimodal knowledge by projecting the visual and language modalities into a unified semantic space. Furthermore, we present an Adaptive Selection Knowledge Generation (ASKG) strategy that trains the generator to autonomously discern the relevance of retrieved knowledge, yielding strong denoising performance. Extensive experiments on open multimodal question-answering datasets demonstrate that RA-BLIP achieves strong performance and surpasses state-of-the-art retrieval-augmented models.
### Overview

RA-BLIP consists of three key components:

1. **Query-Instructed Visual Extraction**: We leverage the input question to guide the extraction of relevant visual features using learnable queries. This drastically reduces visual redundancy and irrelevant interference.
2. **Multimodal Adaptive Fusion**: A pre-trained fusion module projects both visual and textual modalities into a unified semantic space, enabling seamless text-to-multimodal retrieval.
3. **Adaptive Selection Knowledge Generation (ASKG)**: The generator is trained to autonomously evaluate and filter retrieved knowledge, providing highly effective denoising before the final answer generation.
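To make the first component concrete, the sketch below illustrates the query-instructed extraction idea with a minimal NumPy cross-attention: a small set of learnable queries, conditioned on the question embedding, attends over the visual patch features and returns a compact set of question-relevant tokens. This is a simplified illustration, not the actual RA-BLIP module (which interleaves self- and cross-attention layers); all shapes and the additive conditioning scheme are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_instructed_extraction(queries, question_emb, visual_feats):
    """Cross-attention sketch: learnable queries, conditioned on the
    pooled question embedding, attend over visual patch features and
    return a small fixed-size set of question-relevant visual tokens."""
    d = queries.shape[-1]
    # Condition each learnable query on the question (simple additive
    # scheme, assumed here for illustration).
    q = queries + question_emb                        # (num_queries, d)
    attn = softmax(q @ visual_feats.T / np.sqrt(d))   # (num_queries, num_patches)
    return attn @ visual_feats                        # (num_queries, d)

rng = np.random.default_rng(0)
num_queries, num_patches, d = 32, 256, 64
queries = rng.normal(size=(num_queries, d))   # learnable query tokens
question = rng.normal(size=(d,))              # pooled question embedding
patches = rng.normal(size=(num_patches, d))   # visual patch features
extracted = query_instructed_extraction(queries, question, patches)
print(extracted.shape)  # (32, 64)
```

Note how the output is a fixed 32 tokens regardless of the number of patches, which is what keeps redundant visual content out of retrieval and generation.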
We recommend setting up the environment using conda:
```shell
# Create conda environment
conda create -n rablip python=3.9
conda activate rablip

# Install PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install required packages
git clone https://github.com/YourOrg/RA-BLIP.git
cd RA-BLIP
pip install -r requirements.txt
```
### Data Preparation
Download the pre-training datasets and downstream Multimodal QA datasets (e.g., OK-VQA, A-OKVQA). Organize them in the `./data` directory as follows:
```text
RA-BLIP/
├── data/
│   ├── pretrain/
│   ├── okvqa/
│   └── aokvqa/
```

Download the base MLLM weights and the pre-trained RA-BLIP retrieval/fusion modules from our Huggingface repository and place them in the `./checkpoints` folder.
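Before launching training, it can help to sanity-check that the datasets were unpacked into the expected locations. The snippet below is a small, optional helper (not part of the released codebase) that checks for the directories shown above; extend `EXPECTED_DIRS` if you add more datasets.

```python
import os

# Expected dataset layout relative to the repository root
# (matches the tree shown above; adjust as needed).
EXPECTED_DIRS = [
    "data/pretrain",
    "data/okvqa",
    "data/aokvqa",
]

def check_data_layout(root="."):
    """Return the list of expected data directories missing under `root`."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]

if __name__ == "__main__":
    missing = check_data_layout()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data layout looks good.")
```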
To run the Bootstrapping Language-Image Pre-training with retrieval augmentation:

```shell
python run_pretrain.py \
    --config ./configs/pretrain.yaml \
    --output_dir ./output/pretrain
```

Extensive experiments demonstrate that RA-BLIP achieves state-of-the-art performance on open multimodal question-answering datasets, outperforming existing retrieval-augmented baseline models.
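The ASKG strategy trains the generator to first judge which retrieved passages are relevant before answering. As a purely illustrative sketch of how retrieved knowledge might be serialized into a generator input with index tags (the function name, question, passages, and prompt template here are all hypothetical; the paper's actual format may differ):

```python
def build_askg_input(question, passages):
    """Serialize retrieved passages with index tags so the generator can
    first cite the relevant ones, then answer (illustrative format only)."""
    lines = [f"Question: {question}", "Retrieved knowledge:"]
    for i, passage in enumerate(passages, 1):
        lines.append(f"[{i}] {passage}")
    lines.append("First list the relevant passage indices, then answer.")
    return "\n".join(lines)

prompt = build_askg_input(
    "What material is this statue made of?",
    [
        "The Statue of Liberty's exterior is made of copper.",
        "Copper develops a green patina over time.",
        "New York City hosts many museums.",  # an irrelevant distractor
    ],
)
print(prompt)
```

Training the generator on such inputs, with targets that mark only the relevant indices, is what lets it learn to ignore noisy retrievals like passage [3] above.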
Table 1: Performance comparison of RA-BLIP against state-of-the-art retrieval-augmented models.
If you find this work useful for your research, please cite our TMM 2025 paper:
```bibtex
@article{rablip2025,
  title={RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training},
  author={Ding, Muhe and Ma, Yang and Qin, Pengda and Wu, Jianlong and Li, Yuhong and Nie, Liqiang},
  journal={IEEE Transactions on Multimedia (TMM)},
  year={2025},
  url={https://ieeexplore.ieee.org/document/10844992}
}
```

