VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li¹˒², Qiangchang Wang¹\*, Xianjing Meng¹, Zhibin Wu¹, Yilong Yin¹\*

¹School of Software, Shandong University    ²Shenzhen Loop Area Institute

\*Corresponding author

NeurIPS 2025

This repository is the official implementation of the NeurIPS 2025 paper "VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning". [arXiv] [paper]

We propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment mechanism.

Poster

Standard Few-Shot Classification Results

| Dataset | 1-Shot 5-Way | 5-Shot 5-Way |
| --- | --- | --- |
| miniImageNet | 83.66 ± 0.31 | 88.38 ± 0.25 |
| CIFAR-FS | 88.67 ± 0.33 | 91.45 ± 0.46 |
| tieredImageNet | 88.02 ± 0.34 | 91.97 ± 0.27 |
| FC100 | 57.99 ± 0.40 | 67.68 ± 0.38 |

Fine-grained Few-Shot Classification Results

| Dataset | 1-Shot 5-Way | 5-Shot 5-Way |
| --- | --- | --- |
| CUB | 91.08 ± 0.28 | 94.63 ± 0.19 |
| Dogs | 86.58 ± 0.30 | 90.69 ± 0.25 |
| Cars | 92.95 ± 0.24 | 96.62 ± 0.15 |

Cross-Domain Few-Shot Classification Results

| Dataset | 1-Shot 5-Way | 5-Shot 5-Way |
| --- | --- | --- |
| CUB | 66.86 ± 0.47 | 81.02 ± 0.36 |
| Places | 73.68 ± 0.41 | 81.52 ± 0.33 |
| Plantae | 45.90 ± 0.40 | 61.54 ± 0.38 |

Usage

Requirements

pip install -r requirements.txt

Datasets

  • Download link: Hugging Face or Google Cloud
  • Download the dataset you need, place the xxx.tar.gz archive in the ./dataset directory, and extract it:
cd ./dataset
tar -xvzf xxx.tar.gz

Synthetic Images

  • Download link: Hugging Face or Google Cloud
  • Download the directory you need and place it in the ./data directory.

Reproducing Results with Pretrained Checkpoints

To directly reproduce the results reported in the paper using our trained models:

  1. Download checkpoints
  • Download link: Hugging Face or Google Cloud
  • Download the checkpoints you need and place them in the ./checkpoints directory.
  2. Run inference. Run the evaluation script with the desired settings. For example, to evaluate on the miniImageNet dataset with a 5-way 5-shot configuration:
python test.py \
    --dataset miniImageNet \
    --way 5 \
    --shot 5 \
    --episode 2000 \
    --image_size 224 \
    --gpu 0

This will evaluate the pretrained model on 2000 few-shot episodes using the specified configuration.

  3. Expected results. You should observe performance consistent with the results reported in our paper. Small variations may arise from episode-sampling randomness; we recommend fixing the random seed or averaging over multiple runs.
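The `mean ± interval` numbers in the result tables above are the standard few-shot report: average accuracy over the evaluated episodes with a 95% confidence half-width. As a minimal sketch (not the repo's own evaluation code; the function name and the toy accuracies are illustrative), the interval is typically computed like this:

```python
import math

def mean_and_ci95(acc_list):
    """Mean accuracy and 95% confidence half-width over few-shot episodes.

    Uses the normal approximation: half-width = 1.96 * std / sqrt(n_episodes).
    """
    n = len(acc_list)
    mean = sum(acc_list) / n
    var = sum((a - mean) ** 2 for a in acc_list) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)
    return mean, ci95

# Toy per-episode accuracies (illustrative values, not real results):
accs = [88.0, 90.0, 87.0, 89.0]
m, ci = mean_and_ci95(accs)
print(f"{m:.2f} ± {ci:.2f}")  # → 88.50 ± 1.27
```

With 2000 episodes, as in the `test.py` command above, the `sqrt(n)` factor is what shrinks the interval to the ±0.2–0.5 range seen in the tables.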

Coming Soon

We are actively working on releasing more pretrained weights across additional datasets (e.g., tieredImageNet, CUB, FC100), as well as the generated class descriptions and synthetic images used in VT-FSL. These resources will be made publicly available to further support reproducibility and research on multimodal few-shot learning. Stay tuned for updates!

Training from Scratch

If you prefer to train the model from scratch instead of using the provided pretrained weights, follow the two-stage training process described below. We provide example scripts for the miniImageNet dataset.

  1. Pre-train the feature extractor. Run the following command to pre-train the visual backbone on the base split of miniImageNet:
python pretrain.py \
    --dataset miniImageNet \
    --batch_size 512 \
    --image_size 224 \
    --backbone visformer-t \
    --lr 5e-4 \
    --epoch 800 \
    --gpu 0
  2. Meta-tune with VT-FSL. After pretraining, meta-tune the model for few-shot learning using the episodic training strategy.
  • For 5-way 1-shot setting:
python train.py \
    --dataset miniImageNet \
    --way 5 \
    --shot 1 \
    --image_size 224 \
    --backbone visformer-t \
    --lr 5e-4 \
    --epoch 100 \
    --t 0.2 \
    --gpu 0
  • For 5-way 5-shot setting:
python train.py \
    --dataset miniImageNet \
    --way 5 \
    --shot 5 \
    --image_size 224 \
    --backbone visformer-t \
    --lr 5e-4 \
    --epoch 100 \
    --t 0.2 \
    --gpu 0

After training, the model checkpoints will be automatically saved for evaluation.
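For readers new to episodic training: each training step draws an N-way K-shot "episode" that mirrors the test-time task. The sketch below shows the standard sampling scheme under assumed names (`sample_episode`, `labels_to_images`, and the query-set size are illustrative, not this repo's API):

```python
import random

def sample_episode(labels_to_images, way=5, shot=1, query=15, rng=None):
    """Sample one N-way K-shot episode, the unit of episodic meta-tuning.

    `labels_to_images` maps each class name to a list of image ids.
    Returns (support, queries) as lists of (image_id, episode_label) pairs.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(labels_to_images), way)  # pick N classes
    support, queries = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(labels_to_images[cls], shot + query)
        support += [(img, label) for img in picks[:shot]]   # K labeled examples
        queries += [(img, label) for img in picks[shot:]]   # examples to classify
    return support, queries

# Toy usage: 3 classes of 20 images, sampled as a 2-way 1-shot episode
data = {f"class{c}": [f"c{c}_img{i}" for i in range(20)] for c in range(3)}
s, q = sample_episode(data, way=2, shot=1, query=3, rng=random.Random(0))
print(len(s), len(q))  # 2 support examples, 6 query examples
```

The `--way` and `--shot` flags in the commands above control exactly these two parameters, which is why the 1-shot and 5-shot settings are trained with separate runs.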

Citation

If you find this repo helpful in your research or work, please cite the following paper.

@article{li2025vt,
  title={VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning},
  author={Li, Wenhao and Wang, Qiangchang and Meng, Xianjing and Wu, Zhibin and Yin, Yilong},
  journal={arXiv preprint arXiv:2509.25033},
  year={2025}
}
