FG-CLIP: Fine-Grained Visual and Textual Alignment

Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance.
Researcher Affiliation: Collaboration — ¹Beihang University, ²360 AI Research. Correspondence to: Dawei Leng <EMAIL>.
Pseudocode: No — The paper describes its approach in Section 3 and provides an overview diagram in Figure 1, but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes — The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
Open Datasets: Yes — Code: https://github.com/360CVGroup/FG-CLIP; Model: https://huggingface.co/qihoo360/fg-clip-large; Dataset: https://huggingface.co/datasets/qihoo360/FineHARD
Dataset Splits: Yes — Based on the fine-grained benchmark FG-OVD constructed by (Bianchi et al., 2024), we evaluate open-source image-text alignment models. Unlike previous benchmarks such as MSCOCO (Lin et al., 2014) and Flickr (Young et al., 2014), which rely on global information for matching, this benchmark focuses on identifying specific local regions within images. Each region has one corresponding positive description and ten negative descriptions, with the negative samples derived from the positive text. This benchmark primarily comprises four subsets of varying difficulty levels: hard, medium, easy, and trivial.
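At evaluation time, the matching protocol described above (one positive and ten negative captions per region) reduces to a ranking problem: the model should score the positive caption highest for each region. A minimal sketch of such a scorer, assuming precomputed region and caption embeddings; the function name, array shapes, and positive-at-index-0 convention are illustrative, not from the paper:

```python
import numpy as np

def fg_ovd_accuracy(region_embs, text_embs):
    """Top-1 accuracy for an FG-OVD-style benchmark.

    region_embs: (N, D) array of region features.
    text_embs:   (N, 11, D) array of caption features per region,
                 with index 0 the positive and 1..10 the negatives
                 (an illustrative convention, not the paper's format).
    """
    # L2-normalize so dot products become cosine similarities
    r = region_embs / np.linalg.norm(region_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = np.einsum("nd,nkd->nk", r, t)  # (N, 11) similarity scores
    hits = sims.argmax(axis=1) == 0       # did the positive rank first?
    return hits.mean()
```

A model succeeds on a region only when the positive caption outscores all ten hard negatives, which is what makes the hard/medium/easy/trivial subsets progressively discriminative.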
Hardware Specification: Yes — Utilizing a cluster of 160 910B NPUs, the data processing is completed in 30 days.
Software Dependencies: No — The entire training process employs DeepSpeed's ZeRO-2 optimization technique and Bfloat16 precision to accelerate training, and the model is trained for one epoch. Training acceleration techniques include DeepSpeed's ZeRO-2 optimization, CUDA's TF32 technology, and Bfloat16 precision, and the model is trained for one epoch. The paper mentions software components like DeepSpeed and CUDA but does not provide specific version numbers for them.
Experiment Setup: Yes — In the first stage... The batch size per NPU is set to 384. The learnable temperature parameter τ is initialized to 0.07. We utilize the AdamW optimizer with a learning rate of 1e-4, weight decay of 0.05, β1 of 0.9, β2 of 0.98, and warmup steps for the first 200 iterations... and the model is trained for one epoch. In the second stage... The batch size per GPU is set to 512. We employ the AdamW optimizer with a learning rate of 1e-6, weight decay of 0.001, β1 of 0.9, β2 of 0.98, and warmup steps for the first 50 iterations... and the model is trained for one epoch.
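The stage-1 hyperparameters above are complete enough to reconstruct the optimizer setup. A minimal PyTorch sketch, assuming a placeholder model and a linear warmup ramp (the paper states the warmup length but not the ramp shape, so linear is an assumption):

```python
import torch

# Placeholder for the FG-CLIP model; only the optimizer/schedule values
# below come from the paper's stage-1 settings.
model = torch.nn.Linear(512, 512)

# Learnable temperature, initialized to 0.07 as stated in the paper
logit_scale = torch.nn.Parameter(torch.tensor(0.07))

optimizer = torch.optim.AdamW(
    list(model.parameters()) + [logit_scale],
    lr=1e-4,            # stage-1 learning rate
    weight_decay=0.05,  # stage-1 weight decay
    betas=(0.9, 0.98),  # β1, β2 from the paper
)

# Warmup over the first 200 iterations (linear ramp is an assumption)
warmup_steps = 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
```

For stage 2 the same skeleton applies with lr=1e-6, weight_decay=0.001, and warmup_steps=50; both stages train for a single epoch.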