FG-CLIP: Fine-Grained Visual and Textual Alignment
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. |
| Researcher Affiliation | Collaboration | ¹Beihang University, ²360 AI Research. Correspondence to: Dawei Leng <EMAIL>. |
| Pseudocode | No | The paper describes its approach in Section 3 and provides an overview diagram in Figure 1, but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP. |
| Open Datasets | Yes | Code: https://github.com/360CVGroup/FG-CLIP Model: https://huggingface.co/qihoo360/fg-clip-large Dataset: https://huggingface.co/datasets/qihoo360/FineHARD |
| Dataset Splits | Yes | Based on the fine-grained benchmark FG-OVD constructed by (Bianchi et al., 2024), we evaluate open-source image-text alignment models. Unlike previous benchmarks such as MSCOCO (Lin et al., 2014) and Flickr (Young et al., 2014), which rely on global information for matching, this benchmark focuses on identifying specific local regions within images. Each region has one corresponding positive description and ten negative descriptions, with the negative samples derived from the positive text. This benchmark primarily comprises four subsets of varying difficulty levels: hard, medium, easy, and trivial. |
| Hardware Specification | Yes | Utilizing a cluster of 160 910B NPUs, the data processing is completed in 30 days. |
| Software Dependencies | No | Training acceleration techniques include DeepSpeed's ZeRO-2 optimization, CUDA's TF32 technology, and Bfloat16 precision, and the model is trained for one epoch per stage. The paper mentions software components like DeepSpeed and CUDA but does not provide specific version numbers for them. |
| Experiment Setup | Yes | In the first stage... The batch size per NPU is set to 384. The learnable temperature parameter τ is initialized to 0.07. We utilize the AdamW optimizer with a learning rate of 1e-4, weight decay of 0.05, β1 of 0.9, β2 of 0.98, and warmup steps for the first 200 iterations... and the model is trained for one epoch. In the second stage... The batch size per GPU is set to 512. We employ the AdamW optimizer with a learning rate of 1e-6, weight decay of 0.001, β1 of 0.9, β2 of 0.98, and warmup steps for the first 50 iterations... and the model is trained for one epoch. |
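The FG-OVD protocol cited in the Dataset Splits row (each region paired with one positive caption and ten hard negatives) can be sketched as a simple ranking check. This is a minimal illustration of the scoring rule, not the benchmark's actual evaluation code; `region_correct` and the toy scores are hypothetical.

```python
# Sketch of FG-OVD-style matching: a region is counted correct only if
# its positive caption outranks all ten hard-negative captions.
# Scores stand in for CLIP-style image-text similarities.

def region_correct(pos_score: float, neg_scores: list[float]) -> bool:
    """True iff the positive caption scores above every negative."""
    return pos_score > max(neg_scores)

def accuracy(regions: list[tuple[float, list[float]]]) -> float:
    """Fraction of regions whose positive caption wins the ranking."""
    hits = sum(region_correct(pos, negs) for pos, negs in regions)
    return hits / len(regions)

# Toy example: two regions, one matched correctly, one fooled by a
# hard negative.
regions = [
    (0.9, [0.1] * 10),          # positive wins
    (0.2, [0.3] + [0.1] * 9),   # a hard negative wins
]
```

Because the negatives are perturbations of the positive text, this ranking is much harder than global image-text retrieval, which is what makes the hard/medium/easy/trivial subsets discriminative.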
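The hyperparameters in the Experiment Setup row can be collected into a configuration sketch. This is a hedged, pure-Python illustration assuming a CLIP-style learnable temperature stored as log(1/τ) and a linear warmup (the schedule after warmup is not specified in the table); the names `STAGES` and `warmup_factor` are illustrative, not from the paper's code.

```python
import math

# Two-stage hyperparameters as reported in the table above.
STAGES = {
    "stage1": dict(batch_per_device=384, lr=1e-4, weight_decay=0.05,
                   betas=(0.9, 0.98), warmup_steps=200, epochs=1),
    "stage2": dict(batch_per_device=512, lr=1e-6, weight_decay=0.001,
                   betas=(0.9, 0.98), warmup_steps=50, epochs=1),
}

def warmup_factor(step: int, warmup_steps: int) -> float:
    """Linear warmup of the learning-rate multiplier to 1.0.
    The post-warmup decay is unspecified, so the factor is held constant."""
    return min(1.0, (step + 1) / warmup_steps)

# CLIP-style learnable temperature, initialized to 0.07; implementations
# typically optimize log(1/tau) and exponentiate it when scaling logits.
init_logit_scale = math.log(1 / 0.07)
```

With these values the stage-1 multiplier starts at 1/200 and reaches 1.0 at iteration 200, matching the "warmup steps for the first 200 iterations" description.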