FG-CLIP: Fine-Grained Visual and Textual Alignment
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. |
| Researcher Affiliation | Collaboration | ¹Beihang University, ²360 AI Research. Correspondence to: Dawei Leng <EMAIL>. |
| Pseudocode | No | The paper describes its approach in Section 3 and provides an overview diagram in Figure 1, but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP. |
| Open Datasets | Yes | Code: https://github.com/360CVGroup/FG-CLIP Model: https://huggingface.co/qihoo360/fg-clip-large Dataset: https://huggingface.co/datasets/qihoo360/FineHARD |
| Dataset Splits | Yes | Based on the fine-grained benchmark FG-OVD constructed by (Bianchi et al., 2024), we evaluate open-source image-text alignment models. Unlike previous benchmarks such as MSCOCO (Lin et al., 2014) and Flickr (Young et al., 2014), which rely on global information for matching, this benchmark focuses on identifying specific local regions within images. Each region has one corresponding positive description and ten negative descriptions, with the negative samples derived from the positive text. This benchmark primarily comprises four subsets of varying difficulty levels: hard, medium, easy, and trivial. |
| Hardware Specification | Yes | Utilizing a cluster of 160 910B NPUs, the data processing is completed in 30 days. |
| Software Dependencies | No | Training acceleration techniques include DeepSpeed's ZeRO-2 optimization, CUDA's TF32 technology, and Bfloat16 precision, and the model is trained for one epoch per stage. The paper mentions software components like DeepSpeed and CUDA but does not provide specific version numbers for them. |
| Experiment Setup | Yes | In the first stage... The batch size per NPU is set to 384. The learnable temperature parameter τ is initialized to 0.07. We utilize the AdamW optimizer with a learning rate of 1e-4, weight decay of 0.05, β1 of 0.9, β2 of 0.98, and warmup steps for the first 200 iterations... and the model is trained for one epoch. In the second stage... The batch size per GPU is set to 512. We employ the AdamW optimizer with a learning rate of 1e-6, weight decay of 0.001, β1 of 0.9, β2 of 0.98, and warmup steps for the first 50 iterations... and the model is trained for one epoch. |
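The FG-OVD protocol cited in the Dataset Splits row (each region paired with one positive caption and ten hard negatives) can be sketched as a simple ranking check. This is a minimal illustration of the scoring rule, not the benchmark's actual evaluation code; `region_correct` and the toy scores are hypothetical.

```python
# Sketch of FG-OVD-style matching: a region is counted correct only if
# its positive caption outranks all ten hard-negative captions.
# Scores stand in for CLIP-style image-text similarities.

def region_correct(pos_score: float, neg_scores: list[float]) -> bool:
    """True iff the positive caption scores above every negative."""
    return pos_score > max(neg_scores)

def accuracy(regions: list[tuple[float, list[float]]]) -> float:
    """Fraction of regions whose positive caption wins the ranking."""
    hits = sum(region_correct(pos, negs) for pos, negs in regions)
    return hits / len(regions)

# Toy example: two regions, one matched correctly, one fooled by a
# hard negative.
regions = [
    (0.9, [0.1] * 10),          # positive wins
    (0.2, [0.3] + [0.1] * 9),   # a hard negative wins
]
```

Because the negatives are perturbations of the positive text, this ranking is much harder than global image-text retrieval, which is what makes the hard/medium/easy/trivial subsets discriminative.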
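The hyperparameters in the Experiment Setup row can be collected into a configuration sketch. This is a hedged, pure-Python illustration assuming a CLIP-style learnable temperature stored as log(1/τ) and a linear warmup (the schedule after warmup is not specified in the table); the names `STAGES` and `warmup_factor` are illustrative, not from the paper's code.

```python
import math

# Two-stage hyperparameters as reported in the table above.
STAGES = {
    "stage1": dict(batch_per_device=384, lr=1e-4, weight_decay=0.05,
                   betas=(0.9, 0.98), warmup_steps=200, epochs=1),
    "stage2": dict(batch_per_device=512, lr=1e-6, weight_decay=0.001,
                   betas=(0.9, 0.98), warmup_steps=50, epochs=1),
}

def warmup_factor(step: int, warmup_steps: int) -> float:
    """Linear warmup of the learning-rate multiplier to 1.0.
    The post-warmup decay is unspecified, so the factor is held constant."""
    return min(1.0, (step + 1) / warmup_steps)

# CLIP-style learnable temperature, initialized to 0.07; implementations
# typically optimize log(1/tau) and exponentiate it when scaling logits.
init_logit_scale = math.log(1 / 0.07)
```

With these values the stage-1 multiplier starts at 1/200 and reaches 1.0 at iteration 200, matching the "warmup steps for the first 200 iterations" description.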