Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning
Authors: Ruisu Zhang, Yicong Chen, Kangwook Lee
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP's counting capabilities. |
| Researcher Affiliation | Academia | Ruisu Zhang EMAIL Department of Electrical and Computer Engineering University of Wisconsin-Madison; Yicong Chen EMAIL Department of Electrical and Computer Engineering University of Wisconsin-Madison; Kangwook Lee EMAIL Department of Electrical and Computer Engineering University of Wisconsin-Madison |
| Pseudocode | No | The paper describes methodologies using natural language and mathematical equations but does not present any distinct blocks labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is available at https://github.com/UW-Madison-Lee-Lab/CLIP_Counting. |
| Open Datasets | Yes | To rigorously evaluate the counting capabilities of the CLIP model and the effectiveness of our methods, we have developed two datasets in addition to using the existing benchmark CountBench (Paiss et al., 2023). CountBench is an object counting dataset, collected from the LAION-400M dataset (Schuhmann et al., 2021). It comprises 540 images in total, with each numerical count represented by 60 respective images of different types of objects. ... The first new benchmark DiverseCount consists of images automatically sourced from multiple sources, including the COCO Dataset (Lin et al., 2014), Conceptual 12M (Changpinyo et al., 2021), YFCC100M (Thomee et al., 2016), and SBU Captions Dataset (Ordonez et al., 2011). ... We track performance across epochs by saving checkpoints and selecting the one with the lowest validation loss for final evaluations. We first evaluate each model on the DiverseCount test set, followed by CountBench to measure generalization across data distributions. To evaluate the impact of fine-tuning on unrelated tasks, we test the models on CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), Caltech101 (Fei-Fei et al., 2004), EuroSAT (Helber et al., 2018; 2019), and Food101 (Bossard et al., 2014). |
| Dataset Splits | Yes | We divide our new dataset, DiverseCount, into training, validation, and test sets in a 6:2:2 ratio. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU models, or detailed cloud instance specifications) for running its experiments. |
| Software Dependencies | No | The paper mentions using the Stable Diffusion model stable-diffusion-v1-4, sourced from Hugging Face, the YOLOv9 model (Wang & Liao, 2024) for object detection, and ChatGPT (OpenAI, 2024), specifically gpt-4-turbo-2024-04-09. However, it does not provide specific version numbers for common software libraries or frameworks like Python, PyTorch, or TensorFlow, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | For training the counting vectors, we set a higher learning rate of 10⁻³. We use lower learning rates of 10⁻⁴ for fine-tuning the text projection layer and 10⁻⁵ for the entire text model to minimize overfitting. We run three experiments with different random seeds and report the average test set scores to ensure robustness. We track performance across epochs by saving checkpoints and selecting the one with the lowest validation loss for final evaluations. |
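The 6:2:2 split reported for DiverseCount can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name `split_622` and the fixed seed are assumptions for the example.

```python
import random

def split_622(items, seed=0):
    """Shuffle a list of examples and split it into
    train/val/test subsets in a 6:2:2 ratio (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * 0.6)
    n_val = int(len(items) * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_622(range(100))
```

With 100 examples this yields subsets of 60, 20, and 20 items.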
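The model-selection protocol quoted above (pick the checkpoint with the lowest validation loss, then average test scores over three seeds) reduces to two small functions. This is an illustrative sketch, not the authors' implementation; both function names are assumptions.

```python
def select_best_checkpoint(val_losses):
    """Return the epoch index whose validation loss is lowest,
    i.e. the checkpoint kept for final evaluation (hypothetical helper)."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

def average_over_seeds(test_scores):
    """Average test-set scores across runs with different random seeds
    (hypothetical helper)."""
    return sum(test_scores) / len(test_scores)

# Example: epoch 1 has the lowest validation loss of the three checkpoints.
best = select_best_checkpoint([0.92, 0.48, 0.63])
```

The per-component learning rates (10⁻³ for counting vectors, 10⁻⁴ for the text projection layer, 10⁻⁵ for the full text model) would typically be expressed as separate optimizer parameter groups in a framework such as PyTorch.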