LR0.FM: LOW-RESOLUTION ZERO-SHOT CLASSIFICATION BENCHMARK FOR FOUNDATION MODELS

Authors: Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 foundation models (FMs) across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR.
Researcher Affiliation Academia Priyank Pathak¹, Shyam Marjit², Shruti Vyas¹ & Yogesh S Rawat¹ — ¹University of Central Florida, ²IIIT Guwahati. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes methods textually and through figures (e.g., Figure 3, Figure 10) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://ucf-crcv.github.io/lr0.fm
Open Datasets Yes Our study systematically examines the effects of resolution degradation, revealing key insights into how model size, pre-training dataset quality, and fine-tuning impact robustness in LR scenarios. ...15 diverse image classification datasets, ranging from large-scale datasets like ImageNet (Deng et al., 2009) to fine-grained and texture-specific datasets like Oxford Pets (Parkhi et al., 2012) and DTD (Cimpoi et al., 2014). ...Pre-training is image-text pairs from datasets like DataComp-1B (DC-1B) (Gadre et al., 2023), Conceptual Captions (CC) (Sharma et al., 2018), Conceptual 12M (C-12M) (Changpinyo et al., 2021).
Dataset Splits No The paper mentions the benchmark datasets used for zero-shot evaluation but does not specify explicit train/validation/test splits.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments. It only generally thanks "Steven Dick (UCF High-Performance Computing) and Rohit Gupta (UCF CRCV) for their help in generating the synthetic dataset," which is an acknowledgment of the computing environment rather than a hardware specification.
Software Dependencies No The paper refers to various foundation models and tools (e.g., CLIP (Radford et al., 2021), LLaMA (Touvron et al., 2023), PIXART (Chen et al., 2023a), Ax tool (Bakshy et al., 2018), BERT (Devlin et al., 2019)) by citing their original publications, but it does not specify the version numbers of these or any other ancillary software dependencies used in its own implementation.
Experiment Setup Yes Models are trained with 7K captions (& 30 images/caption) in a multi-scale paradigm. EVA is trained for 200 epochs, while MetaCLIP and OpenCLIP are trained for 10 epochs. ...Table 5: Ablation: EVA-B/16 trained with 7K captions and 50 images/caption. ... Not frozen means fine-tuning end-to-end... fine-tuning the last 4 blocks at 1/100 of the default learning rate...
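The paper's exact Weighted Aggregated Robustness formula is not quoted in this report. As a minimal sketch of the general idea, one plausible formulation weights each dataset's relative robustness (LR accuracy over HR accuracy) by that dataset's high-resolution accuracy, so that near-chance datasets do not inflate the aggregate; the function name, dictionary layout, and weighting scheme below are all illustrative assumptions, not the authors' definition.

```python
# Hedged sketch (NOT the paper's exact metric): a weighted aggregated
# robustness score in which each dataset's relative robustness
# (acc_lr / acc_hr) is weighted by its high-resolution accuracy.

def weighted_aggregated_robustness(acc_hr, acc_lr):
    """acc_hr, acc_lr: dict mapping dataset name -> top-1 accuracy in [0, 1]."""
    total_weight = sum(acc_hr.values())
    score = 0.0
    for d, hr in acc_hr.items():
        rel = acc_lr[d] / hr if hr > 0 else 0.0  # relative robustness on d
        score += hr * rel                        # weight by HR accuracy
    return score / total_weight

# Tiny usage example with made-up accuracies
acc_hr = {"imagenet": 0.75, "pets": 0.90, "dtd": 0.55}
acc_lr = {"imagenet": 0.60, "pets": 0.81, "dtd": 0.30}
print(round(weighted_aggregated_robustness(acc_hr, acc_lr), 3))  # -> 0.777
```

Note that with this particular weighting the score algebraically reduces to total LR accuracy over total HR accuracy (a micro-average), which is exactly what keeps a dataset with near-zero HR accuracy from dominating the aggregate the way an unweighted mean of ratios would.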
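The setup row above mentions fine-tuning only the last 4 transformer blocks at 1/100 of the default learning rate. A minimal framework-agnostic sketch of that parameter-grouping logic follows; the `blocks.<i>.` naming convention, `BASE_LR` value, and helper name are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch: select the last `tune_last` of `num_blocks`
# transformer blocks for fine-tuning at base_lr / 100, freezing the rest,
# mirroring the ablation setup described in the report.

BASE_LR = 1e-4  # assumed default learning rate

def build_param_groups(named_params, num_blocks=12, tune_last=4, base_lr=BASE_LR):
    """Split (name, param) pairs into tuned param groups and a frozen list.

    Parameters named "blocks.<i>..." with i in the last `tune_last` blocks
    are tuned at base_lr / 100; all other parameters are frozen.
    """
    tuned, frozen = [], []
    cutoff = num_blocks - tune_last
    for name, param in named_params:
        if name.startswith("blocks."):
            # e.g. "blocks.10.attn.weight" -> block index 10
            idx = int(name.split(".")[1])
            if idx >= cutoff:
                tuned.append(param)
                continue
        frozen.append(param)
    return [{"params": tuned, "lr": base_lr / 100}], frozen

# Usage with dummy parameter names for a 12-block backbone plus a head
names = [f"blocks.{i}.attn.weight" for i in range(12)] + ["head.weight"]
groups, frozen = build_param_groups((n, n) for n in names)
print(len(groups[0]["params"]), len(frozen))  # -> 4 9
```

In a real training loop the returned groups would be handed to the optimizer as per-parameter-group learning rates, while the frozen list would have gradients disabled.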