LR0.FM: LOW-RESOLUTION ZERO-SHOT CLASSIFICATION BENCHMARK FOR FOUNDATION MODELS

Authors: Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 foundation models (FMs) across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher-resolution models are less robust against LR.
Researcher Affiliation Academia Priyank Pathak¹, Shyam Marjit², Shruti Vyas¹ & Yogesh S Rawat¹ — ¹University of Central Florida, ²IIIT Guwahati. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes methods textually and through figures (e.g., Figure 3, Figure 10) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://ucf-crcv.github.io/lr0.fm
Open Datasets Yes Our study systematically examines the effects of resolution degradation, revealing key insights into how model size, pre-training dataset quality, and fine-tuning impact robustness in LR scenarios. ...15 diverse image classification datasets, ranging from large-scale datasets like ImageNet (Deng et al., 2009) to fine-grained and texture-specific datasets like Oxford Pets (Parkhi et al., 2012) and DTD (Cimpoi et al., 2014). ...Pre-training is image-text pairs from datasets like DataComp-1B (DC-1B) (Gadre et al., 2023), Conceptual Captions (CC) (Sharma et al., 2018), Conceptual 12M (C-12M) (Changpinyo et al., 2021).
Dataset Splits No The paper mentions the benchmark datasets used for zero-shot evaluation but does not specify explicit train/validation/test splits.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments. It only generally thanks "Steven Dick (UCF High-Performance Computing) and Rohit Gupta (UCF CRCV) for their help in generating the synthetic dataset," which is an acknowledgment of the computing environment rather than a hardware specification.
Software Dependencies No The paper refers to various foundation models and tools (e.g., CLIP (Radford et al., 2021), LLaMA (Touvron et al., 2023), PIXART (Chen et al., 2023a), Ax tool (Bakshy et al., 2018), BERT (Devlin et al., 2019)) by citing their original publications, but it does not specify the version numbers of these or any other ancillary software dependencies used in its own implementation.
Experiment Setup Yes Models are trained with 7K captions (& 30 images/caption) in a multi-scale paradigm. EVA is trained for 200 epochs, while MetaCLIP and OpenCLIP are trained for 10 epochs. ...Table 5: Ablation: EVA-B/16 trained with 7K captions and 50 images/caption. ... Not frozen means fine-tuning end-to-end... fine-tuning the last 4 blocks at 1/100 of the default learning rate...
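The paper's exact Weighted Aggregated Robustness formula is not quoted in this report. As a minimal sketch of the general idea, one plausible formulation weights each dataset's relative robustness (LR accuracy over HR accuracy) by that dataset's high-resolution accuracy, so that near-chance datasets do not inflate the aggregate; the function name, dictionary layout, and weighting scheme below are all illustrative assumptions, not the authors' definition.

```python
# Hedged sketch (NOT the paper's exact metric): a weighted aggregated
# robustness score in which each dataset's relative robustness
# (acc_lr / acc_hr) is weighted by its high-resolution accuracy.

def weighted_aggregated_robustness(acc_hr, acc_lr):
    """acc_hr, acc_lr: dict mapping dataset name -> top-1 accuracy in [0, 1]."""
    total_weight = sum(acc_hr.values())
    score = 0.0
    for d, hr in acc_hr.items():
        rel = acc_lr[d] / hr if hr > 0 else 0.0  # relative robustness on d
        score += hr * rel                        # weight by HR accuracy
    return score / total_weight

# Tiny usage example with made-up accuracies
acc_hr = {"imagenet": 0.75, "pets": 0.90, "dtd": 0.55}
acc_lr = {"imagenet": 0.60, "pets": 0.81, "dtd": 0.30}
print(round(weighted_aggregated_robustness(acc_hr, acc_lr), 3))  # -> 0.777
```

Note that with this particular weighting the score algebraically reduces to total LR accuracy over total HR accuracy (a micro-average), which is exactly what keeps a dataset with near-zero HR accuracy from dominating the aggregate the way an unweighted mean of ratios would.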
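The setup row above mentions fine-tuning only the last 4 transformer blocks at 1/100 of the default learning rate. A minimal framework-agnostic sketch of that parameter-grouping logic follows; the `blocks.<i>.` naming convention, `BASE_LR` value, and helper name are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch: select the last `tune_last` of `num_blocks`
# transformer blocks for fine-tuning at base_lr / 100, freezing the rest,
# mirroring the ablation setup described in the report.

BASE_LR = 1e-4  # assumed default learning rate

def build_param_groups(named_params, num_blocks=12, tune_last=4, base_lr=BASE_LR):
    """Split (name, param) pairs into tuned param groups and a frozen list.

    Parameters named "blocks.<i>..." with i in the last `tune_last` blocks
    are tuned at base_lr / 100; all other parameters are frozen.
    """
    tuned, frozen = [], []
    cutoff = num_blocks - tune_last
    for name, param in named_params:
        if name.startswith("blocks."):
            # e.g. "blocks.10.attn.weight" -> block index 10
            idx = int(name.split(".")[1])
            if idx >= cutoff:
                tuned.append(param)
                continue
        frozen.append(param)
    return [{"params": tuned, "lr": base_lr / 100}], frozen

# Usage with dummy parameter names for a 12-block backbone plus a head
names = [f"blocks.{i}.attn.weight" for i in range(12)] + ["head.weight"]
groups, frozen = build_param_groups((n, n) for n in names)
print(len(groups[0]["params"]), len(frozen))  # -> 4 9
```

In a real training loop the returned groups would be handed to the optimizer as per-parameter-group learning rates, while the frozen list would have gradients disabled.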