A Generalization Theory for Zero-Shot Prediction
Authors: Ronak Mehta, Zaid Harchaoui
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the first experiment, we create a simulated setting in which the residual dependence can be controlled and investigate whether it is indeed a determining factor for the empirical performance of CLIP (Radford et al., 2021) and VICReg (Bardes et al., 2022) models in practice. In the second experiment, we solve an image classification task in which the images have both captions and labels... To understand the dependence on ρY,Z at a sample level, we explore how downstream performance scales with the number of prompts M... |
| Researcher Affiliation | Academia | Department of Statistics, University of Washington, Seattle. Correspondence to: Ronak Mehta <EMAIL>. |
| Pseudocode | No | The paper presents theoretical frameworks and mathematical derivations, but it does not include any clearly labeled pseudocode or algorithm blocks. Methods are described in text and equations. |
| Open Source Code | Yes | Appx. F contains further details of the experiments, and code for reproduction can be found at github.com/ronakdm/zeroshot. |
| Open Datasets | Yes | Our evaluation datasets include five standard benchmarks: the Describable Textures Dataset or DTD (Cimpoi et al., 2014), Flowers 102 (Nilsback and Zisserman, 2008), FGVC Aircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), and ImageNet-1k (Deng et al., 2009). For some experiments, we make use of the ImageNet-Captions dataset (Fang et al., 2023), which pairs a subset of ImageNet images collected from Flickr with their original captions. |
| Dataset Splits | Yes | For the experiment behind Fig. 3, we design three in-distribution sub-tasks by randomly selecting collections of 50 classes (Y_1, Y_2, Y_3) from a set of 998 classes, reserving held-out prompting examples (Z_1, Y_1), ..., (Z_15000, Y_15000), 100 for each of the 150 classes. ... Using an evaluation set of approximately 25,000 examples from each sub-task, we compute the classification accuracy of this approach. |
| Hardware Specification | Yes | Experiments were run on a CPU/GPU workstation with 12 virtual cores, 126G of memory, and four NVIDIA TITAN Xp GPUs with 12G memory each. |
| Software Dependencies | No | The code was written in Python 3.10 with the environment given by the YAML file in the supplement. The OpenCLIP and CLIP Benchmark repositories were either used directly or adapted in our codebase. (Only Python 3.10 has a specific version listed, which falls short of naming multiple key software components with their versions.) |
| Experiment Setup | Yes | For the purpose of generation, we used a top-p hyperparameter of 0.9 and temperature hyperparameter of 0.99 for more diverse responses. ... Each model was trained for 30 epochs with the AdamW optimizer at a learning rate of 0.01. In the case of the VICReg objective, we used the parameterization of the original paper (Bardes et al., 2022) with the settings (γ, λ, µ, ν, ϵ) = (1, 25, 25, 1, 0.0001)... The encoder had a single hidden layer of 16 units and an output dimension of d. |
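The experiment-setup row above can be sketched in PyTorch. This is a minimal illustration, not the authors' code (which is at github.com/ronakdm/zeroshot): the input dimension `in_dim` and the value of the output dimension `d` are assumptions, since the paper only specifies a single hidden layer of 16 units, an output dimension of d, the AdamW optimizer at learning rate 0.01, and the quoted VICReg coefficients.

```python
import torch
from torch import nn

# Illustrative dimensions -- the paper does not fix these values.
in_dim, d = 32, 8

# Encoder as described: one hidden layer of 16 units, output dimension d.
encoder = nn.Sequential(
    nn.Linear(in_dim, 16),
    nn.ReLU(),
    nn.Linear(16, d),
)

# AdamW at learning rate 0.01, as reported in the setup.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=0.01)

# VICReg coefficients quoted from the paper's parameterization.
gamma, lam, mu, nu, eps = 1, 25, 25, 1, 1e-4
```

Only the stated hyperparameters are taken from the paper; the activation function and batch handling are unstated and chosen here for concreteness.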