A Generalization Theory for Zero-Shot Prediction
Authors: Ronak Mehta, Zaid Harchaoui
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the first experiment, we create a simulated setting in which the residual dependence can be controlled and investigate whether it is indeed a determining factor for the empirical performance of CLIP (Radford et al., 2021) and VICReg (Bardes et al., 2022) models in practice. In the second experiment, we solve an image classification task in which the images have both captions and labels... To understand the dependence on ρY,Z at a sample level, we explore how downstream performance scales with the number of prompts M... |
| Researcher Affiliation | Academia | Department of Statistics, University of Washington, Seattle. Correspondence to: Ronak Mehta <EMAIL>. |
| Pseudocode | No | The paper presents theoretical frameworks and mathematical derivations, but it does not include any clearly labeled pseudocode or algorithm blocks. Methods are described in text and equations. |
| Open Source Code | Yes | Appx. F contains further details of the experiments, and code for reproduction can be found at github.com/ronakdm/zeroshot. |
| Open Datasets | Yes | Our evaluation datasets include five standard benchmarks: the Describable Textures Dataset or DTD (Cimpoi et al., 2014), Flowers 102 (Nilsback and Zisserman, 2008), FGVC Aircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), and ImageNet-1k (Deng et al., 2009). For some experiments, we make use of the ImageNet-Captions dataset (Fang et al., 2023), which pairs a subset of ImageNet images collected from Flickr with their original captions. |
| Dataset Splits | Yes | For the experiment behind Fig. 3, we design three in-distribution sub-tasks by randomly selecting collections of 50 classes (Y_1, Y_2, Y_3) from a set of 998 classes, reserving held-out prompting examples (Z_1, Y_1), ..., (Z_15000, Y_15000), 100 for each of the 150 classes. ... Using an evaluation set of approximately 25,000 examples from each sub-task, we compute the classification accuracy of this approach. |
| Hardware Specification | Yes | Experiments were run on a CPU/GPU workstation with 12 virtual cores, 126G of memory, and four NVIDIA TITAN Xp GPUs with 12G memory each. |
| Software Dependencies | No | The code was written in Python 3.10 with the environment given by the YAML file in the supplement. The OpenCLIP and CLIP Benchmark repositories were either used directly or adapted in our codebase. (Only Python 3.10 has a specific version listed, which falls short of naming multiple key software components with their versions.) |
| Experiment Setup | Yes | For the purpose of generation, we used a top-p hyperparameter of 0.9 and temperature hyperparameter of 0.99 for more diverse responses. ... Each model was trained for 30 epochs with the AdamW optimizer at a learning rate of 0.01. In the case of the VICReg objective, we used the parameterization of the original paper (Bardes et al., 2022) with the settings (γ, λ, µ, ν, ϵ) = (1, 25, 25, 1, 0.0001)... The encoder had a single hidden layer of 16 units and an output dimension of d. |
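The experiment-setup row above can be sketched in PyTorch. This is a minimal illustration, not the authors' code (which is at github.com/ronakdm/zeroshot): the input dimension `in_dim` and the value of the output dimension `d` are assumptions, since the paper only specifies a single hidden layer of 16 units, an output dimension of d, the AdamW optimizer at learning rate 0.01, and the quoted VICReg coefficients.

```python
import torch
from torch import nn

# Illustrative dimensions -- the paper does not fix these values.
in_dim, d = 32, 8

# Encoder as described: one hidden layer of 16 units, output dimension d.
encoder = nn.Sequential(
    nn.Linear(in_dim, 16),
    nn.ReLU(),
    nn.Linear(16, d),
)

# AdamW at learning rate 0.01, as reported in the setup.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=0.01)

# VICReg coefficients quoted from the paper's parameterization.
gamma, lam, mu, nu, eps = 1, 25, 25, 1, 1e-4
```

Only the stated hyperparameters are taken from the paper; the activation function and batch handling are unstated and chosen here for concreteness.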