reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Authors: Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data.
Researcher Affiliation	Collaboration	1 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2 ELLIS Institute Tübingen; 3 Tübingen AI Center; 4 Google DeepMind. Correspondence to: Fanfei Li <EMAIL>.
Pseudocode	No	The paper describes the construction of new OOD distortions (e.g., Mosaic, Glitched, Vertical Lines, Geometric Shapes, Stickers, Luminance Checkerboard) with parameters for different intensity levels (Tables 2-7). However, it does not present these procedures as formal pseudocode or algorithm blocks.
Open Source Code	Yes	The evaluation code for LAION-C is publicly available at: https://github.com/FanfeiLi/LAION-C.
Open Datasets	Yes	The LAION-C dataset is published on Zenodo. A link to the dataset is provided via the GitHub repository.
Dataset Splits	No	Since the dataset is primarily used for benchmarking purposes, splitting specifics are not provided. Essentially, the entire dataset is a validation set.
Hardware Specification	Yes	Our experiments were conducted in a darkened cabin, using a 22" VIEWPixx 3D light LCD monitor (VPixx Technologies, Saint-Bruno, Canada) at a refresh rate of 120 Hz (scanning backlight mode on). The screen measures 484 × 302 mm, at a resolution of 1920 × 1200 pixels. ... The experiment was implemented using the Psychophysics Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, United States) using a 12-core desktop computer (AMD HD7970 graphics card Tahiti by AMD, Sunnyvale, California, United States) running Kubuntu 14.04 LTS.
Software Dependencies	Yes	The experiment was implemented using the Psychophysics Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, United States) using a 12-core desktop computer (AMD HD7970 graphics card Tahiti by AMD, Sunnyvale, California, United States) running Kubuntu 14.04 LTS.
Experiment Setup	Yes	Participants were given 2.5 s to view each image, followed by a 2 s response window to classify the image by clicking on a set of icons. ... To motivate high performance, a monetary bonus was awarded for surpassing fixed, predetermined performance thresholds for each block.