reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Underlying Universal Statistical Structure of Natural Datasets

Authors: Noam Itzhak Levi, Yaron Oz

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We study universal properties in real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) bulk eigenvalue power-law scaling vastly differs between uncorrelated Gaussian and real-world data, (ii) this power law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with RMT structure and requires fewer samples in strongly correlated datasets. These results suggest natural image Gram matrices can be approximated by Wishart random matrices with simple covariance structure, enabling rigorous analysis of neural network behavior. ... In Fig. 1, we show the Σ_{ij,M} eigenvalue power law decay for the different classes of data (i.e. real-world, UGD and CGDs). ... In Fig. 3, we demonstrate that the bulk of eigenvalues for various real-world datasets behaves as the energy eigenvalues of a quantum chaotic system described by the GOE universality class.
Researcher Affiliation	Academia	1 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; 2 Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel. Correspondence to: Noam Levi <EMAIL>.
Pseudocode	No	The paper describes mathematical procedures and RMT diagnostic tools, such as the unfolding procedure in Appendix A, using equations and descriptive text. However, it does not present any of these as structured pseudocode or an algorithm block.
Open Source Code	No	The paper does not contain any explicit statement about releasing code, nor does it provide a link to a code repository. The text focuses on theoretical analysis and empirical observations without offering an implementation for public access.
Open Datasets	Yes	We study the following real-world datasets: MNIST (LeCun et al., 2010), FMNIST (Xiao et al., 2017), CIFAR10 (cif), Tiny-IMAGENET (Torralba et al., 2008), and CelebA (Liu et al., 2015) (downsampled to 109 × 89 in grayscale).
Dataset Splits	No	The paper mentions using 'the entire dataset' for real-world datasets and 'M = 50k for the gaussian data', and also varying 'M' for convergence studies. While it refers to sample sizes, it does not specify explicit training, validation, or test splits for any of the datasets used to reproduce experiments or evaluate models. For example, it does not state '80/10/10 split' or provide sample counts for different subsets used for model training/testing.
Hardware Specification	No	The paper does not provide any specific details about the hardware used to conduct the experiments, such as CPU or GPU models, memory, or computing cluster specifications.
Software Dependencies	No	The paper does not list any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for the analysis or simulations.
Experiment Setup	No	The paper primarily focuses on the statistical analysis of datasets and theoretical modeling using RMT. While it describes data preprocessing (centering and normalizing) and parameters for generating synthetic data, it does not provide specific experimental setup details like hyperparameters (e.g., learning rates, batch sizes, epochs) for training machine learning models in an empirical setting. The 'Teacher-student model' in Appendix F is presented as a theoretical example with analytical solutions, not as an empirical experimental setup.
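
The rows above note that the paper provides neither code nor pseudocode. As a rough illustration of the kind of analysis the summary describes, and not the authors' implementation, the following Python sketch centers a data matrix, computes the eigenvalues of its feature-feature covariance, fits a power-law exponent to part of the spectrum, and computes the mean consecutive-spacing ratio as a local RMT diagnostic. Several choices here are assumptions made for illustration: the spacing ratio is used in place of the unfolding procedure mentioned for Appendix A because it needs no unfolding; the "correlated Gaussian" data is generated from an assumed diagonal population covariance with power-law eigenvalues; and the function names, sample sizes, exponent, and fit range are all hypothetical.

```python
import numpy as np


def covariance_spectrum(X):
    """Eigenvalues (descending) of the feature-feature covariance of X.

    X has shape (M samples, d features). Each feature is centered first.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / X.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order


def mean_spacing_ratio(eigs):
    """Mean ratio of consecutive level spacings, min(s_i, s_{i+1}) / max(s_i, s_{i+1}).

    Reference values: about 0.5307 for GOE statistics and
    2 ln 2 - 1 (about 0.386) for uncorrelated (Poisson) levels.
    """
    s = np.diff(np.sort(eigs))
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()


def power_law_exponent(eigs, lo, hi):
    """Least-squares slope of log(eigenvalue) vs log(rank) over ranks lo+1..hi.

    eigs must be sorted in descending order. Returns the positive decay exponent.
    """
    ranks = np.arange(lo, hi) + 1
    slope, _ = np.polyfit(np.log(ranks), np.log(eigs[lo:hi]), 1)
    return -slope


rng = np.random.default_rng(0)
d, alpha = 200, 0.5  # hypothetical feature count and decay parameter

# Uncorrelated Gaussian data: the sample covariance is a Wishart matrix.
ugd = rng.standard_normal((2000, d))
print("uncorrelated Gaussian, mean spacing ratio:",
      mean_spacing_ratio(covariance_spectrum(ugd)))

# Assumed stand-in for correlated Gaussian data: population covariance
# diag(i^-(1 + alpha)). Its eigenvalues are unchanged by any rotation of
# the features, so no explicit rotation is applied.
lam = np.arange(1, d + 1, dtype=float) ** -(1 + alpha)
cgd = rng.standard_normal((20000, d)) * np.sqrt(lam)
print("correlated Gaussian, fitted exponent:",
      power_law_exponent(covariance_spectrum(cgd), 5, 100))
```

With many more samples than features, the fitted exponent should land near the assumed 1 + alpha, and the uncorrelated case should give a spacing ratio near the GOE reference value. Real image datasets would replace the synthetic arrays with flattened, normalized pixel matrices.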