The Underlying Universal Statistical Structure of Natural Datasets
Authors: Noam Itzhak Levi, Yaron Oz
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study universal properties in real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) bulk eigenvalue power-law scaling vastly differs between uncorrelated Gaussian and real-world data, (ii) this power law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with RMT structure and requires fewer samples in strongly correlated datasets. These results suggest natural image Gram matrices can be approximated by Wishart random matrices with simple covariance structure, enabling rigorous analysis of neural network behavior. ... In Fig. 1, we show the $\Sigma_{ij,M}$ eigenvalue power law decay for the different classes of data (i.e. real-world, UGD and CGDs). ... In Fig. 3, we demonstrate that the bulk of eigenvalues for various real-world datasets behaves as the energy eigenvalues of a quantum chaotic system described by the GOE universality class. |
| Researcher Affiliation | Academia | (1) École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; (2) Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel. Correspondence to: Noam Levi <EMAIL>. |
| Pseudocode | No | The paper describes mathematical procedures and RMT diagnostic tools, such as the unfolding procedure in Appendix A, using equations and descriptive text. However, it does not present any of these as structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing code, nor does it provide a link to a code repository. The text focuses on theoretical analysis and empirical observations without offering an implementation for public access. |
| Open Datasets | Yes | We study the following real-world datasets: MNIST (LeCun et al., 2010), FMNIST (Xiao et al., 2017), CIFAR10 (cif), Tiny-IMAGENET (Torralba et al., 2008), and CelebA (Liu et al., 2015) (downsampled to 109 × 89 in grayscale). |
| Dataset Splits | No | The paper mentions using 'the entire dataset' for real-world datasets and 'M = 50k for the gaussian data', and also varying 'M' for convergence studies. While it refers to sample sizes, it does not specify explicit training, validation, or test splits for any of the datasets used to reproduce experiments or evaluate models. For example, it does not state '80/10/10 split' or provide sample counts for different subsets used for model training/testing. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to conduct the experiments, such as CPU or GPU models, memory, or computing cluster specifications. |
| Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for the analysis or simulations. |
| Experiment Setup | No | The paper primarily focuses on the statistical analysis of datasets and theoretical modeling using RMT. While it describes data preprocessing (centering and normalizing) and parameters for generating synthetic data, it does not provide specific experimental setup details like hyperparameters (e.g., learning rates, batch sizes, epochs) for training machine learning models in an empirical setting. The 'Teacher-student model' in Appendix F is presented as a theoretical example with analytical solutions, not as an empirical experimental setup. |
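
Since the paper provides neither pseudocode nor released code, the following is only a minimal sketch of the kind of analysis the table rows describe: centering the data, forming the feature-feature covariance matrix, fitting the bulk eigenvalue power law, and checking a local RMT statistic. Everything named here is hypothetical and not taken from the paper: the function names, the diagonal power-law population covariance used as a stand-in for correlated Gaussian data, the exponent `alpha = 0.3`, the fitted rank window, and the matrix sizes. The local diagnostic is the mean consecutive-spacing ratio, swapped in because it needs no unfolding; it is not the unfolding procedure the paper describes in Appendix A.

```python
import numpy as np


def feature_covariance(X):
    """Empirical feature-feature covariance of X (shape: d features x M samples)."""
    Xc = X - X.mean(axis=1, keepdims=True)  # center each feature over the samples
    return Xc @ Xc.T / X.shape[1]


def power_law_slope(eigs, lo, hi):
    """Least-squares slope of log(eigenvalue) vs log(rank) over ranks lo..hi (1-indexed)."""
    lam = np.sort(eigs)[::-1]  # descending order, rank 1 = largest eigenvalue
    ranks = np.arange(1, lam.size + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks[lo - 1:hi]), np.log(lam[lo - 1:hi]), 1)
    return slope


def mean_spacing_ratio(eigs):
    """Mean of min(s_i, s_{i+1}) / max(s_i, s_{i+1}) over consecutive level spacings."""
    s = np.diff(np.sort(eigs))
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()


def correlated_gaussian(d, M, alpha, rng):
    """Assumed stand-in for correlated Gaussian data: independent features whose
    population variances decay as rank^-(1 + alpha). Not the paper's construction."""
    scales = np.arange(1, d + 1, dtype=float) ** (-(1.0 + alpha) / 2.0)
    return scales[:, None] * rng.standard_normal((d, M))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Global statistic: power-law decay of the covariance spectrum.
    X = correlated_gaussian(d=200, M=5000, alpha=0.3, rng=rng)
    eigs = np.linalg.eigvalsh(feature_covariance(X))
    print("fitted log-log slope (ranks 10..100):", power_law_slope(eigs, 10, 100))

    # Local statistic: spacing ratio of an uncorrelated Gaussian (Wishart) spectrum.
    W = rng.standard_normal((400, 1600))
    w_eigs = np.linalg.eigvalsh(feature_covariance(W))
    print("mean spacing ratio:", mean_spacing_ratio(w_eigs))
```

Under these assumptions the fitted slope should land near `-(1 + alpha)`, and the mean spacing ratio of the Wishart spectrum should sit close to the GOE reference value of roughly 0.53 rather than the roughly 0.39 expected for uncorrelated (Poisson) levels. Reproducing the paper's figures would additionally require its actual synthetic-data covariance, sample sizes, and unfolding choices, none of which are specified in the rows above.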