Generalization through variance: how noise shapes inductive biases in diffusion models

Authors: John Vastola

ICLR 2025

Reproducibility assessment — variable, result, and LLM response:
Research Type: Theoretical. In this paper, we develop a mathematical theory that partly explains this "generalization through variance" phenomenon. Our theoretical analysis exploits a physics-inspired path-integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with gaps filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training.
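The "noisy target" referred to above is, in standard denoising score matching, the clean sample the network regresses toward given a noised input; because several clean samples can produce the same noised input, the loss-minimizing prediction is a posterior average, which smooths the learned distribution. The following is a minimal illustrative sketch of that mechanism (an assumed standard setup, not the paper's exact formulation), using a three-point 1D distribution like the paper's toy example:

```python
import numpy as np

# Toy training distribution: three point masses, as in the paper's 1D example.
data = np.array([-1.0, 0.0, 1.0])
sigma = 0.5  # noise scale at some fixed diffusion time (assumed value)

# During training the regression target is x0, given the noised input
# x_t = x0 + sigma * eps.  Since several x0 can generate the same x_t,
# the MSE-optimal prediction is the posterior mean E[x0 | x_t]:
def optimal_denoiser(xt, data=data, sigma=sigma):
    # Gaussian likelihood of xt under each data point (uniform prior),
    # normalized to posterior weights over the data points.
    w = np.exp(-0.5 * ((xt[:, None] - data[None, :]) / sigma) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ data

xt = np.linspace(-2, 2, 9)
print(optimal_denoiser(xt))
```

At inputs between the training points the posterior mean interpolates smoothly (e.g. it is roughly 0.5 at x_t = 0.5), so sampling against this target places mass between training points — the "gaps filled in" behavior. The spread of targets at a fixed noised input (their covariance) is what the abstract identifies as the source of the inductive bias.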
Researcher Affiliation: Academia. John J. Vastola, Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA. EMAIL
Pseudocode: No. The paper describes its mathematical derivations and theoretical concepts in prose and equations; it does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes. See https://github.com/john-vastola/gtv-iclr25 for code that produces Figs. 1-3.
Open Datasets: No. The paper mentions CIFAR-10 and ImageNet-64 as datasets used by other state-of-the-art models for context, but its own theoretical analysis and illustrative figures (Figs. 1-3) use synthetic or toy data ("four example 2D data distributions", "a 1D data distribution {-1, 0, 1}", "a 2D data distribution"). No concrete access information is provided for these synthetic distributions, and the well-known datasets are not used for the paper's own results.
Dataset Splits: No. The paper's results are based on theoretical analysis and illustrative examples using synthetic data; standard train/test/validation splits are not mentioned for the data distributions used to generate the figures.
Hardware Specification: No. The paper does not mention any specific hardware (e.g., GPU or CPU models) used for its theoretical analyses or for generating the illustrative figures.
Software Dependencies: No. The paper does not explicitly state any software dependencies with specific version numbers. While code is provided, the paper text itself lacks this information.
Experiment Setup: No. The paper describes theoretical models and their parameters (e.g., N = 100 for the linear models, Gaussian features, Fourier features, the time cutoff ϵ, and the ratio F/P). However, these are parameters of the theoretical analysis and illustrative examples, not concrete hyperparameters or training configurations of the kind typically reported in an experimental-setup section for reproducibility.
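The parameters listed above (a feature count such as N = 100, Gaussian or Fourier features, and a feature-to-sample ratio F/P) suggest random-feature regression models of the denoising target. The sketch below is purely hypothetical — it illustrates the flavor of such a setup, not the paper's actual equations: F Gaussian bump features with random centers, fit by ridge regression on P noisy training pairs, where F/P controls over- versus underparameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical random-feature model of the denoising target.
# All values here are illustrative assumptions, not the paper's settings.
F, P, sigma, width = 100, 500, 0.5, 0.3

x0 = rng.choice([-1.0, 0.0, 1.0], size=P)    # toy training data (clean)
xt = x0 + sigma * rng.standard_normal(P)     # noised inputs
centers = rng.uniform(-2, 2, size=F)         # random Gaussian feature centers

def features(x):
    # F Gaussian bumps evaluated at each input point.
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Ridge-regression fit of the denoiser weights on the noisy pairs.
Phi = features(xt)
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(F), Phi.T @ x0)

grid = np.linspace(-2, 2, 9)
denoised = features(grid) @ w
print(np.round(denoised, 2))
```

With F < P the model is underparameterized and the fit approximates the posterior mean of the clean data given the noised input; varying F/P (or the noise level) changes how aggressively the learned denoiser smooths between the training points.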