Shallow diffusion networks provably learn hidden low-dimensional structure

Authors: Nicholas Boffi, Arthur Jacot, Stephen Tu, Ingvar Ziemann

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results that verify our theoretical claims. To this end, we compare the performance of a single-hidden-layer (Barron) neural network with a random feature network (Bach, 2015; Rahimi & Recht, 2007) and a two-hidden-layer neural network as a function of the ambient dimension D on two synthetic tasks that isolate the hidden structures we analyzed. To make the comparison as fair as possible, all networks are trained with gradient descent for 2.5 × 10^5 steps. We use a cosine decay schedule that initializes the learning rate at 5 × 10^-4 and ends at 0. The single-hidden-layer neural network has a hidden-layer width of 1024 neurons; to keep the number of parameters fixed, the corresponding random feature model has 2048 neurons. To avoid an unnatural reduction in width for the two-hidden-layer neural network, we keep the width fixed at 1024 rather than fixing the number of parameters. We sample from each learned diffusion model using 1024 evenly-spaced timesteps on the horizon t ∈ [0, 5]. Hidden linear structure. We consider data points x ∈ R^D given by x = Uz, where z ∈ R^d with d = 2 is drawn from the Gaussian mixture model ρ = (1/N) ∑_{i=1}^N N(µ_i, Σ_i) with N = 8. Here, U ∈ R^{D×d} is a random orthogonal matrix that we keep fixed across all networks for constant D. We take Σ_i = (1/2)I for all i and µ_i = r(cos(2πi/N), sin(2πi/N)), which leads to an eight-mode, narrowly-spaced mixture evenly distributed on the circle of radius r = 5. We study the performance of each network architecture as a function of D ranging from D = 2 to D = 64 in powers of two. Results are shown in Figure 1, where we visualize 5 × 10^4 samples ẑ = U^⊤x̂ from each model. In each case, performance degrades as D increases, reflecting the increasing difficulty of the learning problem.
Notably, performance degrades significantly more rapidly for the random feature model than for the Barron model, despite the fact that both models have the same number of parameters; in particular, for D = 32 and D = 64, samples do not lie within the visualized frame because the model becomes unstable. This is consistent with our theory, which predicts that sample complexity guarantees for the Barron model depend exponentially on d = 2 rather than on D, whereas the random feature model does not adapt to the latent structure (Bach, 2017). Unsurprisingly, given its extra parameters and additional nonlinear layer, the two-hidden-layer network maintains the highest performance for all D.
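The hidden-linear-structure data described in the quoted excerpt can be sketched in NumPy. This is a reconstruction from the prose description only (the paper releases no code); the function name, seed handling, and use of reduced QR to draw the random orthogonal embedding are our choices:

```python
import numpy as np

def hidden_linear_data(D, n_samples, d=2, N=8, r=5.0, seed=0):
    """Sample x = U z in R^D, where z in R^d is drawn from an N-mode
    Gaussian mixture with means evenly spaced on the circle of radius r
    and per-component covariance Sigma_i = I/2."""
    rng = np.random.default_rng(seed)
    # Mixture means: mu_i = r * (cos(2*pi*i/N), sin(2*pi*i/N)).
    angles = 2 * np.pi * np.arange(1, N + 1) / N
    means = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (N, d)
    # Pick a mode uniformly for each sample, then add Gaussian noise.
    modes = rng.integers(0, N, size=n_samples)
    z = means[modes] + np.sqrt(0.5) * rng.standard_normal((n_samples, d))
    # Random orthogonal embedding U in R^{D x d} (orthonormal columns),
    # kept fixed across all networks at a given D, as in the paper.
    U, _ = np.linalg.qr(rng.standard_normal((D, d)))
    return z @ U.T, U  # x has shape (n_samples, D)

x, U = hidden_linear_data(D=64, n_samples=1024)
```

Because U has orthonormal columns, generated samples x̂ can be pulled back to the latent plane via ẑ = U^⊤x̂ for visualization, as in Figure 1.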
Researcher Affiliation | Academia | Nicholas M. Boffi (Carnegie Mellon University), Arthur Jacot (New York University), Stephen Tu (University of Southern California), Ingvar Ziemann (University of Pennsylvania)
Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 3, "PROBLEM FORMULATION AND MAIN RESULTS," defines equations such as (3.1) and (3.4) and explains the procedure in prose rather than in algorithmic form.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository. There is no mention of supplementary material containing code either.
Open Datasets | No | The paper describes the use of synthetic datasets for its experimental results in Section 4: "Hidden linear structure. We consider data points x ∈ R^D given by x = Uz, where z ∈ R^d with d = 2 is drawn from the Gaussian mixture model ρ = (1/N) ∑_{i=1}^N N(µ_i, Σ_i) with N = 8..." "Hidden independent components. We now consider data points x = (x_1, x_2, ..., x_K) ∈ R^D, where D = Kd and each x_k ∈ R^d. We again take d = 2 for visualization, and we draw each x_k from the Gaussian mixture model ρ_k = (1/N) ∑_{i=1}^N N(µ_i^k, Σ_i) with N = 4." These are custom-generated datasets, and there is no mention of them being made publicly available, nor are there citations to existing public datasets.
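The second synthetic dataset, with hidden independent components, can be sketched along the same lines. The quoted excerpt does not specify the per-block means µ_i^k, so the sketch below takes them as an input, and the circle-of-means example at the bottom is purely illustrative; all names are ours:

```python
import numpy as np

def independent_components_data(means, n_samples, sigma2=0.5, seed=0):
    """Sample x = (x_1, ..., x_K) in R^{K*d}, with each block x_k drawn
    independently from its own N-mode Gaussian mixture.

    means: array of shape (K, N, d) holding the mu_i^k; the paper's
    excerpt does not pin these down, so they are a parameter here.
    """
    rng = np.random.default_rng(seed)
    K, N, d = means.shape
    blocks = []
    for k in range(K):
        # Each block gets its own mode assignments and Gaussian noise.
        modes = rng.integers(0, N, size=n_samples)
        blocks.append(means[k][modes]
                      + np.sqrt(sigma2) * rng.standard_normal((n_samples, d)))
    return np.concatenate(blocks, axis=1)  # shape (n_samples, K*d)

# Illustrative means only: K = 3 blocks, N = 4 modes per block, d = 2.
angles = 2 * np.pi * np.arange(4) / 4
circle = 5.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (4, 2)
mu = np.stack([circle for _ in range(3)])                          # (3, 4, 2)
x = independent_components_data(mu, n_samples=256)
```

Sampling each block independently is what encodes the product structure ρ = ρ_1 ⊗ ... ⊗ ρ_K that the experiment isolates.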
Dataset Splits | No | The paper mentions generating samples for visualization but does not specify any training, validation, or test splits for the datasets used in the experiments. For example, Section 4 ("Hidden linear structure") states: "Results are shown in Figure 1, where we visualize 5 × 10^4 samples ẑ = U^⊤x̂ from each model." This describes visualization of generated samples rather than a dataset split for training and evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. It mentions training networks and sampling from models but omits any information on GPU, CPU, or memory specifications.
Software Dependencies | No | The paper notes that "all networks are trained with gradient descent" and details the training schedule (cosine decay, learning rate), but it does not name specific software or libraries with version numbers (e.g., Python, PyTorch, or TensorFlow versions). This is insufficient for reproducing the software environment.
Experiment Setup | Yes | In Section 4, "EXPERIMENTAL RESULTS," the paper states: "To make the comparison as fair as possible, all networks are trained with gradient descent for 2.5 × 10^5 steps. We use a cosine decay schedule that initializes the learning rate at 5 × 10^-4 and ends at 0. The single hidden layer neural network has a hidden layer width of 1024 neurons; to keep the number of parameters fixed, the corresponding random feature model has 2048 neurons. To avoid an unnatural reduction in width for the two hidden-layer neural network, we keep the width fixed at 1024 rather than fixing the number of parameters. We sample from each learned diffusion model using 1024 evenly-spaced timesteps on the horizon t ∈ [0, 5]." This provides concrete details on training steps, the learning-rate schedule, and the network architectures.
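The quoted schedule (cosine decay from 5 × 10^-4 to 0 over 2.5 × 10^5 steps) matches the standard cosine-decay rule lr(t) = (lr_0/2)(1 + cos(πt/T)). A minimal sketch, assuming that standard form (the paper does not spell out the formula):

```python
import math

PEAK_LR = 5e-4        # initial learning rate from the paper
TOTAL_STEPS = 250_000  # 2.5 x 10^5 gradient descent steps

def cosine_decay_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Cosine decay: starts at peak_lr and reaches 0 at the final step."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

At step 0 this returns 5 × 10^-4, at the halfway point 2.5 × 10^-4, and at step 2.5 × 10^5 it reaches 0, matching the schedule described in the excerpt.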