Shallow diffusion networks provably learn hidden low-dimensional structure

Authors: Nicholas Boffi, Arthur Jacot, Stephen Tu, Ingvar Ziemann

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results that verify our theoretical claims. To this end, we compare the performance of a single-hidden-layer (Barron) neural network with a random feature network (Bach, 2015; Rahimi & Recht, 2007) and a two-hidden-layer neural network as a function of the ambient dimension D on two synthetic tasks that isolate the hidden structures we analyzed. To make the comparison as fair as possible, all networks are trained with gradient descent for 2.5 × 10^5 steps. We use a cosine decay schedule that initializes the learning rate at 5 × 10^-4 and ends at 0. The single-hidden-layer neural network has a hidden-layer width of 1024 neurons; to keep the number of parameters fixed, the corresponding random feature model has 2048 neurons. To avoid an unnatural reduction in width for the two-hidden-layer neural network, we keep the width fixed at 1024 rather than fixing the number of parameters. We sample from each learned diffusion model using 1024 evenly-spaced timesteps on the horizon t ∈ [0, 5]. Hidden linear structure. We consider data points x ∈ R^D given by x = Uz, where z ∈ R^d with d = 2 is drawn from the Gaussian mixture model ρ = (1/N) ∑_{i=1}^N N(µ_i, Σ_i) with N = 8. Here, U ∈ R^{D×d} is a random orthogonal matrix that we keep fixed across all networks for constant D. We take Σ_i = (1/2)I for all i and µ_i = r(cos(2πi/N), sin(2πi/N)), which leads to an eight-mode, narrowly-spaced mixture evenly distributed on the circle of radius r = 5. We study the performance of each network architecture as a function of D ranging from D = 2 to D = 64 in powers of two. Results are shown in Figure 1, where we visualize 5 × 10^4 samples ẑ = U^⊤x̂ from each model. In each case, performance degrades as D increases, reflecting the increasing difficulty of the learning problem.
Notably, performance degrades significantly more rapidly for the random feature model than for the Barron model, despite the fact that both models have the same number of parameters; in particular, for D = 32 and D = 64, samples do not lie within the visualized frame because the model becomes unstable. This is consistent with our theory, which predicts that sample complexity guarantees for the Barron model depend exponentially on d = 2 rather than on D, whereas the random feature model does not adapt to the latent structure (Bach, 2017). Unsurprisingly, given its extra parameters and additional nonlinear layer, the two-hidden-layer network maintains the highest performance for all D.
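The hidden-linear-structure data described in the quoted excerpt can be sketched in NumPy. This is a reconstruction from the prose description only (the paper releases no code); the function name, seed handling, and use of reduced QR to draw the random orthogonal embedding are our choices:

```python
import numpy as np

def hidden_linear_data(D, n_samples, d=2, N=8, r=5.0, seed=0):
    """Sample x = U z in R^D, where z in R^d is drawn from an N-mode
    Gaussian mixture with means evenly spaced on the circle of radius r
    and per-component covariance Sigma_i = I/2."""
    rng = np.random.default_rng(seed)
    # Mixture means: mu_i = r * (cos(2*pi*i/N), sin(2*pi*i/N)).
    angles = 2 * np.pi * np.arange(1, N + 1) / N
    means = r * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (N, d)
    # Pick a mode uniformly for each sample, then add Gaussian noise.
    modes = rng.integers(0, N, size=n_samples)
    z = means[modes] + np.sqrt(0.5) * rng.standard_normal((n_samples, d))
    # Random orthogonal embedding U in R^{D x d} (orthonormal columns),
    # kept fixed across all networks at a given D, as in the paper.
    U, _ = np.linalg.qr(rng.standard_normal((D, d)))
    return z @ U.T, U  # x has shape (n_samples, D)

x, U = hidden_linear_data(D=64, n_samples=1024)
```

Because U has orthonormal columns, generated samples x̂ can be pulled back to the latent plane via ẑ = U^⊤x̂ for visualization, as in Figure 1.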
Researcher Affiliation | Academia | Nicholas M. Boffi (Carnegie Mellon University), Arthur Jacot (New York University), Stephen Tu (University of Southern California), Ingvar Ziemann (University of Pennsylvania)
Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 3, "PROBLEM FORMULATION AND MAIN RESULTS," defines equations such as (3.1) and (3.4) and explains the procedure in prose rather than in algorithmic form.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository. There is no mention of supplementary material containing code either.
Open Datasets | No | The paper describes the use of synthetic datasets for its experimental results in Section 4: "Hidden linear structure. We consider data points x ∈ R^D given by x = Uz, where z ∈ R^d with d = 2 is drawn from the Gaussian mixture model ρ = (1/N) ∑_{i=1}^N N(µ_i, Σ_i) with N = 8..." "Hidden independent components. We now consider data points x = (x_1, x_2, ..., x_K) ∈ R^D, where D = Kd and each x_k ∈ R^d. We again take d = 2 for visualization, and we draw each x_k from the Gaussian mixture model ρ_k = (1/N) ∑_{i=1}^N N(µ_i^k, Σ_i) with N = 4." These are custom-generated datasets, and there is no mention of them being made publicly available, nor are there citations to existing public datasets.
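The second synthetic dataset, with hidden independent components, can be sketched along the same lines. The quoted excerpt does not specify the per-block means µ_i^k, so the sketch below takes them as an input, and the circle-of-means example at the bottom is purely illustrative; all names are ours:

```python
import numpy as np

def independent_components_data(means, n_samples, sigma2=0.5, seed=0):
    """Sample x = (x_1, ..., x_K) in R^{K*d}, with each block x_k drawn
    independently from its own N-mode Gaussian mixture.

    means: array of shape (K, N, d) holding the mu_i^k; the paper's
    excerpt does not pin these down, so they are a parameter here.
    """
    rng = np.random.default_rng(seed)
    K, N, d = means.shape
    blocks = []
    for k in range(K):
        # Each block gets its own mode assignments and Gaussian noise.
        modes = rng.integers(0, N, size=n_samples)
        blocks.append(means[k][modes]
                      + np.sqrt(sigma2) * rng.standard_normal((n_samples, d)))
    return np.concatenate(blocks, axis=1)  # shape (n_samples, K*d)

# Illustrative means only: K = 3 blocks, N = 4 modes per block, d = 2.
angles = 2 * np.pi * np.arange(4) / 4
circle = 5.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (4, 2)
mu = np.stack([circle for _ in range(3)])                          # (3, 4, 2)
x = independent_components_data(mu, n_samples=256)
```

Sampling each block independently is what encodes the product structure ρ = ρ_1 ⊗ ... ⊗ ρ_K that the experiment isolates.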
Dataset Splits | No | The paper mentions generating samples for visualization but does not specify any training, validation, or test splits for the datasets used in the experiments. For example, Section 4 ("Hidden linear structure") states: "Results are shown in Figure 1, where we visualize 5 × 10^4 samples ẑ = U^⊤x̂ from each model." This describes visualization of generated samples rather than a dataset split for training and evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. It mentions training networks and sampling from models but omits any information on GPU, CPU, or memory specifications.
Software Dependencies | No | The paper notes that "all networks are trained with gradient descent" and details the training schedule (cosine decay, learning rate), but it does not name specific software or libraries with version numbers (e.g., Python, PyTorch, or TensorFlow versions). This is insufficient for reproducing the software environment.
Experiment Setup | Yes | In Section 4, "EXPERIMENTAL RESULTS," the paper states: "To make the comparison as fair as possible, all networks are trained with gradient descent for 2.5 × 10^5 steps. We use a cosine decay schedule that initializes the learning rate at 5 × 10^-4 and ends at 0. The single hidden layer neural network has a hidden layer width of 1024 neurons; to keep the number of parameters fixed, the corresponding random feature model has 2048 neurons. To avoid an unnatural reduction in width for the two hidden-layer neural network, we keep the width fixed at 1024 rather than fixing the number of parameters. We sample from each learned diffusion model using 1024 evenly-spaced timesteps on the horizon t ∈ [0, 5]." This provides concrete details on training steps, the learning-rate schedule, and the network architectures.
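The quoted schedule (cosine decay from 5 × 10^-4 to 0 over 2.5 × 10^5 steps) matches the standard cosine-decay rule lr(t) = (lr_0/2)(1 + cos(πt/T)). A minimal sketch, assuming that standard form (the paper does not spell out the formula):

```python
import math

PEAK_LR = 5e-4        # initial learning rate from the paper
TOTAL_STEPS = 250_000  # 2.5 x 10^5 gradient descent steps

def cosine_decay_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Cosine decay: starts at peak_lr and reaches 0 at the final step."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

At step 0 this returns 5 × 10^-4, at the halfway point 2.5 × 10^-4, and at step 2.5 × 10^5 it reaches 0, matching the schedule described in the excerpt.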