Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

Authors: Qi Chen, Jierui Zhu, Florian Shkurti

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.
Researcher Affiliation Academia Qi Chen 1,3,4, Jierui Zhu 2, & Florian Shkurti 1,4,5. 1 Department of Computer Science, University of Toronto; 2 Department of Statistical Sciences, University of Toronto; 3 Data Science Institute; 4 Vector Institute; 5 Robotics Institute. Correspondence to: EMAIL, EMAIL.
Pseudocode No The paper describes methods using mathematical formulations and textual descriptions, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Our experimental code is available at https://github.com/livreQ/InfoGenAnalysis.
Open Datasets Yes Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory. ...We begin by validating the theorem on a simple synthetic 2D dataset derived from the Swiss Roll dataset. ...We further estimate the bound and the test data KL divergence (or log densities) by training DMs on MNIST and CIFAR10 datasets with few-shot data (m = 16) and full train dataset.
Dataset Splits Yes We train the score matching model sθ(x, t) and estimate the upper bound in Theorem 6.2 on a training set of size m. With respect to the expectation over dataset S, we conduct a 5-fold Monte-Carlo estimation by randomly generating train datasets with different random seeds. For the left-hand-side KL divergence, we conduct Monte Carlo estimation with 1000 test data points. ...We further estimate the bound and the test data KL divergence (or log densities) by training DMs on MNIST and CIFAR10 datasets with few-shot data (m = 16) and the full train dataset.
Hardware Specification Yes The experiments for Swiss Roll data were run on a machine with a single 2080Ti GPU with 11 GB of memory. The experiments for MNIST and CIFAR10 were run on several server nodes, each with 6 CPUs and one GPU with 32 GB of memory.
Software Dependencies No We optimized the model parameters using the Adam optimizer with a learning rate of η = 10⁻³. The training was conducted with a batch size of 64, while the remaining Adam hyperparameters were kept at their default values in PyTorch.
Experiment Setup Yes We optimized the model parameters using the Adam optimizer with a learning rate of η = 10⁻³. The training was conducted with a batch size of 64, while the remaining Adam hyperparameters were kept at their default values in PyTorch. The model was trained for 100 epochs. ...The score matching model sθ(x, t) is trained for 10000 iterations, and the backward generation takes 1000 steps, i.e., N = 1000.
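The dataset-level Monte-Carlo protocol quoted above (resampling 5 few-shot train sets of size m = 16 with different seeds, evaluating the bound on each, and averaging) can be sketched as follows. This is a minimal NumPy illustration of the evaluation loop only, not the authors' code: `make_swiss_roll_2d` is a hypothetical 2D stand-in for the paper's synthetic dataset, and `estimate_bound` is a placeholder for training sθ(x, t) and computing the Theorem 6.2 bound.

```python
import numpy as np

def make_swiss_roll_2d(m, rng):
    # Hypothetical 2D Swiss Roll-style synthetic data (the paper's exact
    # data generation procedure is not quoted here).
    t = 1.5 * np.pi * (1 + 2 * rng.random(m))
    return np.stack([t * np.cos(t), t * np.sin(t)], axis=1)

def estimate_bound(train_set):
    # Placeholder for training the score model on `train_set` and
    # evaluating the generalization bound; returns a toy statistic
    # so the outer loop is runnable.
    return float(np.mean(np.linalg.norm(train_set, axis=1)))

m, n_seeds = 16, 5          # few-shot size and 5-fold MC over datasets S
estimates = []
for seed in range(n_seeds):
    rng = np.random.default_rng(seed)     # fresh train set per seed
    S = make_swiss_roll_2d(m, rng)
    estimates.append(estimate_bound(S))

mc_estimate = float(np.mean(estimates))   # Monte-Carlo estimate of E_S[bound]
```

The key design point is that the expectation over the training set S is approximated by averaging over independently resampled datasets, while the left-hand-side KL term would separately be estimated on held-out test points.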