Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
End-to-end Learning of Gaussian Mixture Priors for Diffusion Sampler
Authors: Denis Blessing, Xiaogang Jia, Gerhard Neumann
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we test the impact of our proposed end-to-end learning scheme for prior distributions. Specifically, we consider three distinct settings: First, we evaluate these methods with a Gaussian prior that is fixed during training. Second and third, we consider learned Gaussian (GP) and Gaussian mixture priors (GMP). For evaluation, we consider the effective sample size (ESS) and the marginal or extended evidence lower bound as performance criteria. Both are denoted as ELBO for convenience. Next, if the ground truth normalization constant Z is available, we use an importance-weighted estimate Ẑ to compute the estimation error \|log Z − log Ẑ\|. Additionally, if samples from the target π are available, we compute the Sinkhorn distance Wγ² (Cuturi, 2013). To ensure a fair comparison, all experiments are conducted under identical settings. Our evaluation methodology adheres to the protocol by Blessing et al. (2024). For a comprehensive overview of the experimental setup see Appendix C. Moreover, a comprehensive set of ablation studies and additional experiments is provided in Appendix D. |
| Researcher Affiliation | Academia | 1Autonomous Learning Robots, Karlsruhe Institute of Technology 2FZI Research Center for Information Technology |
| Pseudocode | Yes | Algorithm 1 Training of diffusion sampler with learnable prior |
| Open Source Code | No | The paper mentions using the 'Jax library (Bradbury et al., 2021)' for experiments, but it does not provide any explicit statement or link for the open-sourcing of the authors' own implementation code for the methodology described in the paper. |
| Open Datasets | Yes | Next, we consider the Fashion target which uses NICE (Dinh et al., 2014) to train a normalizing flow on the high-dimensional d = 28 × 28 = 784 MNIST Fashion dataset. A recent study by Blessing et al. (2024) showed that current state-of-the-art methods were not able to generate samples with high quality from multiple modes. MNIST variants (DIGITS) and Fashion MNIST (Fashion) datasets using NICE (Dinh et al., 2014) to train normalizing flows, with resolutions 14 × 14 for DIGITS and 28 × 28 for Fashion. |
| Dataset Splits | No | The paper mentions total dataset sizes (e.g., 'd = 35, 351 (xi, yi) pairs' for Ionosphere) and the use of 'standardized binary classification datasets' and the 'MNIST Fashion dataset', but it does not specify any training, validation, or test splits for these datasets. It refers to these as benchmark problems and implies standard use, but the explicit split percentages or counts are not provided. |
| Hardware Specification | No | The paper states: 'All experiments are conducted using the Jax library (Bradbury et al., 2021).' and 'Our default experimental setup, unless specified otherwise, is as follows: We use a batch size of 2000 (halved if memory-constrained) and train for 140k gradient steps to ensure approximate convergence.' It also mentions 'bwHPC' and the 'HoreKa supercomputer' in the acknowledgments, which are general computing resources, but lacks specific details on GPUs, CPUs, or memory models used for the experiments. |
| Software Dependencies | No | The paper states: 'All experiments are conducted using the Jax library (Bradbury et al., 2021).' While it mentions Jax as a library, it does not specify a version number for Jax or any other key software dependencies like Python, PyTorch, or CUDA versions, which are crucial for reproducibility. |
| Experiment Setup | Yes | Our default experimental setup, unless specified otherwise, is as follows: We use a batch size of 2000 (halved if memory-constrained) and train for 140k gradient steps to ensure approximate convergence. We use the Adam optimizer (Kingma & Ba, 2014), gradient clipping with a value of 1, and a learning rate scheduler that starts at 8e-3 and uses a cosine decay starting at 60k gradient steps. We utilized 128 discretization steps and the Euler-Maruyama method for integration. The control functions uθ and vγ were parameterized as two-layer neural networks with 128 neurons. For DBS, we set the drift to f = σ2 log π. ... We use a separate learning rate of 10e-2 for all experiments to allow for quick adaptation of the Gaussian components. Furthermore, the mean was initialized at 0 and the initial covariance matrix was set to the identity except for Fashion where we set the initial variance to 5 which roughly covers the support of the target. ... If not otherwise specified, we use K = 10 mixture components for X-GMP. |
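The evaluation criteria quoted above (ESS and the importance-weighted estimate Ẑ with error \|log Z − log Ẑ\|) have standard log-space forms. A minimal JAX sketch of both, assuming per-sample importance log-weights are already computed (the function names here are illustrative, not from the paper's code):

```python
import jax
import jax.numpy as jnp


def log_Z_importance_weighted(log_w):
    """Importance-weighted estimate of log Z from log-weights
    log_w[i] = log(unnormalized target / proposal) at sample i."""
    n = log_w.shape[0]
    return jax.scipy.special.logsumexp(log_w) - jnp.log(n)


def log_Z_estimation_error(log_Z_true, log_w):
    """Estimation error |log Z - log Z_hat| used as a performance criterion."""
    return jnp.abs(log_Z_true - log_Z_importance_weighted(log_w))


def effective_sample_size(log_w):
    """Normalized ESS in (0, 1]: (sum w)^2 / (n * sum w^2), in log space."""
    log_ess = 2.0 * jax.scipy.special.logsumexp(log_w) \
        - jax.scipy.special.logsumexp(2.0 * log_w)
    return jnp.exp(log_ess) / log_w.shape[0]
```

With uniform weights (a perfect proposal) the normalized ESS is 1 and the log Z estimate reduces to the exact value.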
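The setup quotes 128 discretization steps with the Euler-Maruyama method. A generic single step of that integrator for an SDE dX = f(X, t) dt + σ dW, written as a JAX sketch (the drift here is a placeholder argument, not the paper's f = σ² log π):

```python
import jax
import jax.numpy as jnp


def euler_maruyama_step(key, x, t, drift_fn, sigma, dt):
    """One Euler-Maruyama update:
    X_{t+dt} = X_t + f(X_t, t) * dt + sigma * sqrt(dt) * eps,  eps ~ N(0, I).
    """
    noise = jax.random.normal(key, x.shape)
    return x + drift_fn(x, t) * dt + sigma * jnp.sqrt(dt) * noise
```

With 128 steps over a unit time horizon, dt would be 1/128; with zero drift and zero diffusion the step leaves the state unchanged, which is a quick sanity check.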