Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
Authors: Adwait Datar, Nihat Ay
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theoretical results with empirical studies in Section 5, extending the analysis to practical settings where only finite samples from the target distribution are available. |
| Researcher Affiliation | Academia | ¹Institute for Data Science Foundations, Hamburg University of Technology, 21073 Hamburg, Germany; ²Santa Fe Institute, Santa Fe, NM 87501, USA; ³Leipzig University, 04109 Leipzig, Germany |
| Pseudocode | No | The dynamics described by equation 30, equation 31 and equation 32 can be written in the general form x(k + 1) = (I − αQ) x(k) + αQ x^\*, where Q is a symmetric positive definite matrix. This is a mathematical description of the dynamics, not structured pseudocode or an algorithm block. |
| Open Source Code | No | No concrete access to source code is provided. The paper does not contain any statements about releasing code or links to repositories. |
| Open Datasets | No | Given a data sequence D sampled with respect to q, we can estimate this expectation using the empirical mean... No specific public datasets or access details are provided for the data sequence D. |
| Dataset Splits | No | This section aims to bridge the gap between our theoretical analysis and practical implementations by investigating empirical versions of the KL divergence and their associated optimization dynamics. ... where at each iteration k, a mini-batch Dk ⊆ D is drawn uniformly at random... The paper refers to mini-batching for SGD, but does not provide specific training/test/validation dataset splits. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific computer configurations) are mentioned in the paper. |
| Software Dependencies | No | No specific software versions or libraries are mentioned in the paper. |
| Experiment Setup | Yes | This is illustrated in Figure 8, where we set αη = αθ = αng = 0.01 for n = 2 (left) and αη = αθ = αng = 0.001 for n = 10 (right). ... Table 1 (Optimal Learning Rates and Convergence Times from Figure 9): η coordinates, optimal learning rate αη = 0.0036, optimal convergence time k = 29; natural gradient, αng ∈ [0.7141, 1.16], k = 2; θ coordinates, αθ ∈ [11.96, 14.24], k = 12. |
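The general linear form quoted in the Pseudocode row, x(k + 1) = (I − αQ) x(k) + αQ x^\*, can be simulated directly. The sketch below is illustrative only: Q, x_star, the step size α, and the dimension are hypothetical choices, not values from the paper; the paper itself provides no code.

```python
import numpy as np

def iterate(Q, x_star, x0, alpha, steps):
    """Run x(k+1) = (I - alpha*Q) x(k) + alpha*Q x_star for `steps` iterations."""
    I = np.eye(Q.shape[0])
    A = I - alpha * Q          # iteration matrix; contracts when 0 < alpha < 2/lambda_max(Q)
    x = x0.astype(float)
    for _ in range(steps):
        x = A @ x + alpha * Q @ x_star
    return x

# Hypothetical example: a 2x2 symmetric positive definite Q and target x_star.
Q = np.array([[2.0, 0.0],
              [0.0, 0.5]])
x_star = np.array([1.0, -1.0])  # x_star is the unique fixed point of the update
x0 = np.zeros(2)

x_final = iterate(Q, x_star, x0, alpha=0.1, steps=500)
```

Since Q is symmetric positive definite, the fixed point x_star is globally attracting whenever the step size satisfies 0 < α < 2/λ_max(Q), which is the kind of step-size condition the paper's convergence analysis and the learning-rate sweeps of Figure 9 revolve around.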