Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

Authors: Adwait Datar, Nihat Ay

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We complement our theoretical results with empirical studies in Section 5, extending the analysis to practical settings where only finite samples from the target distribution are available."
Researcher Affiliation | Academia | ¹Institute for Data Science Foundations, Hamburg University of Technology, 21073 Hamburg, Germany; ²Santa Fe Institute, Santa Fe, NM 87501, USA; ³Leipzig University, 04109 Leipzig, Germany
Pseudocode | No | "The dynamics described by equation 30, equation 31 and equation 32 can be written in the general form x(k + 1) = (I − αQ) x(k) + αQx*, where Q is a symmetric positive definite matrix." This is a mathematical description of the dynamics, not structured pseudocode or an algorithm block.
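The quoted iteration is a standard linear fixed-point scheme with fixed point x*; a minimal runnable sketch, with Q, x*, and the step size α chosen illustratively (none of these values are from the paper):

```python
import numpy as np

# Sketch of the quoted dynamics: x(k+1) = (I - alpha*Q) x(k) + alpha*Q x*,
# with Q symmetric positive definite. Q, x_star, and alpha are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T + 3 * np.eye(3)          # symmetric positive definite by construction
x_star = np.array([1.0, -2.0, 0.5])  # the iteration's fixed point

alpha = 1.0 / np.linalg.eigvalsh(Q).max()  # safe step size: alpha < 2 / lambda_max(Q)
x = np.zeros(3)
for _ in range(200):
    x = (np.eye(3) - alpha * Q) @ x + alpha * Q @ x_star

print(x)  # converges to x_star, since all eigenvalues of I - alpha*Q lie in (0, 1)
```

Substituting x(k) = x* shows x* is indeed stationary: (I − αQ)x* + αQx* = x*.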
Open Source Code | No | No concrete access to source code is provided. The paper does not contain any statements about releasing code or links to repositories.
Open Datasets | No | "Given a data sequence D sampled with respect to q, we can estimate this expectation using the empirical mean..." No specific public datasets or access details are provided for the data sequence D.
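The quoted empirical-mean estimator can be sketched in a few lines; the distribution q and the function f below are illustrative stand-ins, not the paper's target distribution:

```python
import numpy as np

# Sketch of the quoted estimator: given a data sequence D sampled from q,
# approximate E_q[f(X)] by the empirical mean over D. Here q = N(0, 1) and
# f(x) = x^2 are illustrative, so the true expectation is Var(X) = 1.
rng = np.random.default_rng(42)
D = rng.normal(loc=0.0, scale=1.0, size=100_000)  # stand-in data sequence from q
f = lambda x: x**2
estimate = np.mean(f(D))
print(estimate)  # close to 1.0 by the law of large numbers
```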
Dataset Splits | No | "This section aims to bridge the gap between our theoretical analysis and practical implementations by investigating empirical versions of the KL divergence and their associated optimization dynamics. ... where at each iteration k, a mini-batch Dk ⊆ D is drawn uniformly at random..." The paper refers to mini-batching for SGD, but does not provide specific training/test/validation dataset splits.
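The quoted mini-batch scheme (draw Dk uniformly at random from D at each iteration k) can be sketched as follows; the dataset, loss, and step size are illustrative placeholders rather than the paper's setup:

```python
import numpy as np

# Illustrative SGD sketch: at each iteration k, a mini-batch D_k ⊆ D is drawn
# uniformly at random. The scalar squared loss is a placeholder objective.
rng = np.random.default_rng(1)
D = rng.standard_normal(1000)   # stand-in data sequence sampled from q
theta = 5.0                     # scalar parameter, illustrative starting point
batch_size, alpha = 32, 0.1

for k in range(500):
    D_k = rng.choice(D, size=batch_size, replace=False)  # mini-batch D_k ⊆ D
    grad = np.mean(2 * (theta - D_k))  # gradient of mean squared loss on D_k
    theta -= alpha * grad

# theta ends up near the empirical mean of D, the minimizer of the full loss
print(theta, D.mean())
```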
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or machine configurations) are mentioned in the paper.
Software Dependencies | No | No specific software versions or libraries are mentioned in the paper.
Experiment Setup | Yes | "This is illustrated in Figure 8, where we set αη = αθ = αng = 0.01 for n = 2 (left) and αη = αθ = αng = 0.001 for n = 10 (right)." Table 1 (Optimal Learning Rates and Convergence Times from Figure 9) reports:

coordinates       | optimal learning rate  | optimal convergence time
η coordinates     | αη = 0.0036            | k = 29
natural gradient  | αng ∈ [0.7141, 1.16]   | k = 2
θ coordinates     | αθ ∈ [11.96, 14.24]    | k = 12
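The notion of an optimal learning rate in Table 1 matches the classical analysis of linear iterations of the form x(k+1) = (I − αQ)x(k) + αQx*: the error e(k) = x(k) − x* evolves as e(k+1) = (I − αQ)e(k), so the per-step contraction factor is max_i |1 − αλ_i(Q)|, minimized at α = 2/(λ_min + λ_max). A toy check with an illustrative Q (not the paper's Fisher matrix):

```python
import numpy as np

# Classical optimal step size for e(k+1) = (I - alpha*Q) e(k):
# minimize max_i |1 - alpha*lam_i(Q)| over alpha. Q here is a toy SPD matrix.
Q = np.diag([1.0, 4.0])                    # illustrative, eigenvalues 1 and 4
lam = np.linalg.eigvalsh(Q)
alpha_opt = 2.0 / (lam.min() + lam.max())  # = 2/5 = 0.4
rho = np.abs(1 - alpha_opt * lam).max()    # optimal contraction factor, = 0.6
print(alpha_opt, rho)
```

With this α both extreme eigenvalues contract at the same rate |1 − αλ| = 0.6, which is why the optimum balances λ_min against λ_max.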