Regularized Rényi Divergence Minimization through Bregman Proximal Gradient Algorithms

Authors: Thomas Guilmeau, Emilie Chouzenoux, Víctor Elvira

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | These new theoretical insights lead to a versatile, robust, and competitive method, as illustrated by numerical experiments. Keywords: variational inference, Rényi divergence, Kullback-Leibler divergence, exponential family, Bregman proximal gradient algorithms.
Researcher Affiliation | Academia | Thomas Guilmeau (EMAIL) and Emilie Chouzenoux (EMAIL), Centre pour la Vision Numérique, Université Paris-Saclay, INRIA, CentraleSupélec, 9 rue Joliot Curie, 91190 Gif-sur-Yvette, France; Víctor Elvira (EMAIL), School of Mathematics, University of Edinburgh, James Clerk Maxwell Building, The King's Buildings, Peter Guthrie Tait Road, Edinburgh EH9 3FD, United Kingdom
Pseudocode | Yes | Algorithm 1: Proposed Bregman proximal gradient algorithm; Algorithm 2: Monte Carlo proximal relaxed moment-matching algorithm
Open Source Code | No | The paper does not provide concrete access to source code; it only mentions a license for the paper itself, not for the code implementing the described methodology.
Open Datasets | No | We generate synthetic data from a sparse vector β, each component βi being sampled from the spike-and-slab prior ρ δ0(dβi) + (1 − ρ) L(βi; 0, λ1) dβi.
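The spike-and-slab sampling quoted above can be sketched as follows. This is a minimal sketch, not the paper's code: the Laplace slab's parameterization (rate λ1, i.e. scale 1/λ1), the dimension, and the seed are assumptions.

```python
import numpy as np

def sample_sparse_beta(d, rho=0.3, lam1=20.0, seed=0):
    """Sample a sparse regression vector from the spike-and-slab prior
    quoted above: each component is exactly 0 with probability rho
    (the Dirac spike), otherwise drawn from a Laplace slab. The slab's
    scale 1/lam1 is an ASSUMED parameterization of L(.; 0, lam1)."""
    rng = np.random.default_rng(seed)
    spike = rng.random(d) < rho                       # Dirac-spike components
    slab = rng.laplace(loc=0.0, scale=1.0 / lam1, size=d)
    return np.where(spike, 0.0, slab)

beta = sample_sparse_beta(50)
```

With ρ = 0.3 as in the experiment, roughly a third of the components come out exactly zero, which is the sparse structure the PRMM regularizer later tries to mimic.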
Dataset Splits | Yes | We have set up the experiment with J = 100, Jtest = 50, ρ = 0.3, λ0 = 1, λ1 = 20, and σ2 = 0.5. We perform Ntest = 50 tests.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | All the algorithms are run with an approximating exponential family Q chosen as the family of Gaussian distributions with diagonal covariance matrices; the parametrization of this family is detailed in Appendix C. For the PRMM algorithm, we use the regularizer r(θ) = η ||θ1||_1 with η ≥ 0. This can be understood as the Lagrangian relaxation (Hiriart-Urruty and Lemaréchal, 1993) with multiplier η ≥ 0 of the constraint Σ_{i=1}^d |(θ1)_i| ≤ c, for some c ≥ 0 such that the constrained set is non-empty. Our ℓ1-like regularizer enforces sparsity on all the components of the mean µ, except the component µ0. The idea is to mimic the sparse structure of β that was simulated from p0. The computation of the corresponding Bregman proximal operator for this choice of regularizer is detailed in Appendix C. To assess the performance of the algorithms, we track the variational Rényi bound L^(α)_π(θk), defined in Eq. (14) and estimated at each iteration k ∈ N via its Monte Carlo approximation (Eq. (22)). This quantity can be understood as an approximation of the Rényi divergence that we are trying to minimize (Li and Turner, 2016), and thus as a measure of the proximity between π and its approximation qθk. We also consider the F1 score that each algorithm achieves in the prediction of the zeros of the true regression vector β; it is computed at each iteration k ∈ N by checking how the zeros of µk match those of β. Additionally, we sample regression vectors β from the final proposal qθK and average their absolute error over a test data set {(z0^(j), zT^(j))}_{j=1}^{Jtest}. This is done by computing the following mean absolute error (MAE): MAEtest(θ) = (1 / (Ntest Jtest)) Σ_{i=1}^{Ntest} Σ_{j=1}^{Jtest} | zT^(j) − Φ_T^{β(i)}(z0^(j)) |, with β(i) ∼ qθ, i ∈ ⟦1, Ntest⟧. (23) We compute this quantity for the final proposal qθK in each run and analyze the distributions of the obtained values, in order to assess the robustness of the methods.
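The two evaluation metrics quoted above, the F1 score on zero recovery and the test MAE of Eq. (23), can be sketched as follows. The callables `sample_beta` and `phi` are hypothetical placeholders standing in for sampling from the final proposal q_θ and for the regression map Φ_T, neither of which is specified here.

```python
import numpy as np

def f1_zero_recovery(mu, beta):
    """F1 score for predicting the zeros of the true vector `beta`
    by the zeros of the variational mean `mu` (both 1-D arrays)."""
    pred, true = (mu == 0.0), (beta == 0.0)
    tp = np.sum(pred & true)           # correctly recovered zeros
    if tp == 0:
        return 0.0
    precision = tp / np.sum(pred)
    recall = tp / np.sum(true)
    return float(2 * precision * recall / (precision + recall))

def mae_test(sample_beta, phi, z0_test, zT_test, n_test=50):
    """Monte Carlo estimate of the test MAE in Eq. (23): draw `n_test`
    regression vectors via the placeholder `sample_beta`, predict the
    test responses with the placeholder regression map `phi(beta, z0)`,
    and average absolute errors over draws and test points."""
    errs = [np.abs(zT_test - phi(sample_beta(), z0_test)).mean()
            for _ in range(n_test)]
    return float(np.mean(errs))
```

Tracking the F1 score along the iterates {µk} gives the curves of Figure 3, while `mae_test` evaluated at the final proposal gives the box-plot values of Figure 4.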
We run the algorithms for K = 100 iterations, with N = 500 samples per iteration. The PRMM and RMM algorithms have been implemented with a constant step size τ = 10^-1, while the VRB algorithm uses a constant step size τ = 10^-4. These choices correspond to the most favorable step-size rule for each algorithm, as indicated by our extended experiments in Appendix D. For the PRMM algorithm, we use the regularization parameter η = 0.5. We have set up the experiment with J = 100, Jtest = 50, ρ = 0.3, λ0 = 1, λ1 = 20, and σ2 = 0.5. We perform Ntest = 50 tests. We discarded one run of the VRB algorithm for α = 1 for which the iterates had become singular. We now present and discuss the results of our experiments. Figure 2 shows the increase of the approximated variational Rényi bound described in Eq. (22). As discussed in Section 3.3, an increase in the Rényi bound L^(α)_π(θ) corresponds to a decrease in the Rényi divergence RD_α(π, qθ), so these plots show that all three methods decrease the Rényi divergence. However, our methods reach higher values than the VRB method, and at a faster rate, illustrating the improvement brought by using the Bregman geometry rather than the Euclidean one. Figure 3 shows the F1 score achieved by each algorithm in the retrieval of the zeros of the true regression vector. The RMM and VRB algorithms are not able to recover any zeros, showing that using only a spike-and-slab LASSO prior is not enough to obtain approximations with sparse means. The PRMM algorithm uses a proximity operator to recover the zeros of the sought regression vector; this additional operator leads to a very good recovery of those zeros. Note that the regularizing term is multiplied by a scalar value η > 0 that controls the strength of the regularization relative to the Rényi divergence term.
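For intuition on how a proximity operator with an ℓ1 term zeroes out mean components, here is the standard Euclidean proximity operator of τη‖·‖1 (soft-thresholding), with τ = 10^-1 and η = 0.5 taken from the text. This is only an illustrative Euclidean analogue: the paper's PRMM step uses a Bregman proximal operator, detailed in its Appendix C, not this one.

```python
import numpy as np

def soft_threshold(theta1, tau=0.1, eta=0.5):
    """Euclidean prox of tau * eta * ||.||_1: components with magnitude
    below tau * eta are set exactly to zero, the rest are shrunk toward
    zero. Illustrative stand-in only -- the paper's PRMM algorithm uses
    a Bregman proximal operator (Appendix C), not this Euclidean one."""
    return np.sign(theta1) * np.maximum(np.abs(theta1) - tau * eta, 0.0)
```

With τη = 0.05, a mean component of 0.03 is mapped to exactly 0 while a component of 1.0 only shrinks to 0.95; repeated steps of this kind are what give the PRMM iterates their sparse means.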
Setting this value to achieve the best recovery while preserving good approximating properties is difficult a priori, as is the case, for instance, in maximum a posteriori estimation. [Figure 2: Approximated Rényi bound for (a) α = 0.5 and (b) α = 1.0, averaged over 100 runs with N = 500 samples per iteration. Figure 3: F1 score in the prediction of the zeros of β by the zeros of {µk}_{k=0}^K, for (a) α = 0.5 and (b) α = 1.0, averaged over 100 runs with N = 500 samples per iteration.] The box plots of Fig. 4 assess the quality of the variational approximation of the posterior obtained by each method, by evaluating how well regression vectors sampled from the approximations reconstruct the test data. We see that the PRMM and RMM algorithms yield reconstruction errors that are less spread out and at a lower level than those of the VRB algorithm. Note also that the VRB algorithm produces badly performing outlier values. This shows the higher approximation performance and robustness brought by a better-adapted geometry, as in our proposed method. Errors are more spread out for the PRMM algorithm than for the RMM algorithm. This may be due to the sparsity-inducing proximal step, which creates larger eigenvalues for the covariance matrix (see Appendix C for details).