Flow-based Variational Mutual Information: Fast and Flexible Approximations
Authors: Caleb Dahlke, Jason Pacheco
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally verify that our new methods are effective on large MI problems where discriminative-based estimators, such as MINE and InfoNCE, are fundamentally limited. Furthermore, we compare against a diverse set of benchmarking tests to show that the flow-based estimators often perform as well, if not better, than the discriminative-based counterparts. Finally, we demonstrate how these estimators can be effectively utilized in the Bayesian Optimal Experimental Design setting for online sequential decision making. |
| Researcher Affiliation | Academia | Caleb Dahlke Department of Mechanical Engineering University of Michigan Ann Arbor, MI, USA EMAIL Jason Pacheco Department of Computer Science University of Arizona Tucson, AZ, USA EMAIL |
| Pseudocode | Yes | A.1 PSEUDOCODE We briefly provide an outline of each method's pseudocode. Some of the notation is slightly changed in an attempt to highlight the parameters of the models. For example, for NVF, fprior(X) is referred to as fθprior(X) to highlight the parameters, θprior, of the prior flow. **for** i = 1:K **do**: sample {xj}j=1:N ∼ p(X); sample {yj}j=1:N ∼ p(y \| xj); μ ← (1/N) Σj (fθ(xj)ᵀ, gψ(yj)ᵀ)ᵀ; Σ ← (1/(N−1)) Σj (fθ(xj)ᵀ, gψ(yj)ᵀ)(fθ(xj)ᵀ, gψ(yj)ᵀ)ᵀ − μμᵀ; Loss ← log\|2πeΣ\| − (2/N) Σj log\|∇x fθ(xj)\| − (2/N) Σj log\|∇y gψ(yj)\|; θ ← θ − α∇θ Loss; ψ ← ψ − β∇ψ Loss; **end**; then recompute μ and Σ as above and return I_JVF ← ½ log\|2πeΣ_Z\| − ½ log\|2πeΣ_{Z\|V}\|. (Algorithm 1: JVF Pseudocode) |
| Open Source Code | Yes | All code can be found at https://github.com/calebdahlke/FlowMI. |
| Open Datasets | Yes | Our next experiment is a collection of benchmarks from Czyż et al. (2023). They construct a diverse family of distributions with known ground-truth MI consisting of Gaussian, Uniform, and Student-t distributions that have MI-invariant transformations applied to them (see Appendix B.2 for more details). |
| Dataset Splits | Yes | We take a large set of 75,000 samples to train our distribution-based estimators as well as a variety of discriminative-based estimators for 3000 training steps using a batch size of N = 256. We utilize an 80-20 train-test split of the total samples. Each estimator is given 1000 samples per dimension with a 50-50 train-test split. |
| Hardware Specification | Yes | All experiments were run on a high-performance computing cluster with nodes consisting of 2x AMD EPYC 7642 48-core (Rome) CPUs, 512GB of RAM, and NVIDIA V100S GPUs. |
| Software Dependencies | No | The paper mentions various software components such as "neural network", "ReLU activation functions", "Rational Quadratic Splines (Durkan et al., 2019)" but does not specify their version numbers or the versions of programming languages/libraries like Python or PyTorch. |
| Experiment Setup | Yes | We train all estimators using a batch size of N = 256 for a total of 3000 steps to ensure all estimators converged. The critic-based methods (DV, MINE, InfoNCE, and NWJ) all use a neural network with two hidden layers of structure [16, 8] with ReLU activation functions. Czyż et al. (2023) utilized lr = 0.1 and a batch size of N = 256, which we kept the same. The normalizing flows utilized in JVF and NVF are Rational Quadratic Splines (Durkan et al., 2019), which learn 128 knots to parameterize the spline. The spline is bounded between −8 and 8 and is linear outside this range. The knots are learned from a neural network of size [8, 8] with ReLU activation functions. To prevent overfitting of the flows, we utilize dropout with a rate of 0.2 and L2 regularization with a factor of 10⁻⁵. The neural parameters used in NVG and NVF are the mean and variance, parameterized by two neural networks of size [16, 8] with ReLU activation functions. The output of the neural network parameterizing the covariance is a lower-triangular matrix approximating the Cholesky decomposition of Σ, where the diagonal elements have a Softplus activation applied to them to ensure the resulting matrix is positive semi-definite. These parameters did not have dropout applied but did have the same L2 regularization with a factor of 10⁻⁵. NVG, JVF, and NVF all utilized a learning rate of 0.005 and reduced the learning rate by a factor of 0.1 if no improvement in test loss was seen over 250 testing steps. |
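The final step of the JVF pseudocode quoted above computes a closed-form Gaussian MI from the flow outputs Z = fθ(X) and V = gψ(Y). The sketch below illustrates only that closed-form step in NumPy, with the trained flows replaced by identity maps on synthetic correlated Gaussians whose ground-truth MI is known; the function and variable names here are illustrative assumptions, not taken from the authors' released code.

```python
import numpy as np

def gaussian_mi_from_embeddings(z, v):
    """Gaussian MI of flow outputs: 0.5*(log|Sigma_Z| + log|Sigma_V| - log|Sigma_joint|).

    z : (N, dz) array of flow outputs f_theta(x)
    v : (N, dv) array of flow outputs g_psi(y)
    Equivalent to 0.5*log|2*pi*e*Sigma_Z| - 0.5*log|2*pi*e*Sigma_{Z|V}|.
    """
    joint = np.hstack([z, v])
    dz = z.shape[1]
    cov = np.cov(joint, rowvar=False)          # sample covariance of (Z, V)
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_z = np.linalg.slogdet(cov[:dz, :dz])
    _, logdet_v = np.linalg.slogdet(cov[dz:, dz:])
    return 0.5 * (logdet_z + logdet_v - logdet_joint)

# Sanity check on correlated Gaussians with known ground-truth MI.
rng = np.random.default_rng(0)
rho, n = 0.8, 100_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
est = gaussian_mi_from_embeddings(x[:, None], y[:, None])
true_mi = -0.5 * np.log(1 - rho**2)  # about 0.511 nats
```

With identity flows this reduces to the classical Gaussian MI estimator; in JVF the same formula is applied after the splines have transformed the joint samples toward Gaussianity, so the closed-form step remains exact up to the flow's fit.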