Towards Robust Scale-Invariant Mutual Information Estimators

Authors: Cheuk Ting Leung, Rohan Ghosh, Mehul Motani

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical and empirical results and show that the original un-normalized estimators are not scale-invariant and highlight the consequences of an estimator's scale-dependence. We propose new global normalization strategies that are tuned to the corresponding estimator and scale invariant. We compare our global normalization strategies to existing local normalization strategies and provide intuitive and empirical arguments to support the use of global normalization. Extensive experiments across multiple distributions and settings are conducted, and we find that our proposed variants KSG-Global-L and MINE-Global-Corrected are most accurate within their respective approaches. Finally, we perform an information plane analysis of neural networks and observe clearer trends of fitting and compression using the normalized estimators compared to the original un-normalized estimators. Our work highlights the importance of scale awareness and global normalization in the MI estimation problem.
Researcher Affiliation | Academia | Cheuk Ting Leung* (EMAIL), Department of Electrical and Computer Engineering, College of Design and Engineering, National University of Singapore; Rohan Ghosh* (EMAIL), Department of Electrical and Computer Engineering, College of Design and Engineering, National University of Singapore; Mehul Motani (EMAIL), Department of Electrical and Computer Engineering, College of Design and Engineering, Institute of Data Science, N.1 Institute for Health, Institute for Digital Medicine (WisDM), National University of Singapore
Pseudocode | No | The paper describes methodologies using mathematical formulations and textual descriptions, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 3 'Methodology' and its subsections detail normalization strategies and estimators without presenting them in a structured algorithm format.
Open Source Code | Yes | The details for our estimators are provided in Appendix F. We used the NPEET MI estimator toolbox for estimating KSG and KSG-based measures [1]. For MINE, we used a PyTorch-based package [2]. Code for all our experiments is available in the Supplementary Material.
Open Datasets | Yes | We evaluate MI on IB (Shwartz-Ziv & Tishby, 2017), MNIST (Deng, 2012), CIFAR-10 (Krizhevsky & Hinton, 2009) and SVHN (Netzer et al., 2011) datasets.
Dataset Splits | No | The paper mentions using specific datasets like IB, MNIST, CIFAR-10, and SVHN for training, and details batch sizes and epochs (e.g., 'batch sizes were 256 for the IB dataset, 128 for the MNIST dataset, and 512 for the CIFAR-10 and SVHN datasets'). However, it does not explicitly state how these datasets were split into training, validation, or test sets (e.g., specific percentages, sample counts, or references to predefined splits).
Hardware Specification | No | The paper provides extensive details about the experimental setup, including parameters, network architectures, and training configurations, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. Phrases like 'trained on' or 'experiments conducted' are used without hardware specifics.
Software Dependencies | No | We used the NPEET MI estimator toolbox for estimating KSG and KSG-based measures [1]. For MINE, we used a PyTorch-based package [2]. The paper names specific software packages (NPEET, a PyTorch-based MINE package) but does not provide their version numbers, which would be necessary for full reproducibility.
Experiment Setup | Yes | Key parameters: Figure 1: average MI estimates for KSG for a varying number of noise dimensions. Setup: additive Gaussian, X, T ∈ R², where T = X + ε with ε ~ N(0, σ²I₂); number of samples: 1000; number of trials: 10; σ = 1. ... Overall MINE implementation: For estimating I(X; T), we used single-hidden-layer ReLU-activated neural networks of the configuration (d_X + d_T) → H1 → H2 → ... → Hk → 1, where d_X + d_T is the dimensionality of the input and H1, ..., Hk are the numbers of hidden neurons of the hidden layers. The last layer is a linear layer. We used the Adam optimizer with a learning rate of 0.001. The hidden-neuron configuration varies depending on the experiment. We set the number of epochs to 50 for all experiments. ... The batch sizes were 256 for the IB dataset, 128 for the MNIST dataset, and 512 for the CIFAR-10 and SVHN datasets.