Revisiting minimum description length complexity in overparameterized models

Authors: Raaz Dwivedi, Chandan Singh, Bin Yu, Martin Wainwright

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs.
Researcher Affiliation | Collaboration | Raaz Dwivedi (EMAIL), Department of Operations Research & Information Engineering, Cornell Tech, Cornell University, New York City, NY; Chandan Singh (EMAIL), Microsoft Research, Seattle, WA; Bin Yu (EMAIL), Department of Statistics, and Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA; Martin Wainwright (EMAIL), Department of Electrical Engineering and Computer Sciences, and Mathematics, Massachusetts Institute of Technology, Cambridge, MA
Pseudocode | No | The paper describes algorithms and methods in text and mathematical formulas but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | Code and documentation for easily reproducing the results are provided at github.com/csinva/mdl-complexity.
Open Datasets | Yes | Databases are taken from PMLB (Olson et al., 2017; Vanschoren et al., 2013), a repository of diverse tabular databases for benchmarking machine-learning algorithms. ... functional magnetic-resonance imaging (fMRI), as they are shown natural movies (Nishimoto et al., 2011).
Dataset Splits | Yes | The test set consists of 25% of the entire dataset. The training data consists of 7,200 time points and the test data consists of 540 time points, where at each time point a subject is watching a video clip. Moreover, this criterion can provide computational savings, especially while training overparameterized models, in contrast to vanilla K-fold cross-validation (since computation is only required for a single fold).
Hardware Specification | No | The paper does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. It only implies the use of computing resources for running simulations and experiments.
Software Dependencies | No | Linear models (ridge) and kernel methods are fit using scikit-learn (Pedregosa et al., 2011), and optimization for hyper-parameter tuning (see (34)) is performed using SciPy (Virtanen et al., 2020). For the neural tangent kernel computation, we use the neural-tangents library (Novak et al., 2020) with its default parameters.
Experiment Setup | Yes | We tune the parameter λ over 20 values equally spaced on a log scale from 10^-3 to 10^6. We vary the number of covariates (d) used for fitting the model and report the results for d/n ∈ {1/10, 1/2, 1, 2, 10} (noting that we have a misspecified model when fitting with d < 50 features). For a given dataset, we fix d to be the number of features, and we vary n downwards from its maximum value (by subsampling the dataset) to construct instances with different values of the ratio d/n. The hyperparameter λ takes on 10 values equally spaced on a log scale between 10^-3 and 10^3. In all fMRI experiments, λ takes on 40 values equally spaced on a log scale between 10^0 and 10^6. For the neural tangent kernel computation, we use the neural-tangents library (Novak et al., 2020) with its default parameters (ReLU nonlinearity, two hidden linear layers with hidden size 512).
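The quoted setup (a 75/25 split, a log-spaced λ grid, and ridge fits via scikit-learn) can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the authors' code: the data, grid size, and selection-by-test-MSE shortcut are assumptions for demonstration, whereas the paper selects λ via Prac-MDL-COMP or cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the paper's experiments use PMLB and fMRI datasets.
rng = np.random.default_rng(0)
n, d = 100, 200  # overparameterized regime, d/n = 2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# Hold out 25% of the dataset as the test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 20 values of lambda equally spaced on a log scale from 1e-3 to 1e6.
lambdas = np.logspace(-3, 6, num=20)

# Fit ridge regression at each lambda and record test MSE.
mses = [
    mean_squared_error(
        y_test, Ridge(alpha=lam).fit(X_train, y_train).predict(X_test)
    )
    for lam in lambdas
]
best_lam = lambdas[int(np.argmin(mses))]
print(f"best lambda on this grid: {best_lam:.3g}")
```

The other grids quoted above follow the same pattern, e.g. `np.logspace(-3, 3, num=10)` for the 10-value grid and `np.logspace(0, 6, num=40)` for the fMRI experiments.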