Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods

Authors: Andres Fernandez, Frank Schneider, Maren Mahsereci, Philipp Hennig

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal an overlap between magnitude parameter masks and top Hessian eigenspaces that is consistently higher than chance level, and this effect is accentuated for larger network sizes. This result indicates that top Hessian eigenvectors tend to concentrate on larger parameters, or equivalently, that larger parameters tend to align with directions of larger loss curvature. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight into the structure of their eigenspaces.
Researcher Affiliation | Collaboration | Andres Fernandez (EMAIL), Tübingen AI Center, University of Tübingen; Frank Schneider (EMAIL), Tübingen AI Center, University of Tübingen; Maren Mahsereci (EMAIL), Yahoo Research; Philipp Hennig (EMAIL), Tübingen AI Center, University of Tübingen
Pseudocode | Yes | Algorithm 1: SSVD (from Tropp et al. (2019)); Algorithm 2: SEIGH
Open Source Code | Yes | To efficiently compute overlap, we develop SEIGH (Section 5 and Alg. 2), a matrix-free eigendecomposition based on sketched SVDs (Tropp et al., 2019). Our open-source implementation (https://github.com/andres-fr/hessian_overlap) allows computing top-k Hessian eigendecompositions for k on the order of 10³ on neural networks with over 10M parameters, an unprecedented scale by orders of magnitude.
Open Datasets | Yes | 16×16 MNIST (MLP); ImageNet (ResNet-18); CIFAR-10 (3c3d-CNN, Schneider et al., 2019); CIFAR-100 (All-CNN-C, Springenberg et al., 2015)
Dataset Splits | Yes | Table 1: Overview of experimental settings, detailing number of model parameters (D), learning rate (η), batch size (B), steps per epoch (T), test accuracy (acc) at step t, number of train/test samples used to compute Htrain/Htest (Ntrain/Ntest respectively), and number of SEIGH outer measurements (n_O, see Alg. 2):

Problem | Model | D | η | B | T | acc | Ntrain/Ntest | n_O
16×16 MNIST | tanh-MLP (Martens & Grosse, 2015) | 7,030 | 0.3 | 500 | 100 | 95.78% (t = 1000) | 500/500 | 355
CIFAR-10 | 3c3d-CNN (Schneider et al., 2019) | 895,210 | 0.0226 | 128 | 312 | 74.52% (t = 8000) | 500/500 | 1000
CIFAR-100 | All-CNN-C (Springenberg et al., 2015) | 1,387,108 | 0.1658 | 256 | 156 | 40.50% (t = 8000) | n.a./1000 | 1000
ImageNet | ResNet-18 (He et al., 2016) | 11,689,512 | 0.1 | 150 | 8207 | 17.33% (t = 8000) | n.a./5000 | 1500
Hardware Specification | Yes | Figure 18: Runtimes of main SEIGH operations to compute a single Hessian eigendecomposition, assuming a single computer with 400 GB RAM equipped with an NVIDIA A100 (40 GB) graphics card.
Software Dependencies | No | We used PyTorch (Paszke et al., 2019) and CurvLinOps.
Experiment Setup | Yes | Table 1: Overview of experimental settings, detailing number of model parameters (D), learning rate (η), batch size (B), steps per epoch (T), test accuracy (acc) at step t, number of train/test samples used to compute Htrain/Htest (Ntrain/Ntest respectively), and number of SEIGH outer measurements (n_O, see Alg. 2):

Problem | Model | D | η | B | T | acc | Ntrain/Ntest | n_O
16×16 MNIST | tanh-MLP (Martens & Grosse, 2015) | 7,030 | 0.3 | 500 | 100 | 95.78% (t = 1000) | 500/500 | 355
CIFAR-10 | 3c3d-CNN (Schneider et al., 2019) | 895,210 | 0.0226 | 128 | 312 | 74.52% (t = 8000) | 500/500 | 1000
CIFAR-100 | All-CNN-C (Springenberg et al., 2015) | 1,387,108 | 0.1658 | 256 | 156 | 40.50% (t = 8000) | n.a./1000 | 1000
ImageNet | ResNet-18 (He et al., 2016) | 11,689,512 | 0.1 | 150 | 8207 | 17.33% (t = 8000) | n.a./5000 | 1500
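The paper's headline quantity is the overlap between a magnitude parameter mask and the top Hessian eigenspace. The small numpy sketch below is a hypothetical illustration, not the authors' implementation: it measures how much of the top-k eigenspace's mass lands on the coordinates of the k largest-magnitude parameters, and compares that to the chance level k/D. The synthetic "Hessian" is built so that its top eigenvectors concentrate on the large-magnitude coordinates, giving an overlap near 1 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 200, 10

# Synthetic "parameters"; the magnitude mask picks the k largest |param|.
params = rng.normal(size=D)
big = np.argsort(-np.abs(params))[:k]

# Synthetic symmetric "Hessian" whose top eigenvectors sit exactly on the
# masked coordinates (plus a small multiple of the identity).
basis = np.zeros((D, k))
basis[big, np.arange(k)] = 1.0
H = basis @ np.diag(np.linspace(10.0, 5.0, k)) @ basis.T + 1e-3 * np.eye(D)

# Top-k eigenspace of H (eigh returns eigenvalues in ascending order).
_, eigvecs = np.linalg.eigh(H)
U = eigvecs[:, -k:]  # D x k matrix of top-k eigenvectors

# Overlap: fraction of the eigenspace's mass on the masked coordinates,
# versus the chance level k/D expected for a random k-dim subspace.
overlap = np.linalg.norm(U[big, :], ord="fro") ** 2 / k
chance = k / D
print(f"overlap={overlap:.3f}  chance={chance:.3f}")  # overlap ~ 1 by construction
```

The normalization by k makes the statistic a fraction in [0, 1]; the names `overlap`, `big`, and the specific chance-level baseline are illustrative assumptions, not the paper's exact definitions.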
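The sketched, matrix-free eigendecomposition idea behind SEIGH can be illustrated with a minimal randomized range-finder plus Rayleigh-Ritz step in the style of Halko/Tropp-type sketched SVDs. This is a simplified stand-in, not the paper's Algorithm 2: the operator is accessed only through matrix-vector products, as a Hessian would be via HVPs.

```python
import numpy as np

def sketched_eigh(matvec, D, k, oversample=10, seed=0):
    """Approximate top-k eigenpairs of a symmetric D x D linear operator,
    accessed only through matrix-vector products (matrix-free)."""
    rng = np.random.default_rng(seed)
    r = k + oversample
    # Sketch: probe the operator with r random Gaussian test vectors.
    Omega = rng.normal(size=(D, r))
    Y = np.column_stack([matvec(Omega[:, j]) for j in range(r)])
    # Orthonormal basis for the sketched range.
    Q, _ = np.linalg.qr(Y)
    # Project the operator onto the basis and solve the small r x r problem.
    B = np.column_stack([matvec(Q[:, j]) for j in range(r)])
    T = Q.T @ B
    T = (T + T.T) / 2  # symmetrize against numerical noise
    vals, vecs = np.linalg.eigh(T)
    # Lift the top-k Ritz pairs back to the full space.
    idx = np.argsort(-np.abs(vals))[:k]
    return vals[idx], Q @ vecs[:, idx]

# Usage on an explicit rank-k symmetric PSD matrix, where the sketch is
# essentially exact because the probes capture the whole range.
rng = np.random.default_rng(1)
D, k = 500, 5
G = rng.normal(size=(D, k))
A = G @ G.T
vals, vecs = sketched_eigh(lambda v: A @ v, D, k)
exact = np.linalg.eigvalsh(A)[-k:][::-1]
err = np.max(np.abs(np.sort(vals)[::-1] - exact))
print(err)  # tiny for an exactly low-rank operator
```

For a true Hessian, `matvec` would be an autodiff HVP and the operator only approximately low-rank; the oversampling parameter then trades accuracy for extra measurements, loosely analogous to the n_O outer measurements in Table 1.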
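"Matrix-free" access means the D x D Hessian is never materialized; only Hessian-vector products Hv are needed. Libraries such as PyTorch (and curvature wrappers like CurvLinOps) obtain Hv exactly via automatic differentiation; the toy sketch below instead uses a central finite difference of an analytic gradient, purely to illustrate the interface. The loss, gradient, and tolerances are illustrative assumptions.

```python
import numpy as np

def loss_grad(theta):
    # Gradient of a toy loss L(theta) = sum(theta**4)/4 - sum(theta**2)/2.
    return theta ** 3 - theta

def hvp(theta, v, eps=1e-5):
    # Matrix-free Hessian-vector product via central differences of the
    # gradient: Hv ~ (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
    return (loss_grad(theta + eps * v) - loss_grad(theta - eps * v)) / (2 * eps)

theta = np.array([0.5, -1.0, 2.0])
v = np.array([1.0, 0.0, -1.0])
# For this loss the Hessian is diagonal with entries 3*theta**2 - 1,
# so the exact product is available for comparison.
exact = (3 * theta ** 2 - 1) * v
print(np.max(np.abs(hvp(theta, v) - exact)))  # small discretization error
```

Passing such an `hvp` closure as the `matvec` of a sketched solver is what makes Hessian eigendecompositions feasible at the 10M-parameter scale reported above, since each probe costs only a few gradient evaluations.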