Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods
Authors: Andres Fernandez, Frank Schneider, Maren Mahsereci, Philipp Hennig
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal an overlap between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level, and that this effect gets accentuated for larger network sizes. This result indicates that top Hessian eigenvectors tend to be concentrated around larger parameters, or equivalently, that larger parameters tend to align with directions of larger loss curvature. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace. |
| Researcher Affiliation | Collaboration | Andres Fernandez EMAIL Tübingen AI Center University of Tübingen Frank Schneider EMAIL Tübingen AI Center University of Tübingen Maren Mahsereci EMAIL Yahoo Research Philipp Hennig EMAIL Tübingen AI Center University of Tübingen |
| Pseudocode | Yes | Algorithm 1: SSVD (from Tropp et al. (2019)); Algorithm 2: SEIGH |
| Open Source Code | Yes | To efficiently compute overlap, we develop SEIGH (Section 5 and Alg. 2), a matrix-free eigendecomposition based on sketched SVDs (Tropp et al., 2019). Our open source implementation allows computing top-k Hessian eigendecompositions for k > 10³ on neural networks with over 10M parameters, an unprecedented scale by orders of magnitude. https://github.com/andres-fr/hessian_overlap |
| Open Datasets | Yes | MLP on 16×16 MNIST; ResNet-18 on ImageNet; 3c3d-CNN on CIFAR-10 (Schneider et al., 2019); All-CNN-C on CIFAR-100 (Springenberg et al., 2015) |
| Dataset Splits | Yes | Table 1: Overview of experimental settings, detailing number of model parameters (D), learning rate (η), batch size (B), steps per epoch (T), test accuracy (acc) at step t, number of train/test samples used to compute Htrain/Htest (Ntrain/Ntest respectively), and number of SEIGH outer measurements (nO, see Alg. 2). 16×16 MNIST, tanh-MLP (Martens & Grosse, 2015): D=7030, η=0.3, B=500, T=100, acc=95.78% (t=1000), Ntrain/Ntest=500/500, nO=355. CIFAR-10, 3c3d-CNN (Schneider et al., 2019): D=895,210, η=0.0226, B=128, T=312, acc=74.52% (t=8000), Ntrain/Ntest=500/500, nO=1000. CIFAR-100, All-CNN-C (Springenberg et al., 2015): D=1,387,108, η=0.1658, B=256, T=156, acc=40.50% (t=8000), Ntrain/Ntest=n.a./1000, nO=1000. ImageNet, ResNet-18 (He et al., 2016): D=11,689,512, η=0.1, B=150, T=8207, acc=17.33% (t=8000), Ntrain/Ntest=n.a./5000, nO=1500. |
| Hardware Specification | Yes | Figure 18: Runtimes of main SEIGH operations to compute a single Hessian eigendecomposition, assuming a single computer with 400GB RAM equipped with an NVIDIA A100 (40GB) graphics card. |
| Software Dependencies | No | We used PyTorch (Paszke et al., 2019) and CurvLinOps. |
| Experiment Setup | Yes | Table 1: Overview of experimental settings, detailing number of model parameters (D), learning rate (η), batch size (B), steps per epoch (T), test accuracy (acc) at step t, number of train/test samples used to compute Htrain/Htest (Ntrain/Ntest respectively), and number of SEIGH outer measurements (nO, see Alg. 2). 16×16 MNIST, tanh-MLP (Martens & Grosse, 2015): D=7030, η=0.3, B=500, T=100, acc=95.78% (t=1000), Ntrain/Ntest=500/500, nO=355. CIFAR-10, 3c3d-CNN (Schneider et al., 2019): D=895,210, η=0.0226, B=128, T=312, acc=74.52% (t=8000), Ntrain/Ntest=500/500, nO=1000. CIFAR-100, All-CNN-C (Springenberg et al., 2015): D=1,387,108, η=0.1658, B=256, T=156, acc=40.50% (t=8000), Ntrain/Ntest=n.a./1000, nO=1000. ImageNet, ResNet-18 (He et al., 2016): D=11,689,512, η=0.1, B=150, T=8207, acc=17.33% (t=8000), Ntrain/Ntest=n.a./5000, nO=1500. |
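The core machinery the paper describes, a matrix-free sketched eigendecomposition of the Hessian plus an overlap score between a top-k magnitude mask and the top-k eigenspace, can be illustrated in miniature. The sketch below is a simplified randomized eigensolver in the style of Tropp et al. (2019), not the authors' exact SEIGH (Alg. 2), and the `matvec`/`mask_overlap` names are illustrative assumptions; it only shows the shape of the computation on a small dense surrogate for the Hessian.

```python
import numpy as np


def sketched_top_eigs(matvec, dim, k, n_oversample=10, seed=0):
    """Approximate top-k eigenpairs (by magnitude) of a symmetric operator,
    accessed only through matrix-vector products, via a randomized range
    sketch. Simplified stand-in for the paper's SEIGH, not its exact algorithm."""
    rng = np.random.default_rng(seed)
    # Random Gaussian test matrix: k + oversampling "measurements"
    omega = rng.standard_normal((dim, k + n_oversample))
    # Sketch the operator's range with one pass of matvecs
    Y = np.column_stack([matvec(omega[:, i]) for i in range(omega.shape[1])])
    Q, _ = np.linalg.qr(Y)  # orthonormal basis for the sketched range
    # Project the operator into the sketched subspace and solve the small problem
    B = np.column_stack([matvec(Q[:, i]) for i in range(Q.shape[1])])
    T = Q.T @ B  # small (k + p) x (k + p) symmetric matrix
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(np.abs(evals))[::-1][:k]  # keep top-k by magnitude
    return evals[order], Q @ evecs[:, order]


def mask_overlap(params, eigvecs, k):
    """Fraction of top-k eigenspace energy captured by the top-k magnitude
    mask of `params` (1/k * sum of squared eigenvector entries on the mask)."""
    mask_idx = np.argsort(np.abs(params))[::-1][:k]
    mask = np.zeros(len(params), dtype=bool)
    mask[mask_idx] = True
    return np.sum(eigvecs[mask, :] ** 2) / k


# Usage on a synthetic symmetric "Hessian" with a few dominant directions.
rng = np.random.default_rng(0)
n = 60
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
d = np.concatenate([[100.0, -80.0, 60.0, 40.0, 20.0],
                    0.1 * rng.standard_normal(n - 5)])
H = (V * d) @ V.T  # H = V diag(d) V^T, symmetric and indefinite
evals, evecs = sketched_top_eigs(lambda v: H @ v, n, k=5, n_oversample=15)
params = rng.standard_normal(n)  # stand-in for trained network weights
overlap = mask_overlap(params, evecs, k=5)
```

The real SEIGH additionally amortizes the measurements over HDF5-backed storage so the procedure scales to D > 10M parameters, where the Hessian is only available through autodiff Hessian-vector products.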