Investigating the Overlooked Hessian Structure: From CNNs to LLMs
Authors: Qian-Yuan Tang, Yufei Gu, Yunfeng Cai, Mingming Sun, Ping Li, Zhou Xun, Zeke Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments using the proposed power-law spectral method demonstrate that the power-law Hessian spectra relate closely to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Notably, we discover that the power-law Hessian structure of a given LLM can often predict generalization during training on some occasions, whereas conventional sharpness-based generalization measures, which often work well on CNNs, largely fail as effective generalization predictors for LLMs. |
| Researcher Affiliation | Collaboration | 1Department of Physics, Hong Kong Baptist University 2xLeaF Lab, The Hong Kong University of Science and Technology (Guangzhou) 3BIMSA 4AGI Lab, BIMSA 5Rutgers University 6Seed-Foundation-Model Team, ByteDance. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. Methodologies are described in narrative text and figures. |
| Open Source Code | No | The paper mentions using third-party codebases for its experiments (e.g., "We utilized the Stochastic Lanczos Quadrature (SLQ) algorithm implementation from Yao et al. (2020)" and "We use the code base of nanoGPT (Karpathy, 2022) for reproducing all GPT-2 models"), but it provides no statement about, or link to, the authors' own implementation of the work described in the paper. |
| Open Datasets | Yes | Image-classification datasets: MNIST (LeCun, 1998), Fashion-MNIST (Xiao et al., 2017), CIFAR-10/100 (Krizhevsky & Hinton, 2009), and the non-image Avila dataset (De Stefano et al., 2018). Language models: the GPT-2 family (Radford et al., 2019): GPT2-nano (11M), GPT2-small (124M), GPT2-medium (355M), and GPT2-large (774M), and TinyLlama (Zhang et al., 2024a) (1.1B-Chat-v1.0) with a LoRA adapter (Hu et al., 2021). Language datasets: OpenWebText (Gokaslan et al., 2019), Shakespeare (Karpathy, 2015), and MathQA (Amini et al., 2019). |
| Dataset Splits | No | The paper mentions using well-known datasets such as MNIST, Fashion-MNIST, CIFAR-10/100, OpenWebText, Shakespeare, and MathQA. While these datasets typically have standard splits, the paper does not explicitly state the percentages or counts for training, validation, and test splits, nor does it cite a specific source for the splits used. |
| Hardware Specification | Yes | The image classification experiments are conducted on a computing cluster with NVIDIA V100/H800 GPUs and Intel Xeon CPUs. |
| Software Dependencies | No | The paper mentions software tools such as the powerlaw library (Alstott et al., 2014), nanoGPT (Karpathy, 2022), and the Stochastic Lanczos Quadrature (SLQ) algorithm implementation from Yao et al. (2020). However, it does not provide version numbers for any of these libraries, nor for the programming language (e.g., Python) or frameworks (e.g., PyTorch, TensorFlow) used in the experiments. |
| Experiment Setup | Yes | Hyperparameter Settings: We select the optimal learning rate for each experiment from {0.0001, 0.001, 0.01, 0.1, 1, 10} for SGD and use the default learning rate for adaptive gradient methods. In the experiments on MNIST and Fashion-MNIST: η = 0.1 for SGD, Vanilla SGD, Adai, PNM, and Lookahead; η = 0.001 for Adam, AMSGrad, AdaBound, Yogi, RAdam, and DiffGrad. We train neural networks for 50 epochs on MNIST and 200 epochs on Fashion-MNIST. For the learning rate schedule, the learning rate is divided by 10 at 40% and 80% of the total epochs. The batch size is set to 128 for MNIST and Fashion-MNIST, unless we specify it otherwise. The strength of weight decay defaults to λ = 0.0005 as the baseline for all optimizers unless we specify it otherwise. We set the momentum hyperparameter β1 = 0.9 for SGD and for the adaptive gradient methods that involve momentum. |
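The SLQ implementation cited above builds on Lanczos tridiagonalization of the Hessian, accessed only through Hessian-vector products. The following is a minimal NumPy sketch of that building block, not the paper's actual code: `lanczos_spectrum` is a hypothetical name, the toy matrix stands in for a Hessian, and no reorthogonalization is done (real SLQ implementations such as Yao et al. (2020) add this plus stochastic trace averaging).

```python
import numpy as np

def lanczos_spectrum(matvec, dim, steps, rng):
    """Plain Lanczos tridiagonalization of a symmetric operator given only
    via matrix-vector products. The eigenvalues of the small tridiagonal
    matrix T (Ritz values) approximate the extreme eigenvalues of the full
    operator -- the core of SLQ-style Hessian spectrum estimators."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(steps):
        w = matvec(v) - beta * v_prev        # three-term recurrence
        alpha = float(v @ w)
        w -= alpha * v
        beta = float(np.linalg.norm(w))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:                      # invariant subspace found
            break
        v_prev, v = v, w / beta
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    return np.linalg.eigvalsh(T)

# Toy check: an explicit random symmetric matrix standing in for a Hessian.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = (M + M.T) / 2.0
ritz_values = lanczos_spectrum(lambda v: A @ v, dim=200, steps=60, rng=rng)
```

For an actual network Hessian, `matvec` would be a Hessian-vector product computed by automatic differentiation rather than an explicit matrix product.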
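The power-law spectral analysis relies on fitting an exponent to the eigenvalue tail. As a hedged sketch (not the paper's code), the continuous maximum-likelihood estimator below is the standard one the powerlaw package of Alstott et al. (2014) also implements; `powerlaw_alpha_mle` is a hypothetical name, and the Pareto samples are synthetic stand-ins for Hessian eigenvalues.

```python
import numpy as np

def powerlaw_alpha_mle(values, x_min):
    """Continuous MLE for the exponent alpha in p(x) ~ x^(-alpha), x >= x_min.
    The `powerlaw` package additionally selects x_min automatically and
    provides goodness-of-fit comparisons against alternative distributions."""
    x = np.asarray(values, dtype=float)
    x = x[x >= x_min]
    return 1.0 + x.size / np.sum(np.log(x / x_min))

# Synthetic tail with a known exponent, drawn by inverse-CDF sampling.
rng = np.random.default_rng(0)
true_alpha, x_min = 2.5, 1.0
u = rng.random(100_000)
samples = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))
alpha_hat = powerlaw_alpha_mle(samples, x_min)  # should be close to 2.5
```

In practice one would pass the Ritz values or the smoothed spectral density from an SLQ run in place of the synthetic samples.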
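The learning rate schedule described in the setup (divide by 10 once 40% of the epochs have passed, and again at 80%) can be sketched as a small helper; `milestone_lr` is a hypothetical name and the exact milestone rounding in the authors' code is an assumption.

```python
def milestone_lr(epoch, total_epochs, base_lr=0.1):
    """Step schedule from the reported setup: the learning rate is divided
    by 10 at 40% of the total epochs and by 10 again at 80%."""
    lr = base_lr
    if epoch >= int(0.4 * total_epochs):
        lr /= 10.0
    if epoch >= int(0.8 * total_epochs):
        lr /= 10.0
    return lr

# For the 50-epoch MNIST runs: eta = 0.1 up to epoch 19, 0.01 from
# epoch 20, and 0.001 from epoch 40 onward.
schedule = [milestone_lr(e, 50) for e in range(50)]
```

The same effect is obtained in PyTorch with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)` for the 50-epoch case.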