Investigating the Overlooked Hessian Structure: From CNNs to LLMs
Authors: Qian-Yuan Tang, Yufei Gu, Yunfeng Cai, Mingming Sun, Ping Li, Zhou Xun, Zeke Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments using the proposed power-law spectral method demonstrate that the power-law Hessian spectra relate closely to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Notably, we discover that the power-law Hessian structure of a given LLM can often predict generalization during training on some occasions, whereas conventional sharpness-based generalization measures, which often work well on CNNs, largely fail as effective generalization predictors for LLMs. |
| Researcher Affiliation | Collaboration | 1Department of Physics, Hong Kong Baptist University 2xLeaF Lab, The Hong Kong University of Science and Technology (Guangzhou) 3BIMSA 4AGI Lab, BIMSA 5Rutgers University 6Seed-Foundation-Model Team, ByteDance. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. Methodologies are described in narrative text and figures. |
| Open Source Code | No | The paper mentions using third-party codebases for its experiments (e.g., "We utilized the Stochastic Lanczos Quadrature (SLQ) algorithm implementation from Yao et al. (2020)" and "We use the code base of nanoGPT (Karpathy, 2022) for reproducing all GPT-2 models"), but it provides no statement about, or link to, the authors' own implementation of the work described in the paper. |
| Open Datasets | Yes | Image-classification datasets: MNIST (LeCun, 1998), Fashion-MNIST (Xiao et al., 2017), CIFAR-10/100 (Krizhevsky & Hinton, 2009), and the non-image Avila dataset (De Stefano et al., 2018). Language models: the GPT-2 family (Radford et al., 2019): GPT2-nano (11M), GPT2-small (124M), GPT2-medium (355M), and GPT2-large (774M), and TinyLlama (Zhang et al., 2024a) (1.1B-Chat-v1.0) with a LoRA adapter (Hu et al., 2021). Language datasets: OpenWebText (Gokaslan et al., 2019), Shakespeare (Karpathy, 2015), and MathQA (Amini et al., 2019). |
| Dataset Splits | No | The paper mentions using well-known datasets such as MNIST, Fashion-MNIST, CIFAR-10/100, OpenWebText, Shakespeare, and MathQA. While these datasets typically have standard splits, the paper does not explicitly state the percentages or counts for training, validation, and test splits, nor does it cite a specific source for the splits used. |
| Hardware Specification | Yes | The image classification experiments are conducted on a computing cluster with NVIDIA V100/H800 GPUs and Intel Xeon CPUs. |
| Software Dependencies | No | The paper mentions software tools such as the powerlaw library (Alstott et al., 2014), nanoGPT (Karpathy, 2022), and the Stochastic Lanczos Quadrature (SLQ) algorithm implementation from Yao et al. (2020). However, it does not provide version numbers for any of these libraries, nor for the programming language (e.g., Python) or frameworks (e.g., PyTorch, TensorFlow) used in the experiments. |
| Experiment Setup | Yes | Hyperparameter Settings: We select the optimal learning rate for each experiment from {0.0001, 0.001, 0.01, 0.1, 1, 10} for SGD and use the default learning rate for adaptive gradient methods. In the experiments on MNIST and Fashion-MNIST: η = 0.1 for SGD, Vanilla SGD, Adai, PNM, and Lookahead; η = 0.001 for Adam, AMSGrad, AdaBound, Yogi, RAdam, and DiffGrad. We train neural networks for 50 epochs on MNIST and 200 epochs on Fashion-MNIST. For the learning rate schedule, the learning rate is divided by 10 at 40% and 80% of the total epochs. The batch size is set to 128 for MNIST and Fashion-MNIST, unless we specify it otherwise. The strength of weight decay defaults to λ = 0.0005 as the baseline for all optimizers unless we specify it otherwise. We set the momentum hyperparameter β1 = 0.9 for SGD and for the adaptive gradient methods that involve momentum. |
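The SLQ implementation cited above builds on Lanczos tridiagonalization of the Hessian, accessed only through Hessian-vector products. The following is a minimal NumPy sketch of that building block, not the paper's actual code: `lanczos_spectrum` is a hypothetical name, the toy matrix stands in for a Hessian, and no reorthogonalization is done (real SLQ implementations such as Yao et al. (2020) add this plus stochastic trace averaging).

```python
import numpy as np

def lanczos_spectrum(matvec, dim, steps, rng):
    """Plain Lanczos tridiagonalization of a symmetric operator given only
    via matrix-vector products. The eigenvalues of the small tridiagonal
    matrix T (Ritz values) approximate the extreme eigenvalues of the full
    operator -- the core of SLQ-style Hessian spectrum estimators."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(steps):
        w = matvec(v) - beta * v_prev        # three-term recurrence
        alpha = float(v @ w)
        w -= alpha * v
        beta = float(np.linalg.norm(w))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:                      # invariant subspace found
            break
        v_prev, v = v, w / beta
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    return np.linalg.eigvalsh(T)

# Toy check: an explicit random symmetric matrix standing in for a Hessian.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = (M + M.T) / 2.0
ritz_values = lanczos_spectrum(lambda v: A @ v, dim=200, steps=60, rng=rng)
```

For an actual network Hessian, `matvec` would be a Hessian-vector product computed by automatic differentiation rather than an explicit matrix product.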
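The power-law spectral analysis relies on fitting an exponent to the eigenvalue tail. As a hedged sketch (not the paper's code), the continuous maximum-likelihood estimator below is the standard one the powerlaw package of Alstott et al. (2014) also implements; `powerlaw_alpha_mle` is a hypothetical name, and the Pareto samples are synthetic stand-ins for Hessian eigenvalues.

```python
import numpy as np

def powerlaw_alpha_mle(values, x_min):
    """Continuous MLE for the exponent alpha in p(x) ~ x^(-alpha), x >= x_min.
    The `powerlaw` package additionally selects x_min automatically and
    provides goodness-of-fit comparisons against alternative distributions."""
    x = np.asarray(values, dtype=float)
    x = x[x >= x_min]
    return 1.0 + x.size / np.sum(np.log(x / x_min))

# Synthetic tail with a known exponent, drawn by inverse-CDF sampling.
rng = np.random.default_rng(0)
true_alpha, x_min = 2.5, 1.0
u = rng.random(100_000)
samples = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))
alpha_hat = powerlaw_alpha_mle(samples, x_min)  # should be close to 2.5
```

In practice one would pass the Ritz values or the smoothed spectral density from an SLQ run in place of the synthetic samples.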
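The learning rate schedule described in the setup (divide by 10 once 40% of the epochs have passed, and again at 80%) can be sketched as a small helper; `milestone_lr` is a hypothetical name and the exact milestone rounding in the authors' code is an assumption.

```python
def milestone_lr(epoch, total_epochs, base_lr=0.1):
    """Step schedule from the reported setup: the learning rate is divided
    by 10 at 40% of the total epochs and by 10 again at 80%."""
    lr = base_lr
    if epoch >= int(0.4 * total_epochs):
        lr /= 10.0
    if epoch >= int(0.8 * total_epochs):
        lr /= 10.0
    return lr

# For the 50-epoch MNIST runs: eta = 0.1 up to epoch 19, 0.01 from
# epoch 20, and 0.001 from epoch 40 onward.
schedule = [milestone_lr(e, 50) for e in range(50)]
```

The same effect is obtained in PyTorch with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)` for the 50-epoch case.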