Sassha: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation
Authors: Dahun Shin, Dongyeop Lee, Jinseok Chung, Namhoon Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate the effectiveness of SASSHA across diverse vision and natural language tasks. Our results reveal that SASSHA consistently achieves flatter minima and attains stronger generalization performance, all compared to existing practical second-order methods, and interestingly, to first-order methods including SGD, AdamW, and SAM. |
| Researcher Affiliation | Academia | POSTECH. Correspondence to: Dahun Shin <EMAIL>, Dongyeop Lee <EMAIL>. |
| Pseudocode | Yes | The exact steps of SASSHA are outlined in Algorithm 1. |
| Open Source Code | Yes | The code to reproduce all results reported in this work is made available for download at https://github.com/LOG-postech/Sassha. |
| Open Datasets | Yes | We first evaluate SASSHA for image classification on CIFAR-10, CIFAR-100, and ImageNet. [...] Specifically, we train GPT2-mini, a scaled-down variant of GPT-2 (Radford et al., 2019), on the Wikitext-2 dataset (Merity et al., 2022) using various methods [...]. We also extend our evaluation to finetuning tasks. Specifically, we finetune SqueezeBERT (Iandola et al., 2020) for diverse tasks in the GLUE benchmark (Wang et al., 2018). |
| Dataset Splits | Yes | We first evaluate SASSHA for image classification on CIFAR-10, CIFAR-100, and ImageNet. [...] We introduced label noise by randomly corrupting a fraction of the training data at rates of 20%, 40%, and 60%. The use of standard benchmark datasets like CIFAR-10/100 and ImageNet implies the use of their well-known, pre-defined train/test/validation splits, which are standard in the field. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It describes experimental settings in terms of datasets, models, and hyperparameters, but omits hardware specifications. |
| Software Dependencies | No | The paper describes experimental settings in detail but does not provide specific ancillary software details, such as library names with version numbers (e.g., PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | Here, we describe our experiment settings in detail. We evaluate SASSHA against AdaHessian (Yao et al., 2021), Sophia-H (Liu et al., 2024), Shampoo (Gupta et al., 2018), SGD, AdamW (Loshchilov & Hutter, 2018), and SAM (Foret et al., 2021) across a diverse set of vision and language tasks. Across all evaluations except for language finetuning, we set the lazy Hessian update interval to k = 10 for SASSHA. [...] All experiments were conducted with a batch size of 256. The hyperparameter search space for each method is detailed in Table 9. |
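The "lazy Hessian update interval k = 10" quoted above means the (expensive) curvature estimate is refreshed only once every k steps and reused in between. A minimal sketch of that idea, assuming a diagonal-preconditioned descent loop; the function names (`grad_fn`, `hess_diag_fn`, `lazy_hessian_descent`) and the toy quadratic are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def lazy_hessian_descent(grad_fn, hess_diag_fn, x0, lr=0.5, k=10, steps=50, eps=1e-8):
    """Toy diagonal-preconditioned descent with a lazy Hessian update:
    the Hessian diagonal is re-estimated only every k steps (hypothetical
    sketch of the interval described in the excerpt, not SASSHA itself)."""
    x = x0.astype(float).copy()
    d = None
    for t in range(steps):
        if t % k == 0:
            # refresh the expensive curvature estimate every k steps
            d = np.abs(hess_diag_fn(x)) + eps
        g = grad_fn(x)
        x -= lr * g / d  # preconditioned step reusing the stale diagonal
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(A) x as a toy objective.
A = np.array([1.0, 100.0])
x_final = lazy_hessian_descent(lambda x: A * x,   # exact gradient
                               lambda x: A,       # exact Hessian diagonal
                               np.array([5.0, 5.0]))
```

Reusing a stale diagonal amortizes the Hessian cost over k gradient steps; on this quadratic the curvature is constant, so laziness loses nothing, and each coordinate contracts at the same rate regardless of conditioning.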