Tilted Sharpness-Aware Minimization
Authors: Tian Li, Tianyi Zhou, Jeff Bilmes
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks. [...] We empirically demonstrate that TSAM results in flatter solutions and superior generalization performance than SAM and its variants for deep neural networks including transformers on both image and text datasets (Section 5). |
| Researcher Affiliation | Academia | 1University of Chicago 2University of Maryland, College Park 3University of Washington. Correspondence to: Tian Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Tilted SAM Solver [...] Algorithm 3 Sampling from e^{tL(θ_i+ϵ)} where ‖ϵ‖ ≤ ρ |
| Open Source Code | Yes | Our code is publicly available at github.com/litian96/TSAM. |
| Open Datasets | Yes | First, we explore training ResNet18 (He et al., 2016) and WideResNet16-8 (Zagoruyko, 2016) on CIFAR100 (Krizhevsky et al., 2009). [...] We study the performance of finetuning ViTs (pretrained on ImageNet (Deng et al., 2009)) on an out-of-distribution Describable Texture Dataset (DTD) (Cimpoi et al., 2014), where the task is 47-class classification. [...] Additionally, we evaluate a 200-class classification task for Tiny ImageNet (Le & Yang, 2015) with ResNet18 and ResNet34 (He et al., 2016) models. Lastly, for text data, we study finetuning a pretrained DistilBERT (Sanh, 2019) model on the GLUE benchmark including both classification and regression problems. |
| Dataset Splits | No | The paper mentions specific datasets (CIFAR100, DTD, ImageNet, Tiny Imagenet, GLUE benchmark) but does not explicitly provide details about the train/test/validation splits used for these datasets, such as percentages, sample counts, or references to specific predefined splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments (e.g., GPU models, CPU types, or cloud computing specifications). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., specific library versions for PyTorch, TensorFlow, etc.) within the main text or appendix. |
| Experiment Setup | Yes | Hyperparameter Tuning. We take µ(ϵ) to be supported on ‖ϵ‖ ≤ ρ for all TSAM experiments, and tune the ρ parameters separately from {0.05, 0.1, 0.2} for relevant methods. For TSAM, we tune t from {0, 1, 5, 20, 100} and select the best one based on the validation set. [...] We use s=3 or s=5 sampled ϵ's for all datasets and find that it works well. [...] The batch size is 64 for all the datasets and methods, and a constant learning rate is tuned from {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1} for each algorithm. |
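The setup row above revolves around TSAM's tilt parameter t and the s sampled perturbations ϵ per step. As a minimal sketch (not the authors' implementation, which is at github.com/litian96/TSAM), the snippet below shows how a tilted aggregate of per-perturbation losses behaves, assuming the tilted objective (1/t) log E_ϵ[e^{tL(θ+ϵ)}] approximated by a mean over s samples; the log-sum-exp shift is a standard stabilization trick:

```python
import numpy as np

def tilted_loss(losses, t):
    """Tilted aggregation (1/t) * log(mean(exp(t * losses))).

    Assumes `losses` holds L(theta + eps_j) for s sampled perturbations.
    t = 0 recovers the plain average over perturbations; larger t
    upweights the worst (sharpest) perturbations, approaching the max.
    A log-sum-exp shift avoids overflow for large t.
    """
    losses = np.asarray(losses, dtype=float)
    if t == 0:
        return float(losses.mean())
    m = t * losses
    shift = m.max()
    return float((shift + np.log(np.mean(np.exp(m - shift)))) / t)

# Hypothetical losses from s = 3 sampled perturbations
sample_losses = [0.9, 1.1, 2.0]
avg = tilted_loss(sample_losses, t=0)     # plain mean over samples
sharp = tilted_loss(sample_losses, t=20)  # close to the worst-case loss
```

By Jensen's inequality the tilted value is at least the mean for t > 0 and increases toward the maximum sampled loss as t grows, which matches the paper's tuned grid t ∈ {0, 1, 5, 20, 100} interpolating between average-case and worst-case sharpness.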