How far away are truly hyperparameter-free learning algorithms?
Authors: Priya Kasimbeg, Vincent Roulet, Naman Agarwal, Sourabh Medapati, Fabian Pedregosa, Atish Agarwala, George E. Dahl
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we evaluate the potential of learning-rate-free methods as components of hyperparameter-free methods. We freeze their (non-learning rate) hyperparameters to default values, and score their performance using the recently-proposed AlgoPerf: Training Algorithms benchmark. We found that literature-supplied default settings performed poorly on the benchmark, so we performed a search for hyperparameter configurations that performed well across all workloads simultaneously. The best AlgoPerf-calibrated learning-rate-free methods had much improved performance but still lagged slightly behind a similarly calibrated NadamW baseline in overall benchmark score. |
| Researcher Affiliation | Industry | Priya Kasimbeg EMAIL Google DeepMind Vincent Roulet EMAIL Google DeepMind ... Sourabh Medapati EMAIL Google DeepMind Fabian Pedregosa EMAIL Google DeepMind Atish Agarwala EMAIL Google DeepMind George E. Dahl EMAIL Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Adam with D-Adapt; Algorithm 2: Prodigy algorithm; Algorithm 3: Mechanic: A Learning Rate Tuner; Algorithm 4: MoMo: Model-based Momentum method. |
| Open Source Code | No | The paper mentions "publicly available libraries" (Section 4) and "official implementation" for Prodigy (Section 2.3), and a "JAX implementation of Schedule-Free Adam" (Appendix F), but these refer to third-party or general implementations and not specific source code released by the authors for the methodology presented in this paper. No direct link or explicit statement of code release by the authors is provided. |
| Open Datasets | Yes | We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark (Dahl et al., 2023). The AlgoPerf workloads span a diverse collection of architectures and datasets across image, text, speech, and graph domains. Each workload is specified by a Dataset, Model, Loss, and evaluation metric (see Table 2 for a summary). |
| Dataset Splits | Yes | We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark on the self-tuning track (Dahl et al., 2023). ... The AlgoPerf benchmark adopted performance profiles (Dolan and Moré, 2002), which are a convenient measurement of the performance of a set of algorithms aggregated across a set of workloads. ... We set a validation evaluation metric target value, and measured how quickly the target is reached (or if it is reached at all within the budget). |
| Hardware Specification | Yes | The workloads in this study were trained on TPUv2 2x2 slices (16GB HBM per chip), with the exception of Criteo 1TB DLRM small which was trained on TPUv3 4x4 (32GB HBM per chip). |
| Software Dependencies | No | The paper mentions "Optax implementations" in Appendix A and a "JAX implementation of Schedule-Free Adam" in Appendix F. However, it does not provide specific version numbers for these or any other software libraries or frameworks used. |
| Experiment Setup | Yes | We specified the optimizer's hyperparameters, regularization hyperparameters, and a learning rate schedule (or equivalent), and applied those same settings across all workloads individually. The only workload-specific data were the measurements returned each step by the training interface, as well as the horizon of the learning rate schedule, which was defined as a fraction of the maximum steps per workload. ... Table 3: Search space for evidence-based search procedure. Base Learning Rate: Log [1e-4, 5e-2]; Warmup: Discrete {0.02, 0.05, 0.1}; Weight Decay: Log [1e-5, 0.5]; 1 − β1: Log [1e-3, 1.0]; 1 − β2: Log [1e-3, 1.0]; Dropouts (tied): Discrete {0.0, 0.1}; Label Smoothing: Discrete {0.0, 0.2} |
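For context on the performance-profile scoring quoted in the Dataset Splits row (Dolan and Moré, 2002): a profile reports, for each algorithm, the fraction of workloads on which its time-to-target is within a factor τ of the best algorithm's time on that workload. A minimal sketch of the computation — the algorithm names and timings below are made-up illustrations, not the paper's results:

```python
import math

# Hypothetical time-to-target results (seconds) per algorithm per workload;
# math.inf marks a run that never reached the validation target within budget.
times = {
    "nadamw":  {"wmt": 100.0, "ogbg": 200.0, "criteo": 150.0},
    "prodigy": {"wmt": 120.0, "ogbg": 180.0, "criteo": math.inf},
}

workloads = ["wmt", "ogbg", "criteo"]

def perf_ratio(algo, workload):
    """Ratio of this algorithm's time to the best time on the workload."""
    best = min(times[a][workload] for a in times)
    return times[algo][workload] / best

def profile(algo, tau):
    """Fraction of workloads where the algorithm is within factor tau of the best."""
    return sum(perf_ratio(algo, w) <= tau for w in workloads) / len(workloads)
```

Plotting `profile(algo, tau)` as τ grows from 1 yields the performance-profile curves the benchmark aggregates into a score; a never-finished run (infinite time) never counts toward the fraction.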
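The search space in the Experiment Setup row mixes log-uniform ranges (learning rate, weight decay, 1 − β terms) with discrete choices. A minimal sketch of sampling one configuration from that space — the parameter names and the `sample_config` helper are illustrative, not the paper's code:

```python
import math
import random

# Search space from Table 3: log-uniform ranges and discrete choices.
LOG_RANGES = {
    "base_lr": (1e-4, 5e-2),
    "weight_decay": (1e-5, 0.5),
    "one_minus_beta1": (1e-3, 1.0),
    "one_minus_beta2": (1e-3, 1.0),
}
DISCRETE = {
    "warmup_fraction": [0.02, 0.05, 0.1],
    "dropout": [0.0, 0.1],         # tied across all dropout layers
    "label_smoothing": [0.0, 0.2],
}

def log_uniform(lo, hi, rng):
    """Sample uniformly in log space between lo and hi."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_config(rng=None):
    rng = rng or random.Random()
    cfg = {k: log_uniform(lo, hi, rng) for k, (lo, hi) in LOG_RANGES.items()}
    cfg.update({k: rng.choice(v) for k, v in DISCRETE.items()})
    # The betas are searched as 1 - beta so the log scale resolves
    # values near 1 (e.g. 0.9, 0.99, 0.999) evenly.
    cfg["beta1"] = 1.0 - cfg.pop("one_minus_beta1")
    cfg["beta2"] = 1.0 - cfg.pop("one_minus_beta2")
    return cfg
```

Searching 1 − β on a log scale is the standard trick for momentum-style hyperparameters: it spends resolution where behavior changes fastest, near β = 1.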