How far away are truly hyperparameter-free learning algorithms?

Authors: Priya Kasimbeg, Vincent Roulet, Naman Agarwal, Sourabh Medapati, Fabian Pedregosa, Atish Agarwala, George E. Dahl

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we evaluate the potential of learning-rate-free methods as components of hyperparameter-free methods. We freeze their (non-learning-rate) hyperparameters to default values, and score their performance using the recently proposed AlgoPerf: Training Algorithms benchmark. We found that literature-supplied default settings performed poorly on the benchmark, so we performed a search for hyperparameter configurations that performed well across all workloads simultaneously. The best AlgoPerf-calibrated learning-rate-free methods had much improved performance but still lagged slightly behind a similarly calibrated NadamW baseline in overall benchmark score.
Researcher Affiliation Industry Priya Kasimbeg EMAIL Google DeepMind Vincent Roulet EMAIL Google DeepMind ... Sourabh Medapati EMAIL Google DeepMind Fabian Pedregosa EMAIL Google DeepMind Atish Agarwala EMAIL Google DeepMind George E. Dahl EMAIL Google DeepMind
Pseudocode Yes Algorithm 1: Adam with D-Adapt Algorithm 2: Prodigy algorithm Algorithm 3: Mechanic: A Learning Rate Tuner. Algorithm 4: MoMo: Model-based Momentum method.
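The four listed algorithms share the idea of estimating a good step size online rather than taking it as input. As a rough illustration of that idea (a simplified sketch loosely inspired by D-Adaptation, not any of the paper's algorithms verbatim), one can maintain a growing lower bound d on the distance from the initial point to the solution and scale steps by d over the root of the accumulated squared gradient norm:

```python
import numpy as np

def lr_free_gd(grad, x0, steps=500, d0=1e-6, eps=1e-12):
    """Simplified learning-rate-free gradient descent, loosely inspired by
    D-Adaptation. Illustrative only: the real algorithms use more careful
    distance estimators, weightings, and momentum."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    d = d0                    # lower-bound estimate of ||x0 - x*||
    s = np.zeros_like(x)      # running sum of gradients
    g_sq = 0.0                # accumulated squared gradient norm
    for _ in range(steps):
        g = np.array(grad(x), dtype=float)  # copy: x is updated in place below
        g_sq += float(g @ g)
        x -= (d / (np.sqrt(g_sq) + eps)) * g
        s += g
        # Distance estimate: projection of the displacement onto the gradient sum.
        d = max(d, float(s @ (x0 - x)) / (np.linalg.norm(s) + eps))
    return x, d

# Quadratic f(x) = 0.5 * ||x||^2 with minimizer 0; no learning rate supplied.
x_final, d_final = lr_free_gd(lambda x: x, x0=np.array([10.0]), steps=500)
```

Starting from a tiny d0, the estimate d grows geometrically toward the true initial distance, after which the effective step size is well calibrated and the iterates converge.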
Open Source Code No The paper mentions "publicly available libraries" (Section 4) and "official implementation" for Prodigy (Section 2.3), and a "JAX implementation of Schedule-Free Adam" (Appendix F), but these refer to third-party or general implementations and not specific source code released by the authors for the methodology presented in this paper. No direct link or explicit statement of code release by the authors is provided.
Open Datasets Yes We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark (Dahl et al., 2023). The AlgoPerf workloads span a diverse collection of architectures and datasets across image, text, speech, and graph domains. Each workload is specified by a Dataset, Model, Loss, and evaluation metric (see Table 2 for a summary).
Dataset Splits Yes We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark on the self-tuning track (Dahl et al., 2023). ... The AlgoPerf benchmark adopted performance profiles (Dolan and Moré, 2002), which are a convenient measurement of the performance of a set of algorithms aggregated across a set of workloads. ... We set a validation evaluation metric target value, and measured how quickly the target is reached (or if it is reached at all within the budget).
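The performance-profile construction of Dolan and Moré can be stated in a few lines: for each algorithm a and workload w, the ratio r_{a,w} = t_{a,w} / min_a' t_{a',w} compares its time-to-target against the best algorithm on that workload, and the profile ρ_a(τ) is the fraction of workloads where r_{a,w} ≤ τ. The sketch below is a generic illustration of this definition with made-up numbers, not the benchmark's official scoring code:

```python
import numpy as np

def performance_profiles(times, taus):
    """times: (n_algorithms, n_workloads) array of time-to-target,
    with np.inf where the target was never reached within budget.
    Returns rho: (n_algorithms, len(taus)), the fraction of workloads
    on which each algorithm is within a factor tau of the best."""
    best = times.min(axis=0)        # best time per workload
    ratios = times / best           # r_{a,w}; inf stays inf (never counted)
    taus = np.asarray(taus)
    return (ratios[:, None, :] <= taus[None, :, None]).mean(axis=2)

# Hypothetical times for 2 algorithms on 4 workloads (inf = target missed).
times = np.array([[1.0, 2.0, 3.0, np.inf],
                  [2.0, 1.0, 3.0, 4.0]])
rho = performance_profiles(times, taus=[1.0, 2.0])
```

At τ = 1, ρ_a(1) is simply the fraction of workloads on which algorithm a is (tied for) fastest; a workload whose target is never reached contributes zero at every τ.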
Hardware Specification Yes The workloads in this study were trained on TPUv2 2x2 slices (16GB HBM per chip), with the exception of Criteo 1TB DLRM small which was trained on TPUv3 4x4 (32GB HBM per chip).
Software Dependencies No The paper mentions "Optax implementations" in Appendix A and a "JAX implementation of Schedule-Free Adam" in Appendix F. However, it does not provide specific version numbers for these or any other software libraries or frameworks used.
Experiment Setup Yes We specified the optimizer's hyperparameters, regularization hyperparameters, and a learning rate schedule (or equivalent), and applied those same settings across all workloads individually. The only workload-specific data were the measurements returned each step by the training interface, as well as the horizon of the learning rate schedule, which was defined as a fraction of the maximum steps per workload. ... Table 3: Search space for evidence-based search procedure.
Parameter | Scale | Range
Base Learning Rate | Log | [1e-4, 5e-2]
Warmup | Discrete | {0.02, 0.05, 0.1}
Weight Decay | Log | [1e-5, 0.5]
1 − β1 | Log | [1e-3, 1.0]
1 − β2 | Log | [1e-3, 1.0]
Dropouts (tied) | Discrete | {0.0, 0.1}
Label Smoothing | Discrete | {0.0, 0.2}
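The search space in Table 3 mixes log-uniform continuous ranges (with the momentum parameters reparameterized as 1 − β) and small discrete sets. A minimal sketch of drawing one candidate configuration from it (the function name and dict keys are illustrative, not the authors' actual search code):

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the Table 3 search space."""
    def log_uniform(lo, hi):
        # Uniform in log space, so each order of magnitude is equally likely.
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    return {
        "base_learning_rate": log_uniform(1e-4, 5e-2),
        "warmup_fraction": rng.choice([0.02, 0.05, 0.1]),
        "weight_decay": log_uniform(1e-5, 0.5),
        "beta1": 1.0 - log_uniform(1e-3, 1.0),  # sampled as 1 - beta1, log scale
        "beta2": 1.0 - log_uniform(1e-3, 1.0),  # sampled as 1 - beta2, log scale
        "dropout": rng.choice([0.0, 0.1]),      # tied across dropout layers
        "label_smoothing": rng.choice([0.0, 0.2]),
    }

cfg = sample_config(random.Random(0))
```

Sampling 1 − β on a log scale concentrates candidates near β = 1 (e.g. 0.9, 0.99, 0.999), where momentum and second-moment decay values typically live.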