How far away are truly hyperparameter-free learning algorithms?
Authors: Priya Kasimbeg, Vincent Roulet, Naman Agarwal, Sourabh Medapati, Fabian Pedregosa, Atish Agarwala, George E. Dahl
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we evaluate the potential of learning-rate-free methods as components of hyperparameter-free methods. We freeze their (non-learning rate) hyperparameters to default values, and score their performance using the recently-proposed AlgoPerf: Training Algorithms benchmark. We found that literature-supplied default settings performed poorly on the benchmark, so we performed a search for hyperparameter configurations that performed well across all workloads simultaneously. The best AlgoPerf-calibrated learning-rate-free methods had much improved performance but still lagged slightly behind a similarly calibrated NadamW baseline in overall benchmark score. |
| Researcher Affiliation | Industry | Priya Kasimbeg EMAIL Google DeepMind Vincent Roulet EMAIL Google DeepMind ... Sourabh Medapati EMAIL Google DeepMind Fabian Pedregosa EMAIL Google DeepMind Atish Agarwala EMAIL Google DeepMind George E. Dahl EMAIL Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Adam with D-Adapt; Algorithm 2: Prodigy algorithm; Algorithm 3: Mechanic: A Learning Rate Tuner; Algorithm 4: MoMo: Model-based Momentum method. |
| Open Source Code | No | The paper mentions "publicly available libraries" (Section 4) and "official implementation" for Prodigy (Section 2.3), and a "JAX implementation of Schedule-Free Adam" (Appendix F), but these refer to third-party or general implementations and not specific source code released by the authors for the methodology presented in this paper. No direct link or explicit statement of code release by the authors is provided. |
| Open Datasets | Yes | We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark (Dahl et al., 2023). The AlgoPerf workloads span a diverse collection of architectures and datasets across image, text, speech, and graph domains. Each workload is specified by a Dataset, Model, Loss, and evaluation metric (see Table 2 for a summary). |
| Dataset Splits | Yes | We studied the performance of the candidate algorithms by training 8 workloads from the MLCommons AlgoPerf: Training Algorithms benchmark on the self-tuning track (Dahl et al., 2023). ... The AlgoPerf benchmark adopted performance profiles (Dolan and Moré, 2002), which are a convenient measurement of the performance of a set of algorithms aggregated across a set of workloads. ... We set a validation evaluation metric target value, and measured how quickly the target is reached (or if it is reached at all within the budget). |
| Hardware Specification | Yes | The workloads in this study were trained on TPUv2 2x2 slices (16GB HBM per chip), with the exception of Criteo 1TB DLRM small which was trained on TPUv3 4x4 (32GB HBM per chip). |
| Software Dependencies | No | The paper mentions "Optax implementations" in Appendix A and a "JAX implementation of Schedule-Free Adam" in Appendix F. However, it does not provide specific version numbers for these or any other software libraries or frameworks used. |
| Experiment Setup | Yes | We specified the optimizer's hyperparameters, regularization hyperparameters, and a learning rate schedule (or equivalent), and applied those same settings across all workloads individually. The only workload-specific data were the measurements returned each step by the training interface, as well as the horizon of the learning rate schedule, which was defined as a fraction of the maximum steps per workload. ... Table 3: Search space for evidence-based search procedure. Base Learning Rate: Log [1e-4, 5e-2]; Warmup: Discrete {0.02, 0.05, 0.1}; Weight Decay: Log [1e-5, 0.5]; 1 − β1: Log [1e-3, 1.0]; 1 − β2: Log [1e-3, 1.0]; Dropouts (tied): Discrete {0.0, 0.1}; Label Smoothing: Discrete {0.0, 0.2} |
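For context on the performance-profile scoring quoted in the Dataset Splits row (Dolan and Moré, 2002): a profile reports, for each algorithm, the fraction of workloads on which its time-to-target is within a factor τ of the best algorithm's time on that workload. A minimal sketch of the computation — the algorithm names and timings below are made-up illustrations, not the paper's results:

```python
import math

# Hypothetical time-to-target results (seconds) per algorithm per workload;
# math.inf marks a run that never reached the validation target within budget.
times = {
    "nadamw":  {"wmt": 100.0, "ogbg": 200.0, "criteo": 150.0},
    "prodigy": {"wmt": 120.0, "ogbg": 180.0, "criteo": math.inf},
}

workloads = ["wmt", "ogbg", "criteo"]

def perf_ratio(algo, workload):
    """Ratio of this algorithm's time to the best time on the workload."""
    best = min(times[a][workload] for a in times)
    return times[algo][workload] / best

def profile(algo, tau):
    """Fraction of workloads where the algorithm is within factor tau of the best."""
    return sum(perf_ratio(algo, w) <= tau for w in workloads) / len(workloads)
```

Plotting `profile(algo, tau)` as τ grows from 1 yields the performance-profile curves the benchmark aggregates into a score; a never-finished run (infinite time) never counts toward the fraction.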
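The search space in the Experiment Setup row mixes log-uniform ranges (learning rate, weight decay, 1 − β terms) with discrete choices. A minimal sketch of sampling one configuration from that space — the parameter names and the `sample_config` helper are illustrative, not the paper's code:

```python
import math
import random

# Search space from Table 3: log-uniform ranges and discrete choices.
LOG_RANGES = {
    "base_lr": (1e-4, 5e-2),
    "weight_decay": (1e-5, 0.5),
    "one_minus_beta1": (1e-3, 1.0),
    "one_minus_beta2": (1e-3, 1.0),
}
DISCRETE = {
    "warmup_fraction": [0.02, 0.05, 0.1],
    "dropout": [0.0, 0.1],         # tied across all dropout layers
    "label_smoothing": [0.0, 0.2],
}

def log_uniform(lo, hi, rng):
    """Sample uniformly in log space between lo and hi."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_config(rng=None):
    rng = rng or random.Random()
    cfg = {k: log_uniform(lo, hi, rng) for k, (lo, hi) in LOG_RANGES.items()}
    cfg.update({k: rng.choice(v) for k, v in DISCRETE.items()})
    # The betas are searched as 1 - beta so the log scale resolves
    # values near 1 (e.g. 0.9, 0.99, 0.999) evenly.
    cfg["beta1"] = 1.0 - cfg.pop("one_minus_beta1")
    cfg["beta2"] = 1.0 - cfg.pop("one_minus_beta2")
    return cfg
```

Searching 1 − β on a log scale is the standard trick for momentum-style hyperparameters: it spends resolution where behavior changes fastest, near β = 1.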