Efficient Distributed Optimization under Heavy-Tailed Noise

Authors: Su Hyeong Lee, Manzil Zaheer, Tian Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, Tail OPT, including Bi2Clip, demonstrates superior performance on various tasks and models compared with state-of-the-art methods, while being more efficient. Our contributions may be summarized as follows. [...] We validate the practicality and effectiveness of Tail OPT through extensive experiments on synthetic and real-world datasets in large-scale settings. Our experiments demonstrate that Tail OPT produces several algorithmic instantiations that consistently outperform state-of-the-art baselines while being more efficient. Section 6. Experiments: We assess the performance of various Tail OPT instantiations across a range of empirical tasks, benchmarking them against state-of-the-art algorithms from the literature. Our experiments include synthetic tasks with heavy-tailed noise injection and real-world benchmarks, including GLUE (Wang et al., 2019) for natural language understanding, WMT (Foundation, 2019) for machine translation, Qwen2.5 (Yang et al., 2025) for question answering, and ViT (Dosovitskiy et al., 2021) for image classification.
Researcher Affiliation | Collaboration | ¹Department of Statistics, University of Chicago; ²Meta; ³Department of Computer Science, University of Chicago. Correspondence to: Su Hyeong Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1 Heavy-Tailed Optimization (Tail OPT). Require: initial model x1, learning rate schedule ηt, clipping schedules ut, dt ≥ 0, synchronization timestep z ∈ ℤ>0 [...] Algorithm 4 Bi2Clip. Require: initial model x1, learning rate schedule ηt, clipping schedules ut, dt, ũt, d̃t, synchronization timestep z ∈ ℤ>0
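The extracted pseudocode above only lists the algorithms' inputs. As a rough illustration of the bilateral clipping primitive these schedules parameterize (a sketch under our own reading, not the authors' implementation; the helper name `bi_clip` is ours), each coordinate's magnitude can be forced into the band [d, u] while its sign is preserved:

```python
import math

def bi_clip(grad, d, u):
    """Coordinate-wise bilateral clip (sketch): force each coordinate's
    magnitude into [d, u], preserving its sign; zero coordinates stay zero.
    Here d <= u play the role of the clipping schedules d_t, u_t."""
    out = []
    for g in grad:
        if g == 0.0:
            out.append(0.0)  # no sign to preserve, leave untouched
        else:
            # clip |g| from below by d and from above by u, restore sign
            out.append(math.copysign(min(max(abs(g), d), u), g))
    return out

print(bi_clip([0.001, -5.0, 0.5], d=0.01, u=1.0))  # → [0.01, -1.0, 0.5]
```

The upper threshold tames heavy-tailed spikes while the lower threshold boosts vanishing coordinates, which is how bilateral clipping mimics adaptive per-coordinate step sizes without extra optimizer state.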
Open Source Code | Yes | Our code is publicly available at github.com/sulee3/HeavyTails.
Open Datasets | Yes | We assess the performance of various Tail OPT instantiations across a range of empirical tasks, benchmarking them against state-of-the-art algorithms from the literature. Our experiments include synthetic tasks with heavy-tailed noise injection and real-world benchmarks, including GLUE (Wang et al., 2019) for natural language understanding, WMT (Foundation, 2019) for machine translation, Qwen2.5 (Yang et al., 2025) for question answering, and ViT (Dosovitskiy et al., 2021) for image classification. [...] finetune Qwen2.5 (Yang et al., 2025) on the SQuAD dataset (Rajpurkar et al., 2016), and finetune a ViT-base model (Dosovitskiy et al., 2021) (pretrained on ImageNet) on CIFAR100 (Krizhevsky et al., 2009). [...] We utilized the LEAF repository (Caldas et al., 2018), originally a benchmark suite for federated learning, which provides datasets, tools, and baselines to evaluate algorithms under real-world conditions. Among the datasets in LEAF, we modified the Shakespeare dataset [...]. These texts were open sourced from Project Gutenberg.
Dataset Splits | Yes | The Philosopher Dataset was synthesized by allocating each literary work to one of eight compute nodes, followed by an 80-20 train-test split.
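The per-node 80-20 split described above can be sketched as follows (an illustrative helper of our own; the function name and fixed seed are assumptions, not taken from the paper):

```python
import random

def split_80_20(samples, seed=0):
    """Shuffle one node's samples and split them 80% train / 20% test.
    Sketch of the per-node split described for the Philosopher Dataset."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    samples = list(samples)
    rng.shuffle(samples)
    cut = int(0.8 * len(samples))
    return samples[:cut], samples[cut:]

train, test = split_80_20(range(100))
print(len(train), len(test))  # → 80 20
```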
Hardware Specification | No | The training of deep learning models including large language models (LLMs) has become increasingly resource-intensive, driven by expansive datasets and models with billions of parameters (Rosa et al., 2022; Liu et al., 2024b; Sriram et al., 2022; Dehghani et al., 2023). As the computational demands escalate, distributed learning has emerged as the default approach, enabling the parallel activation of training processes across multiple compute nodes such as GPUs or datacenters. While the paper mentions 'GPUs or datacenters' and 'compute nodes', it does not specify exact models of GPUs, CPUs, or detailed hardware configurations. This information is too general to be considered a specific hardware description.
Software Dependencies | No | The paper makes no explicit mention of specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, specific library versions). It mentions using models like RoBERTa and T5, but these are architectures, not software packages with versions.
Experiment Setup | Yes | Extended details of the experimental setup, dataset descriptions, and extensive hyperparameter tuning procedures (including the best hyperparameters for each method and dataset) are provided in Appendix D. Our code is publicly available at github.com/sulee3/HeavyTails. [...] D.7. Hyperparameter Sweep Grids: The sweep grids in Tables 8, 9 were determined by first performing a coarser sweep using an approximate grid, then localizing near the discovered well-performing hyperparameters. [...] D.8. Optimal Hyperparameters: In this subsection, we display the optimal hyperparameters located during our extensive sweep.