Faster Stochastic Optimization with Arbitrary Delays via Adaptive Asynchronous Mini-Batching
Authors: Amit Attia, Ofir Gaash, Tomer Koren
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate the benefits of asynchronous mini-batching, we compare vanilla asynchronous SGD (denoted Async-SGD) with a practical variant of our mini-batch method (Algorithm 1), which uses SGD, denoted Async-MB-SGD, for training a fully connected neural network on the Fashion MNIST classification dataset (Xiao et al., 2017). The dataset consists of 60,000 training images and 10,000 test images, each of size 28×28 pixels and labeled across 10 classes. We use test accuracy as the evaluation metric. |
| Researcher Affiliation | Collaboration | ¹Blavatnik School of Computer Science, Tel Aviv University; ²Google Research, Tel Aviv. Correspondence to: Amit Attia <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Asynchronous mini-batching Algorithm 2: Asynchronous mini-batching sweep |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | training a fully connected neural network on the Fashion MNIST classification dataset (Xiao et al., 2017). |
| Dataset Splits | Yes | The dataset consists of 60,000 training images and 10,000 test images, each of size 28×28 pixels and labeled across 10 classes. |
| Hardware Specification | No | We adopt the two-phase asynchronous simulation framework of Cohen et al. (2021). In the first phase, we simulate compute times for each worker by drawing from a weighted mixture of two Poisson distributions. In the second phase, we simulate training by having each worker deliver gradients to a central server according to the generated compute schedule. |
| Software Dependencies | No | The paper mentions training a neural network and using cross-entropy loss, but does not specify any software names with version numbers (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | Each worker uses a local mini-batch of size 8. The learning rate is selected separately for each algorithm from a geometric grid with multiplicative factor 3/10: for Async-SGD we search over the range [0.001, 1.0], and for Async-MB-SGD over [0.01, 1.0]. For Async-MB-SGD, we additionally tune the aggregation batch size B (i.e., the number of updates the server accumulates before modifying the model) over the set {1, 2, 4, 8, 16, 32}. We conduct experiments with 40, 160, and 640 workers, using 7,500, 30,000, and 120,000 update steps, respectively. ... To reduce the variation of the last iterate, we use exponential moving averaging with decay 0.99. |
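The two-phase simulation quoted under Hardware Specification can be sketched as follows. This is a minimal illustration, not the authors' code: the Poisson rates, the mixture weight, and the helper names are assumptions, since the paper excerpt does not report the exact parameters.

```python
import heapq
import math
import random


def poisson(rng, lam):
    # Knuth's inverse-CDF method for Poisson sampling (adequate for small rates).
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1


def sample_compute_time(rng, lam_fast=2.0, lam_slow=20.0, p_slow=0.1):
    # Phase 1: draw a per-gradient compute time from a weighted mixture of two
    # Poisson distributions (rates and mixture weight are illustrative guesses).
    lam = lam_slow if rng.random() < p_slow else lam_fast
    return 1 + poisson(rng, lam)  # +1 so a draw of 0 still advances time


def delivery_schedule(num_workers, num_deliveries, seed=0):
    # Phase 2: replay the compute schedule as a stream of gradient deliveries
    # to the central server, ordered by simulated completion time.
    rng = random.Random(seed)
    heap = [(sample_compute_time(rng), w) for w in range(num_workers)]
    heapq.heapify(heap)
    schedule = []
    for _ in range(num_deliveries):
        t, w = heapq.heappop(heap)
        schedule.append((t, w))  # worker w delivers a gradient at time t
        heapq.heappush(heap, (t + sample_compute_time(rng), w))
    return schedule
```

Under this sketch, slow workers (drawn from the high-rate Poisson component) deliver stale gradients, which is exactly the heterogeneous-delay regime the experiment targets.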
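Two pieces of the quoted setup are easy to make concrete: the geometric learning-rate grid (multiplicative factor 3/10) and the exponential moving average with decay 0.99. The sketch below assumes factor 3/10 means multiplying by 0.3 from the top of the range downward; the paper excerpt does not specify the direction or endpoints of the sweep.

```python
def geometric_grid(low, high, factor=0.3):
    # Learning-rate grid: start at the upper end and repeatedly multiply by
    # the factor (3/10 in the paper) until dropping below the lower end.
    grid, lr = [], high
    while lr >= low:
        grid.append(lr)
        lr *= factor
    return grid


def ema_update(avg, params, decay=0.99):
    # Exponential moving average of model parameters with decay 0.99,
    # used in the paper to reduce the variation of the last iterate.
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```

For the Async-SGD range [0.001, 1.0] this yields the candidates 1.0, 0.3, 0.09, 0.027, 0.0081, 0.00243; the next step (0.000729) falls below 0.001 and is excluded.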