Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism

Authors: Tehila Dahan, Kfir Y. Levy

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Empirical studies further validate the robustness and enhanced stability of our approach. We empirically demonstrate the improved stability and performance of our methods over various baselines, confirming both the theoretical and practical advantages of our approach." (Section 6: Experiments)

Researcher Affiliation | Academia | Tehila Dahan, ECE Department, Technion, Haifa, Israel; Kfir Y. Levy, ECE Department, Technion, Haifa, Israel

Pseudocode | Yes | Algorithm 1: µ2-SGD; Algorithm 2: µ2 Extra SGD

Open Source Code | Yes | "our GitHub repository": https://github.com/dahan198/mu2sgd

Open Datasets | Yes | "The evaluation is conducted on the MNIST dataset (LeCun et al., 2010), using a logistic regression model. We demonstrate the effectiveness of our approach in non-convex settings using a 2-layer convolutional network on the MNIST dataset and ResNet-18 on the CIFAR-10 dataset (Krizhevsky et al., 2014)."

Dataset Splits | No | The paper states that for MNIST, "Both the training and testing phases employed mini-batches of size 64, with one full pass (epoch) over the dataset," and for CIFAR-10, "We trained ResNet-18 for 25 epochs using mini-batches of size 32." While it indicates data usage for training and testing, it does not provide specific split percentages, sample counts, or explicit references to predefined train/test/validation splits.

Hardware Specification | Yes | "The convex experiments were run on an Apple M2 chip, while the non-convex experiments were executed on an NVIDIA A30 GPU."

Software Dependencies | No | "All experiments were conducted using the PyTorch framework." The paper mentions PyTorch but does not specify a version number.

Experiment Setup | Yes | "We compared the following optimization algorithms over a range of fixed learning rates." The convex experiments were run on the MNIST dataset; both the training and testing phases employed mini-batches of size 64, with one full pass (epoch) over the dataset. The algorithms were evaluated with their respective parameter settings: µ2-SGD with αt = t and βt = 1/t, STORM with βt = 1/t, and Anytime-SGD with αt = t. For the non-convex experiments, ResNet-18 was trained for 25 epochs using mini-batches of size 32, with RandomCrop (32×32, padding=2, p=0.5) and RandomHorizontalFlip (p=0.5) for data augmentation. The algorithms were evaluated with fixed parameter settings: µ2-SGD with γt = 0.1 and βt = 0.9, STORM with βt = 0.9, and Anytime-SGD with γt = 0.1.
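To make the reported parameter settings concrete, below is a minimal sketch of a µ2-SGD-style double-momentum loop on a toy logistic-regression problem, using the convex-experiment settings above (αt = t, βt = 1/t, mini-batches of 64). This is one plausible reading of the double-momentum mechanism (a STORM-style corrected momentum estimator queried at Anytime-averaged points), not the authors' implementation; the objective, function names, and toy data are hypothetical.

```python
import numpy as np

def grad(w, X, y):
    # Stand-in objective: binary logistic-regression gradient (hypothetical,
    # chosen only to make the sketch runnable).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def mu2_sgd(X, y, steps=100, eta=0.1, batch=64, seed=0):
    """Sketch of a double-momentum step: STORM-style corrected momentum
    (beta_t = 1/t) with gradients queried at the Anytime weighted average
    (alpha_t = t), per the parameter settings reported above."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    w = np.zeros(dim)        # iterate x_t
    w_bar = w.copy()         # anytime weighted average of iterates
    w_bar_prev = w.copy()
    alpha_sum = 0.0
    d_t = np.zeros(dim)
    for t in range(1, steps + 1):
        idx = rng.choice(len(y), size=batch)      # mini-batch, size 64 as reported
        g_bar = grad(w_bar, X[idx], y[idx])       # gradient at averaged point
        beta_t = 1.0 / t                          # beta_t = 1/t as reported
        if t == 1:
            d_t = g_bar
        else:
            # STORM-style correction: re-evaluate the same sample at the
            # previous query point (assumed form of the corrected momentum).
            g_prev = grad(w_bar_prev, X[idx], y[idx])
            d_t = g_bar + (1.0 - beta_t) * (d_t - g_prev)
        w_bar_prev = w_bar.copy()
        w = w - eta * d_t
        alpha_t = float(t)                        # alpha_t = t as reported
        alpha_sum += alpha_t
        # incremental Anytime average: x_bar += (alpha_t / sum alpha)(x - x_bar)
        w_bar = w_bar + (alpha_t / alpha_sum) * (w - w_bar)
    return w_bar
```

The sketch returns the averaged point, since in Anytime-style methods the guarantees are stated for the weighted average of the iterates rather than the last iterate.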