The Directionality of Optimization Trajectories in Neural Networks

Authors: Sidak Pal Singh, Bobby He, Thomas Hofmann, Bernhard Schölkopf

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive analysis across vision and language modeling tasks reveals that (a) the trajectory's directionality at the macro-level saturates by the initial phase of training, wherein weight decay and momentum play a crucial but understated role; and (b) in subsequent training, trajectory directionality manifests in micro-level behaviors, such as oscillations, for which we also provide a theoretical analysis. We experiment with ResNet20 on CIFAR10 using SGD over 160 epochs, freezing non-BN layers at different points in training, and then using only the 1,376 scalar parameters in the BN layers for the remaining training duration.
Researcher Affiliation | Academia | Sidak Pal Singh (ETH Zürich & MPI-IS Tübingen), Bobby He (ETH Zürich), Thomas Hofmann (ETH Zürich), Bernhard Schölkopf (MPI-IS Tübingen & ETH Zürich)
Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, such as in Section 2 (Methodology) and Section 4.2 (Theoretical Modelling of the Underlying Mechanism), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Our comprehensive analysis across vision and language modeling tasks reveals that... We employ the standard SGD-based training recipe for ResNet50 on ImageNet... Similarly, we also evaluate VGG16 on CIFAR10... Thanks to Pythia's (Biderman et al., 2023) publicly released model checkpoints over training, for GPT-NeoX (Black et al., 2022) models ranging in sizes from 14 Million (M) to 12 Billion (B)... We experiment with ResNet20 on CIFAR10...
Dataset Splits | Yes | We employ the standard SGD-based training recipe for ResNet50 on ImageNet that achieves a top-1 accuracy of 76%... We experiment with ResNet20 on CIFAR10 using SGD over 160 epochs... The use of well-known benchmark datasets such as ImageNet and CIFAR10 implies their standard, predefined training, validation, and test splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments. It only mentions general computational resources without specific identifying information.
Software Dependencies | No | The paper mentions various optimizers and models (e.g., SGD, AdamW, Pythia, GPT-NeoX) but does not specify software dependencies with version numbers (e.g., Python version, specific deep learning framework versions like PyTorch or TensorFlow, or library versions).
Experiment Setup | Yes | Namely, this consists of training for 90 epochs with a learning rate η = 0.1 (decayed multiplicatively by a factor of 0.1 at epochs 30 and 60), momentum µ = 0.9, batch size B = 256, and weight decay λ = 0.0001.
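The figure of 1,376 BN scalars quoted above can be sanity-checked from the standard CIFAR-10 ResNet20 architecture (He et al., 2016). This pure-Python sketch assumes the usual layout — a 16-channel stem convolution followed by three stages of three basic blocks (two convs each) at widths 16, 32, and 64, with identity (option-A) shortcuts, so every conv has one BatchNorm carrying two learnable scalars (scale γ and shift β) per channel:

```python
# Count learnable BatchNorm scalars in a standard CIFAR-10 ResNet20.
# Assumed layout: one 16-channel stem conv, then 3 stages x 3 basic
# blocks x 2 convs at widths 16/32/64; option-A shortcuts add no BN.
bn_channels = [16]                    # BN after the stem conv
for stage_width in (16, 32, 64):
    bn_channels += [stage_width] * 6  # 3 blocks x 2 convs per stage

# Each BN layer learns 2 scalars (gamma, beta) per channel.
total_bn_scalars = sum(2 * c for c in bn_channels)
print(total_bn_scalars)  # 1376
```

The count matches the paper's 1,376, confirming that the BN-only phase of the freezing experiment optimizes a tiny subspace of the full model.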
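The quoted ImageNet/ResNet50 recipe uses a step schedule, which can be sketched in a few lines. The paper releases no code, so this is a pure-Python illustration of the stated schedule (η = 0.1, multiplied by 0.1 at epochs 30 and 60 over 90 epochs); in PyTorch it would correspond to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[30, 60]` and `gamma=0.1`:

```python
# Step learning-rate schedule from the quoted recipe: base LR 0.1,
# decayed multiplicatively by 0.1 at epochs 30 and 60, over 90 epochs.
def lr_at_epoch(epoch, base_lr=0.1, milestones=(30, 60), gamma=0.1):
    """Learning rate in effect during the given (0-indexed) epoch."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

# Rounded to sidestep float noise from repeated multiplication.
print([round(lr_at_epoch(e), 4) for e in (0, 29, 30, 60, 89)])
# [0.1, 0.1, 0.01, 0.001, 0.001]
```

The other stated hyperparameters (momentum µ = 0.9, batch size 256, weight decay λ = 0.0001) are constant throughout training and need no schedule.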