Accelerated training through iterative gradient propagation along the residual path

Authors: Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through an extensive empirical study on a large selection of tasks and models, we evaluate Highway-BP and show that major speedups can be achieved with minimal performance degradation."
Researcher Affiliation | Collaboration | Erwan Fagnou (1), Paul Caillon (1), Blaise Delattre (1,2) & Alexandre Allauzen (1,3); (1) Miles Team, LAMSADE, Université Paris Dauphine-PSL, Paris, France; (2) Foxstream, Vaulx-en-Velin, France; (3) ESPCI PSL, Paris, France
Pseudocode | Yes | "We use Hillis and Steele's parallel algorithm (Hillis & Steele, 1986) in our experiments, and we indicate a pseudocode of this algorithm adapted to our needs in Appendix B (Algorithm 1)." ... Appendix B: "Pseudocode of the Parallel Prefix Scan Algorithm for CumSumProd", Algorithm 1 (Parallel CumSumProd).
Open Source Code | No | The paper does not state that code is publicly available, and it provides no repository link. It mentions "We leave its practical implementation for training large models in a distributed setting for future work." and "Still, we believe the prefix scan algorithm could be much more optimized, using a custom CUDA kernel for instance.", indicating future work rather than a current release.
Open Datasets | Yes | CIFAR10: "The CIFAR10 (Krizhevsky, 2009) dataset contains 50k images with 10 classes." CIFAR10 pixel-level: "In Long Range Arena (Tay et al., 2021), CIFAR10 images are flattened as sequences of 3-dimensional vectors." ImageNet32: "The ImageNet32 (Chrabaszcz et al., 2017) dataset contains 1.3M images with 1000 classes." Wikitext103: "Wikitext103 is a dataset containing texts extracted from Wikipedia." MNLI: "The Multi-Genre Natural Language Inference dataset (Williams et al., 2018) is a task from the GLUE benchmark (Wang et al., 2018)."
Dataset Splits | No | The paper uses standard datasets (CIFAR10, ImageNet32, Wikitext103, MNLI) that come with predefined splits, but it does not state the specific training/validation/test split percentages or sample counts it used. For example, it says ImageNet32 is processed "the same way as CIFAR10", which implies standard splits without specifying them.
Hardware Specification | Yes | "All experiments were conducted on single GPUs, either Nvidia A100, A40, or RTX A6000."
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2014), its AdamW variant (Loshchilov & Hutter, 2017), and a cosine learning rate scheduler, but it does not specify versions of the software frameworks (e.g., PyTorch, TensorFlow) or of the Python libraries used.
Experiment Setup | Yes | Table 4: "Hyperparameters used in the deep models experiments." Table 5: "Hyperparameters used in the RNN experiments." "Most models are trained using the Adam optimizer (Kingma & Ba, 2014). In case of weight decay, we use the AdamW variation (Loshchilov & Hutter, 2017). We also use a cosine learning rate scheduler to decrease the learning rate to a tenth of its initial value. Additionally, the first 10% of the training is performed with a linear warmup."
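The Pseudocode row cites a Hillis-and-Steele-style parallel prefix scan for a "CumSumProd" operation, i.e. the linear recurrence y[t] = a[t]*y[t-1] + b[t] that arises when accumulating gradients along a residual path. The sketch below is our reconstruction under that assumption, not the paper's Appendix B pseudocode: the function name, NumPy layout, and affine-map combination rule are ours.

```python
import numpy as np

def parallel_cumsumprod(a, b):
    """Hillis-Steele inclusive scan for y[t] = a[t]*y[t-1] + b[t], y[-1] = 0.

    Each element is an affine map x -> a*x + b; composing the map at i with
    the map at i - 2**d gives (a_i * a_{i-d}, a_i * b_{i-d} + b_i).
    After ceil(log2(n)) doubling steps, b[t] holds y[t].
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(a)
    d = 1
    while d < n:
        # Snapshot the left neighbours before overwriting them.
        a_prev = a[:-d].copy()
        b_prev = b[:-d].copy()
        # Compose: update b first, since it uses the not-yet-updated a.
        b[d:] = a[d:] * b_prev + b[d:]
        a[d:] = a[d:] * a_prev
        d *= 2
    return b
```

On parallel hardware every doubling step has O(1) depth, so the scan replaces the O(n) sequential recurrence with O(log n) steps, which is the source of the backpropagation speedup the row describes.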
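The Experiment Setup row describes a cosine learning rate schedule that decays to a tenth of the initial value, with a linear warmup over the first 10% of training. A minimal sketch of that schedule follows; the function name and default arguments are our assumptions, and the actual per-experiment hyperparameters are in Tables 4 and 5.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.1, final_ratio=0.1):
    """Linear warmup for the first `warmup_frac` of training, then cosine
    decay from `peak_lr` down to `final_ratio * peak_lr`."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Progress through the cosine phase, in [0, 1].
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = final_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

For example, with total_steps=1000 and peak_lr=1e-3, the rate ramps linearly over the first 100 steps, peaks at 1e-3, and ends at 1e-4.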