How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies
Authors: Piotr Kubaty, Bartosz Wójcik, Bartłomiej Tomasz Krzepkowski, Monika Michaluk, Tomasz Trzcinski, Jary Pomponi, Kamil Adamczewski
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of the early exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy. (...) 4. Empirical Evaluation of Training Regimes |
| Researcher Affiliation | Collaboration | 1Jagiellonian University 2Warsaw University of Technology 3University of Warsaw 4Tooploox 5IDEAS Research Institute 6Department of Information Engineering, Electronics, and Telecommunications (DIET) at Sapienza, University of Rome, Italy 7Wroclaw University of Science and Technology. |
| Pseudocode | No | The paper defines three phases of training using mathematical equations (Equations 1, 2, 3) and describes the training regimes in terms of these phases. However, it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | In this section, we outline the setup for our empirical experiments. We release the source code of our experiments at: https://github.com/kamadforge/early-exit-benchmark. A more detailed description can be found in Appendix E. |
| Open Datasets | Yes | For CV, we utilize CIFAR-100 (Krizhevsky, 2009), ImageNet-1k (Russakovsky et al., 2015), Tiny ImageNet (Le & Yang, 2015), and Imagenette (Howard, 2019). For NLP, we evaluate on 20-Newsgroups (Lang, 1995) and STS-B (Wang et al., 2019) datasets. |
| Dataset Splits | No | To ensure fair convergence across different regimes, we incorporate an early stopping mechanism. Training is terminated only when, over n consecutive epochs, none of the exits achieve an improved performance compared to their best scores recorded thus far. These scores (accuracy for classification tasks and loss for regression tasks) are evaluated on a dedicated early-stopping validation set. |
| Hardware Specification | Yes | E.5. ViT-T, ImageNet-1k: We train each model using 4 A100 GPUs with an effective batch size of 2048. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and Cosine Annealing scheduler, and pretrained weights from torchvision (maintainers & contributors, 2016). However, specific version numbers for software libraries like PyTorch, torchvision, or other dependencies are not provided, only the initial publication year for torchvision. |
| Experiment Setup | Yes | E.1. ResNet-34, CIFAR-100: Training set-up. We train each model with a batch size of 128. We use a learning rate of 5e-4 and no weight decay. We set the early stopping patience to 50 epochs. CutMix and Mixup are used as augmentations. (...) E.3. ViT-T, CIFAR-100: Training set-up. We train each model with a batch size of 256. We use a learning rate of 5e-4 and no weight decay. We set the early stopping patience to 30 epochs. The following augmentations are used: random resizing, cropping, rotation, contrast adjustment, random erasing, CutMix and Mixup. |
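The early-stopping rule quoted under "Dataset Splits" (stop only when, over n consecutive epochs, no exit improves on its best validation score so far) can be sketched as follows. This is a minimal illustration of that patience logic, not the paper's implementation; `should_stop` and the shape of `history` are hypothetical names chosen here, and scores are assumed to be higher-is-better (e.g. per-exit validation accuracy).

```python
def should_stop(history, patience):
    """Multi-exit early stopping, per the rule quoted above: stop only
    when, over `patience` consecutive epochs, none of the exits improve
    on their best score recorded before that window.

    `history` is a list of per-epoch score lists, one score per exit
    (higher is better, e.g. validation accuracy per exit head)."""
    if len(history) <= patience:
        return False  # not enough epochs yet to fill a patience window
    num_exits = len(history[0])
    cutoff = len(history) - patience
    # Best score per exit recorded before the patience window.
    best_before = [
        max(epoch[e] for epoch in history[:cutoff]) for e in range(num_exits)
    ]
    # Stop only if, within the window, no exit beat its earlier best.
    return all(
        epoch[e] <= best_before[e]
        for epoch in history[cutoff:]
        for e in range(num_exits)
    )
```

Note the condition is per-exit: a single exit improving resets the clock for the whole model, which matches the quoted "none of the exits achieve an improved performance" wording.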
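The optimizer and scheduler quoted under "Software Dependencies" and "Experiment Setup" (AdamW at lr 5e-4 with no weight decay, plus Cosine Annealing) can be wired up in PyTorch as below. This is a sketch under stated assumptions: the `Linear` stand-in replaces the paper's multi-exit backbones (ResNet-34, ViT-T), the `T_max` horizon and epoch count are placeholders, and the per-batch training work is elided.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical stand-in for the paper's multi-exit models
# (ResNet-34 / ViT-T backbones with attached exit heads).
model = torch.nn.Linear(10, 100)

# Hyperparameters quoted for the CIFAR-100 setups:
# learning rate 5e-4, no weight decay.
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.0)
scheduler = CosineAnnealingLR(optimizer, T_max=200)  # T_max is an assumption

for epoch in range(200):
    # ... one training epoch (CutMix/Mixup batches, per-exit losses) ...
    optimizer.step()      # placeholder; real steps happen per batch
    scheduler.step()      # anneal the learning rate once per epoch
```

Stepping the scheduler once per epoch decays the learning rate along a half-cosine from 5e-4 toward zero over `T_max` epochs; the paper does not state `T_max`, so any reproduction would need to recover it from the released code.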