How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies
Authors: Piotr Kubaty, Bartosz Wójcik, Bartłomiej Tomasz Krzepkowski, Monika Michaluk, Tomasz Trzcinski, Jary Pomponi, Kamil Adamczewski
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of the early exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy. (...) 4. Empirical Evaluation of Training Regimes |
| Researcher Affiliation | Collaboration | 1Jagiellonian University 2Warsaw University of Technology 3University of Warsaw 4Tooploox 5IDEAS Research Institute 6Department of Information Engineering, Electronics, and Telecommunications (DIET) at Sapienza, University of Rome, Italy 7Wroclaw University of Science and Technology. |
| Pseudocode | No | The paper defines three phases of training using mathematical equations (Equations 1, 2, 3) and describes the training regimes in terms of these phases. However, it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | In this section, we outline the setup for our empirical experiments. We release the source code of our experiments at: https://github.com/kamadforge/early-exit-benchmark. A more detailed description can be found in Appendix E. |
| Open Datasets | Yes | For CV, we utilize CIFAR-100 (Krizhevsky, 2009), ImageNet-1k (Russakovsky et al., 2015), Tiny ImageNet (Le & Yang, 2015), and Imagenette (Howard, 2019). For NLP, we evaluate on 20-Newsgroups (Lang, 1995) and STS-B (Wang et al., 2019) datasets. |
| Dataset Splits | No | To ensure fair convergence across different regimes, we incorporate an early stopping mechanism. Training is terminated only when, over n consecutive epochs, none of the exits achieve an improved performance compared to their best scores recorded thus far. These scores (accuracy for classification tasks and loss for regression tasks) are evaluated on a dedicated early-stopping validation set. |
| Hardware Specification | Yes | E.5. ViT-T, ImageNet-1k: We train each model using 4 A100 GPUs with an effective batch size of 2048. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and Cosine Annealing scheduler, and pretrained weights from torchvision (maintainers & contributors, 2016). However, specific version numbers for software libraries like PyTorch, torchvision, or other dependencies are not provided, only the initial publication year for torchvision. |
| Experiment Setup | Yes | E.1. ResNet-34, CIFAR-100: Training set-up. We train each model with a batch size of 128. We use a learning rate of 5e-4 and no weight decay. We set the early stopping patience to 50 epochs. CutMix and Mixup are used as augmentations. (...) E.3. ViT-T, CIFAR-100: Training set-up. We train each model with a batch size of 256. We use a learning rate of 5e-4 and no weight decay. We set the early stopping patience to 30 epochs. The following augmentations are used: random resizing, cropping, rotation, contrast adjustment, random erasing, CutMix and Mixup. |
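The early-stopping rule quoted under "Dataset Splits" (stop only when, over n consecutive epochs, no exit improves on its best validation score so far) can be sketched as follows. This is a minimal illustration of that patience logic, not the paper's implementation; `should_stop` and the shape of `history` are hypothetical names chosen here, and scores are assumed to be higher-is-better (e.g. per-exit validation accuracy).

```python
def should_stop(history, patience):
    """Multi-exit early stopping, per the rule quoted above: stop only
    when, over `patience` consecutive epochs, none of the exits improve
    on their best score recorded before that window.

    `history` is a list of per-epoch score lists, one score per exit
    (higher is better, e.g. validation accuracy per exit head)."""
    if len(history) <= patience:
        return False  # not enough epochs yet to fill a patience window
    num_exits = len(history[0])
    cutoff = len(history) - patience
    # Best score per exit recorded before the patience window.
    best_before = [
        max(epoch[e] for epoch in history[:cutoff]) for e in range(num_exits)
    ]
    # Stop only if, within the window, no exit beat its earlier best.
    return all(
        epoch[e] <= best_before[e]
        for epoch in history[cutoff:]
        for e in range(num_exits)
    )
```

Note the condition is per-exit: a single exit improving resets the clock for the whole model, which matches the quoted "none of the exits achieve an improved performance" wording.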
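The optimizer and scheduler quoted under "Software Dependencies" and "Experiment Setup" (AdamW at lr 5e-4 with no weight decay, plus Cosine Annealing) can be wired up in PyTorch as below. This is a sketch under stated assumptions: the `Linear` stand-in replaces the paper's multi-exit backbones (ResNet-34, ViT-T), the `T_max` horizon and epoch count are placeholders, and the per-batch training work is elided.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical stand-in for the paper's multi-exit models
# (ResNet-34 / ViT-T backbones with attached exit heads).
model = torch.nn.Linear(10, 100)

# Hyperparameters quoted for the CIFAR-100 setups:
# learning rate 5e-4, no weight decay.
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=0.0)
scheduler = CosineAnnealingLR(optimizer, T_max=200)  # T_max is an assumption

for epoch in range(200):
    # ... one training epoch (CutMix/Mixup batches, per-exit losses) ...
    optimizer.step()      # placeholder; real steps happen per batch
    scheduler.step()      # anneal the learning rate once per epoch
```

Stepping the scheduler once per epoch decays the learning rate along a half-cosine from 5e-4 toward zero over `T_max` epochs; the paper does not state `T_max`, so any reproduction would need to recover it from the released code.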