Accelerated Training on Low-Power Edge Devices

Authors: Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Osama Abboud, Ramin Khalili, Heba Khdr, Joerg Henkel

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation on real hardware shows that our method outperforms the current baselines that depend on state-of-the-art techniques, reducing the training time by up to 2.3×, with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process.
Researcher Affiliation | Collaboration | Mohamed Aboelenien Ahmed (EMAIL), Karlsruhe Institute of Technology; Kilian Pfeiffer (EMAIL), Karlsruhe Institute of Technology; Osama Abboud (EMAIL), Huawei Research Center Munich; Ramin Khalili (EMAIL), Huawei Research Center Munich; Heba Khdr (EMAIL), Karlsruhe Institute of Technology; Jörg Henkel (EMAIL), Karlsruhe Institute of Technology
Pseudocode | Yes | Algorithm 1: Batch size and GPU frequency selection
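The paper's Algorithm 1 itself is not reproduced in this report. Purely as an illustration of what a batch-size and GPU-frequency selection loop can look like, here is a minimal grid-search sketch; the candidate sets, the `measure()` cost model, and the selection criterion are all assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical sketch of a batch-size / GPU-frequency selection loop.
# The candidate sets, the measure() cost model, and the criterion are
# illustrative assumptions -- NOT the paper's Algorithm 1.

def measure(batch_size, freq_mhz):
    """Stand-in for an on-device profiling step: returns an estimated
    time per training epoch (seconds) at this configuration."""
    samples = 10_000                          # toy dataset size
    steps = samples / batch_size
    step_time = 0.002 + batch_size * 1e-4     # fixed overhead + compute
    return steps * step_time * (921 / freq_mhz)  # slower at lower clocks

def select_config(batch_sizes, freqs, mem_limit=128):
    """Pick the (batch size, GPU frequency) pair with the lowest
    estimated epoch time, subject to a memory constraint."""
    best = None
    for b in batch_sizes:
        if b > mem_limit:                     # skip configs that don't fit
            continue
        for f in freqs:
            t = measure(b, f)
            if best is None or t < best[0]:
                best = (t, b, f)
    return best

# Example: Jetson-Nano-like candidate clocks (MHz) and power-of-two batches.
best = select_config([4, 8, 16, 32, 64, 128], [230, 460, 691, 921])
print(best)
```

Under this toy cost model the per-sample overhead shrinks with larger batches, so the search settles on the largest batch that fits and the highest clock; a real on-device profiler would replace `measure()` with actual timing and energy readings.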
Open Source Code | No | The paper cites 'char-rnn. https://github.com/karpathy/char-rnn, 2015.' for a dataset/model it uses, but provides no statement or link for the source code of its own methodology.
Open Datasets | Yes | For image classification, a model is pre-trained on the full CIFAR100 Krizhevsky et al. (2009) and then trained on subsets (i.e., a quarter) of the SVHN and CINIC datasets on the device. We use a subset of CIFAR10 Krizhevsky et al. (2009) as the proxy dataset on the server. For the next-character prediction task, we evaluate a 6-layer transformer with 6 attention heads per attention block, 256 embedding dimensions, and a sequence length of 64. We use AdamW Loshchilov & Hutter (2017) as the optimizer. We pre-train the model on the WikiText-2 dataset Merity et al. (2016), utilize Tiny Shakespeare Karpathy (2015) as a proxy dataset, and train it on some Jane Austen and Charles Dickens novels.
Dataset Splits | Yes | Each dataset (subset) is divided into training and testing sets with an 80/20 split, with target accuracy evaluated on the test split. ... We use 90% of a dataset for training and the rest for testing.
Hardware Specification | Yes | We evaluate our method on an Nvidia Jetson Nano with 4 GB memory in three scenarios. ... To show that our solution and our results are not device specific, we also provide an evaluation on another device (i.e., Nvidia Jetson TX2NX). ... For pretraining and proxy-dataset training we use an NVIDIA A6000 GPU with PyTorch 2.1.
Software Dependencies | Yes | We use PyTorch 1.10 Paszke et al. (2019) for the Jetson Nano and TX2NX. For pretraining and proxy-dataset training we use an NVIDIA A6000 GPU with PyTorch 2.1.
Experiment Setup | Yes | We use the Adam optimizer Kingma & Ba (2015). ... The choice of appropriate learning rate and batch size are often intertwined, as they impact each other's effectiveness. Larger batch sizes provide more stable gradient estimates, potentially permitting the use of higher learning rates. Therefore, to preserve the performance of deep models with different batch sizes, we apply learning rate scaling (i.e., square-root scaling Krizhevsky (2014) for Adam and AdamW). For ResNet18, the batch sizes ranged from 4 to 128, consisting exclusively of powers of two. An initial learning rate of 5 × 10⁻⁴ is used for the largest batch size of 128 (with learning rates scaled for other batch sizes). The same setup was also applied to MobileNetV2 and EfficientViT; however, the batch size of 128 was omitted due to memory constraints. For transformers, we similarly consider batch sizes of 4 to 128 with a learning rate of 1 × 10⁻³ for the batch size of 128.
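The square-root learning-rate scaling quoted above (Krizhevsky, 2014, applied here to Adam/AdamW) can be sketched as follows. The reference values match the quoted setup (5 × 10⁻⁴ at batch size 128); the helper name is mine, not the paper's.

```python
import math

def scaled_lr(batch_size, base_lr=5e-4, base_batch=128):
    """Square-root learning-rate scaling: keep lr / sqrt(batch_size)
    constant relative to the reference (base_lr at base_batch)."""
    return base_lr * math.sqrt(batch_size / base_batch)

# Reference config from the quoted setup: lr 5e-4 at batch size 128,
# scaled down for the smaller power-of-two batch sizes.
for b in [4, 8, 16, 32, 64, 128]:
    print(b, scaled_lr(b))
```

For example, halving the batch size four times (128 → 8) divides the learning rate by 4, since sqrt(8/128) = 1/4.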