MeRino: Entropy-Driven Design for Generative Language Models on IoT Devices

Authors: Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9× faster on NVIDIA Jetson Nano with a 5.5× reduction in model size.
Researcher Affiliation | Collaboration | Youpeng Zhao (1), Ming Lin (2), Huadong Tang (3), Qiang Wu (3), Jun Wang (1). 1: Department of Computer Science, University of Central Florida; 2: Independent Researcher; 3: School of Electrical and Data Engineering, University of Technology Sydney.
Pseudocode | No | To solve this optimization problem, we employ an Evolutionary Algorithm (EA) (Reeves 2007). Note that Eq. (15) can be solved by any non-linear programming solver in principle. We choose EA due to its simplicity. Due to the page limit, detailed descriptions of the EA and mutation algorithm are omitted.
Open Source Code | No | The paper discusses various models and datasets but does not explicitly state that the authors' implementation code for MeRino is open-source, nor does it provide a link to a code repository for their method.
Open Datasets | Yes | For pre-training, we use the publicly available Pile dataset (Gao et al. 2020), which is pre-processed by removing duplication and tokenized using byte-level encoding. For evaluation, we evaluate our models across fourteen different downstream NLP tasks, namely WikiText2 (Merity et al. 2016), Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2019), OpenBookQA (Mihaylov et al. 2018), ARC (Clark et al. 2018), PubMedQA (Jin et al. 2019), LogiQA (Liu et al. 2020), and the SuperGLUE benchmark (Wang et al. 2019): BoolQ, CB, WiC, WSC, and RTE.
Dataset Splits | No | The paper mentions using specific datasets for pre-training and evaluation, but it does not specify the exact training, validation, or test splits used for these datasets, such as percentages, sample counts, or references to predefined split files.
Hardware Specification | Yes | Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9× faster on NVIDIA Jetson Nano with a 5.5× reduction in model size... We conduct latency experiments on an NVIDIA Jetson Nano GPU (8 GB)... and train each model from scratch for 600k steps... on 8 NVIDIA H100 80 GB GPUs... it only takes about 0.05 hours to run on NVIDIA Jetson Nano, while TE-NAS consumes around 1.2 hours with a single NVIDIA GTX 1080Ti GPU.
Software Dependencies | No | For the learning rate schedule, we follow (Biderman et al. 2023) and adopt the AdamW (Loshchilov and Hutter 2017) optimizer... For evaluation, we adopt the codebase of lm-evaluation-harness (Gao et al. 2021) for a fair comparison.
Experiment Setup | Yes | We follow the settings in (Zhang et al. 2022) and train each model from scratch for 600k steps with an effective batch size of 1024 and a sequence length of 1024 on 8 NVIDIA H100 80 GB GPUs. For the learning rate schedule, we follow (Biderman et al. 2023) and adopt the AdamW (Loshchilov and Hutter 2017) optimizer, with a starting learning rate of 6e-4, warm-up steps of 1000, and linear learning rate decay.