MeRino: Entropy-Driven Design for Generative Language Models on IoT Devices

Authors: Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9× faster on NVIDIA Jetson Nano with a 5.5× reduction in model size.
Researcher Affiliation | Collaboration | Youpeng Zhao (1), Ming Lin (2), Huadong Tang (3), Qiang Wu (3), Jun Wang (1). 1: Department of Computer Science, University of Central Florida; 2: Independent Researcher; 3: School of Electrical and Data Engineering, University of Technology Sydney.
Pseudocode | No | To solve this optimization problem, we employ an Evolutionary Algorithm (EA) (Reeves 2007). Note that Eq. (15) can be solved by any non-linear programming solver in principle. We choose EA due to its simplicity. Due to the page limit, detailed descriptions of the EA and mutation algorithm are omitted.
Open Source Code | No | The paper discusses various models and datasets but does not explicitly state that the authors' implementation code for MeRino is open-source, nor does it provide a link to a code repository for their method.
Open Datasets | Yes | For pre-training, we use the publicly available Pile dataset (Gao et al. 2020), which is pre-processed by removing duplication and tokenized using byte-level encoding. For evaluation, we evaluate our models across fourteen different downstream NLP tasks, namely WikiText2 (Merity et al. 2016), Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), HellaSwag (Zellers et al. 2019), WinoGrande (Sakaguchi et al. 2019), OpenBookQA (Mihaylov et al. 2018), ARC (Clark et al. 2018), PubMedQA (Jin et al. 2019), LogiQA (Liu et al. 2020), and the SuperGLUE benchmark (Wang et al. 2019): BoolQ, CB, WiC, WSC, and RTE.
Dataset Splits | No | The paper mentions using specific datasets for pre-training and evaluation, but it does not specify the exact training, validation, or test splits used for these datasets, such as percentages, sample counts, or references to predefined split files.
Hardware Specification | Yes | Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9× faster on NVIDIA Jetson Nano with a 5.5× reduction in model size... We conduct latency experiments on an NVIDIA Jetson Nano GPU (8 GB)... and train each model from scratch for 600k steps... on 8 NVIDIA H100 80 GB GPUs... it only takes about 0.05 hours to run on NVIDIA Jetson Nano, while TE-NAS consumes around 1.2 hours with a single NVIDIA GTX 1080Ti GPU.
Software Dependencies | No | For the learning rate schedule, we follow (Biderman et al. 2023) and adopt the AdamW (Loshchilov and Hutter 2017) optimizer... For evaluation, we adopt the codebase of lm-evaluation-harness (Gao et al. 2021) for a fair comparison.
Experiment Setup | Yes | We follow the settings in (Zhang et al. 2022) and train each model from scratch for 600k steps with an effective batch size of 1024 and a sequence length of 1024 on 8 NVIDIA H100 80 GB GPUs. For the learning rate schedule, we follow (Biderman et al. 2023) and adopt the AdamW (Loshchilov and Hutter 2017) optimizer, with a starting learning rate of 6e-4, warm-up steps of 1000, and linear learning rate decay.