DON’T STOP ME NOW: EMBEDDING BASED SCHEDULING FOR LLMS
Authors: Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our refined predictions from layer embeddings achieve 2.66x lower mean absolute error compared to BERT predictions from sequence prompts. TRAIL achieves 1.66x to 2.01x lower mean latency on the Alpaca dataset and 1.76x to 24.07x lower mean time to first token compared to state-of-the-art serving systems. |
| Researcher Affiliation | Academia | Rana Shahout (Harvard University), Eran Malach (Harvard University), Chunwei Liu (MIT), Weifan Jiang (Harvard University), Minlan Yu (Harvard University), Michael Mitzenmacher (Harvard University) |
| Pseudocode | No | The paper discusses algorithmic concepts like SPRPT with limited preemption and provides a closed-form formula for it, but it does not present any structured pseudocode or algorithm blocks with step-by-step procedures in the main text or appendices. |
| Open Source Code | No | The paper states: "Our implementation is based on the open-source vLLM system (v0.5.0)", referring to a third-party system they build on, not their own code release. No explicit statement of a code release or a link to the authors' own source-code repository is provided. |
| Open Datasets | Yes | The workload is generated using the Alpaca dataset (Taori et al., 2023), derived from open-source conversational exchanges and originally used to fine-tune the Alpaca model. We sample 10k unique prompts from the dataset for model serving, distinct from those used to train the length predictor. Citing: Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. |
| Dataset Splits | Yes | The dataset was split such that 75% of the prompts were used for training and the remaining 25% for evaluation. |
| Hardware Specification | Yes | The evaluation is conducted on a server with a single NVIDIA A100 80GB GPU and 64 AMD EPYC 7313 16-Core Processor cores, with 503 GiB of memory and running CUDA version 12.3. For testing multi-GPU settings, we used a machine with dual AMD EPYC 7313 CPUs (16 cores per CPU, totaling 64 threads), 503 GB of RAM, and two NVIDIA A100 GPUs with 80 GB memory each, connected via NVLink. |
| Software Dependencies | Yes | Our implementation is based on the open-source vLLM system (v0.5.0), with chunked prefill enabled in both our scheduler and the baseline scheduling methods. The evaluation is conducted on a server with a single NVIDIA A100 80GB GPU and 64 AMD EPYC 7313 16-Core Processor cores, with 503 GiB of memory and running CUDA version 12.3. |
| Experiment Setup | Yes | We use Llama3-8B-instruct as the serving model on a single GPU. The model is trained over 30 epochs with a batch size of 32, using the AdamW optimizer to control for overfitting. We employ a cosine annealing schedule to reduce the learning rate gradually from 0.01 to 0. The loss function is Cross Entropy Loss, appropriate for our multi-class classification task. We set the out-of-memory mode to discard jobs and recompute them once memory becomes available. |
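The paper's core scheduling idea, SPRPT (Shortest Predicted Remaining Processing Time), is discussed without pseudocode. A minimal sketch of the selection rule follows; all function and variable names are our own assumptions, and the predictor itself (refined from layer embeddings in the paper) as well as the limited-preemption mechanics are out of scope here:

```python
import heapq

def sprpt_schedule(jobs):
    """Order jobs by predicted remaining length (SPRPT rule).

    `jobs` is a list of (job_id, predicted_remaining_tokens) pairs;
    shorter predicted jobs are served first. This is an illustrative
    sketch, not the paper's implementation.
    """
    heap = [(pred, job_id) for job_id, pred in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, job_id = heapq.heappop(heap)
        order.append(job_id)
    return order

# Jobs with the shortest predicted output are scheduled first:
# sprpt_schedule([("a", 120), ("b", 30), ("c", 75)])  # -> ["b", "c", "a"]
```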
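The dataset preparation (sampling unique Alpaca prompts, then a 75/25 train/eval split) can be sketched as below; the helper name, seeding, and de-duplication step are our assumptions rather than details stated by the paper:

```python
import random

def sample_and_split(prompts, n_sample=10_000, train_frac=0.75, seed=0):
    """Sample unique prompts, then split them 75/25 into train/eval sets."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(prompts))  # de-duplicate, preserving order
    sampled = rng.sample(unique, min(n_sample, len(unique)))
    cut = int(len(sampled) * train_frac)
    return sampled[:cut], sampled[cut:]
```

Sampling without replacement keeps the serving prompts distinct, matching the paper's note that served prompts are separate from those used to train the length predictor.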
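The serving stack is vLLM v0.5.0 with chunked prefill enabled. A plausible launch command under that assumption is shown below; the model identifier is our guess for Llama3-8B-instruct and is not stated in the table:

```shell
# Launch a vLLM (v0.5.0) OpenAI-compatible server with chunked prefill.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-chunked-prefill
```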
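The stated training recipe (cosine annealing from 0.01 to 0 over 30 epochs) implies a standard learning-rate curve that can be written down without any framework; the helper name is ours:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=30, lr_max=0.01, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr_max to lr_min.

    Standard schedule: lr = lr_min + (lr_max - lr_min) * (1 + cos(pi*t/T)) / 2
    """
    cos_term = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term

# Starts at 0.01, passes through 0.005 at the midpoint, ends at 0.
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=30` and `eta_min=0`, though the paper does not name a framework.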