DON’T STOP ME NOW: EMBEDDING BASED SCHEDULING FOR LLMS
Authors: Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our refined predictions from layer embeddings achieve 2.66x lower mean absolute error compared to BERT predictions from sequence prompts. TRAIL achieves 1.66x to 2.01x lower mean latency on the Alpaca dataset and 1.76x to 24.07x lower mean time to first token compared to state-of-the-art serving systems. |
| Researcher Affiliation | Academia | Rana Shahout (Harvard University), Eran Malach (Harvard University), Chunwei Liu (MIT), Weifan Jiang (Harvard University), Minlan Yu (Harvard University), Michael Mitzenmacher (Harvard University) |
| Pseudocode | No | The paper discusses algorithmic concepts like SPRPT with limited preemption and provides a closed-form formula for it, but it does not present any structured pseudocode or algorithm blocks with step-by-step procedures in the main text or appendices. |
| Open Source Code | No | The paper states: "Our implementation is based on the open-source vLLM system (v0.5.0)", referring to a third-party system they build on, not their own code release. No explicit statement of a code release or a link to the authors' own source-code repository is provided. |
| Open Datasets | Yes | The workload is generated using the Alpaca dataset (Taori et al., 2023), derived from open-source conversational exchanges and originally used to fine-tune the Alpaca model. We sample 10k unique prompts from the dataset for model serving, distinct from those used to train the length predictor. Citing: Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. |
| Dataset Splits | Yes | The dataset was split such that 75% of the prompts were used for training and the remaining 25% for evaluation. |
| Hardware Specification | Yes | The evaluation is conducted on a server with a single NVIDIA A100 80GB GPU and 64 AMD EPYC 7313 16-Core Processor cores, with 503 GiB of memory and running CUDA version 12.3. For testing multi-GPU settings, we used a machine with dual AMD EPYC 7313 CPUs (16 cores per CPU, totaling 64 threads), 503 GB of RAM, and two NVIDIA A100 GPUs with 80 GB memory each, connected via NVLink. |
| Software Dependencies | Yes | Our implementation is based on the open-source vLLM system (v0.5.0), with chunked prefill enabled in both our scheduler and the baseline scheduling methods. The evaluation is conducted on a server with a single NVIDIA A100 80GB GPU and 64 AMD EPYC 7313 16-Core Processor cores, with 503 GiB of memory and running CUDA version 12.3. |
| Experiment Setup | Yes | We use Llama3-8B-instruct as the serving model on a single GPU. The model is trained over 30 epochs with a batch size of 32, using the AdamW optimizer to control for overfitting. We employ a cosine annealing schedule to reduce the learning rate gradually from 0.01 to 0. The loss function is Cross Entropy Loss, appropriate for our multi-class classification task. We set the out-of-memory mode to discard jobs and recompute them once memory becomes available. |
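The paper's core scheduling idea, SPRPT (Shortest Predicted Remaining Processing Time), is discussed without pseudocode. A minimal sketch of the selection rule follows; all function and variable names are our own assumptions, and the predictor itself (refined from layer embeddings in the paper) as well as the limited-preemption mechanics are out of scope here:

```python
import heapq

def sprpt_schedule(jobs):
    """Order jobs by predicted remaining length (SPRPT rule).

    `jobs` is a list of (job_id, predicted_remaining_tokens) pairs;
    shorter predicted jobs are served first. This is an illustrative
    sketch, not the paper's implementation.
    """
    heap = [(pred, job_id) for job_id, pred in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, job_id = heapq.heappop(heap)
        order.append(job_id)
    return order

# Jobs with the shortest predicted output are scheduled first:
# sprpt_schedule([("a", 120), ("b", 30), ("c", 75)])  # -> ["b", "c", "a"]
```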
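The dataset preparation (sampling unique Alpaca prompts, then a 75/25 train/eval split) can be sketched as below; the helper name, seeding, and de-duplication step are our assumptions rather than details stated by the paper:

```python
import random

def sample_and_split(prompts, n_sample=10_000, train_frac=0.75, seed=0):
    """Sample unique prompts, then split them 75/25 into train/eval sets."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(prompts))  # de-duplicate, preserving order
    sampled = rng.sample(unique, min(n_sample, len(unique)))
    cut = int(len(sampled) * train_frac)
    return sampled[:cut], sampled[cut:]
```

Sampling without replacement keeps the serving prompts distinct, matching the paper's note that served prompts are separate from those used to train the length predictor.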
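The serving stack is vLLM v0.5.0 with chunked prefill enabled. A plausible launch command under that assumption is shown below; the model identifier is our guess for Llama3-8B-instruct and is not stated in the table:

```shell
# Launch a vLLM (v0.5.0) OpenAI-compatible server with chunked prefill.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-chunked-prefill
```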
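The stated training recipe (cosine annealing from 0.01 to 0 over 30 epochs) implies a standard learning-rate curve that can be written down without any framework; the helper name is ours:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=30, lr_max=0.01, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr_max to lr_min.

    Standard schedule: lr = lr_min + (lr_max - lr_min) * (1 + cos(pi*t/T)) / 2
    """
    cos_term = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term

# Starts at 0.01, passes through 0.005 at the midpoint, ends at 0.
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=30` and `eta_min=0`, though the paper does not name a framework.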