Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Authors: Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency.
Researcher Affiliation | Collaboration | 1 University of California, San Diego; 2 Gensee AI Inc.
Pseudocode | Yes | Algorithm 1: E2 Global Scheduling Algorithm; Algorithm 2: GPU Load Cost Calculation; Algorithm 3: E2 Local Scheduling Algorithm
Open Source Code | Yes | Preble is publicly available at https://github.com/WukLab/preble.
Open Datasets | Yes | We evaluate the Toolbench Guo et al. (2024) dataset, which consists of more than 210k queries that call over 16k unique tools. The workload we utilize is sourced from the ALFWorld Shridhar et al. (2021) dataset and has 7.5k requests. We study the APPS competitive programming dataset Hendrycks et al. (2021), a dataset of programming problems. To study this, we analyze the NExT-QA benchmark Xiao et al. (2021), which consists of 8.5K questions for 1000 video segments. We evaluate this usage with the LooGLE dataset Li et al. (2023a), a collection of 776 long documents and over 6.4k questions. To understand LLM usage in the wild, we analyze the recently released Azure LLM Inference Trace Patel et al. (2024).
Dataset Splits | No | For each workload, we sample enough requests to fulfill the request-per-second (RPS) needs and GPU setup (e.g., a larger GPU or more GPUs can handle more). For experiments other than the ones using the Azure Inference Trace, we set the inter-arrival time using a Poisson distribution with a mean set to the target RPS (the X-axis in most figures). We then run the experiments until a stable state is reached, and then for a significant duration.
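The Poisson arrival process described above can be sketched by drawing exponentially distributed inter-arrival gaps, which yields Poisson-distributed request counts per unit time. This is an illustrative sketch, not Preble's load generator; the function name and parameters are assumptions.

```python
import random

def poisson_arrival_times(rps: float, num_requests: int, seed: int = 0):
    """Sample request arrival timestamps (seconds) for a Poisson
    process with mean rate `rps` requests per second: inter-arrival
    gaps are exponentially distributed with mean 1/rps."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(rps)  # exponential gap, mean 1/rps seconds
        times.append(t)
    return times

arrivals = poisson_arrival_times(rps=4.0, num_requests=1000)
mean_gap = arrivals[-1] / len(arrivals)  # empirical mean gap, ~1/4 s
```

Replaying such a trace against the serving system (sleeping until each timestamp before issuing the next request) reproduces the bursty arrivals a fixed-interval generator would miss.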
Hardware Specification | Yes | We run our experiments in one of two environments: a two-server cluster with two NVIDIA A6000 GPUs and one server with eight NVIDIA H100 GPUs. Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency.
Software Dependencies | Yes | We implement Preble as a standalone layer on top of slightly modified vLLM Kwon et al. (2023) and SGLang Zheng et al. (2023b), two popular open-source LLM serving systems that both support single-GPU prefix caching. We evaluate the latest SGLang version (v0.3) Team (2024b) and the latest FlashInfer kernel (v1.6) Team (2024a).
Experiment Setup | Yes | We capture a recent load history on GPU_i with a time window H with a default value of 3 minutes (we test different H lengths and find the results insensitive to it). If the most heavily loaded GPU's load is more than Th_bal times higher than the lightest GPU's, it shifts load from the former to the latter until their difference is below Th_bal. Th_bal is configurable and can be deduced from profiling GPU and LLM types. Specifically, we create P (a configurable parameter) priority groups and assign a request to a priority group according to its cached-token percentage.
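The two mechanisms quoted above, the Th_bal rebalancing check and the cached-token-percentage priority groups, can be sketched as below. This is a hypothetical illustration under stated assumptions; the names (`needs_rebalance`, `priority_group`, `th_bal`) are not Preble's actual identifiers, and real load values would come from the windowed load history H.

```python
def needs_rebalance(gpu_loads, th_bal=2.0):
    """Return (src, dst) GPU indices to shift load between when the
    heaviest GPU's load exceeds th_bal times the lightest GPU's;
    return None when the cluster is within the balance threshold."""
    heaviest = max(range(len(gpu_loads)), key=lambda i: gpu_loads[i])
    lightest = min(range(len(gpu_loads)), key=lambda i: gpu_loads[i])
    if gpu_loads[lightest] > 0 and gpu_loads[heaviest] > th_bal * gpu_loads[lightest]:
        return heaviest, lightest
    return None

def priority_group(cached_token_pct: float, num_groups: int = 4):
    """Map a request's cached-token fraction (0.0..1.0) to one of P
    priority groups; a higher cache-hit fraction lands in a higher group."""
    return min(int(cached_token_pct * num_groups), num_groups - 1)
```

A scheduler loop would call `needs_rebalance` periodically and migrate requests from the returned source GPU until the check passes, while `priority_group` buckets incoming requests so high-cache-hit requests are dispatched ahead of cold ones.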