Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Authors: Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency. |
| Researcher Affiliation | Collaboration | 1University of California, San Diego, 2Gensee AI Inc. |
| Pseudocode | Yes | Algorithm 1: E2 Global Scheduling Algorithm; Algorithm 2: GPU Load Cost Calculation; Algorithm 3: E2 Local Scheduling Algorithm |
| Open Source Code | Yes | Preble is publicly available at https://github.com/WukLab/preble. |
| Open Datasets | Yes | We evaluate the Toolbench Guo et al. (2024) dataset, which consists of more than 210k queries that call over 16k unique tools. The workload we utilize is sourced from the ALFWorld Shridhar et al. (2021) dataset and has 7.5k requests. We study the APPS competitive programming dataset Hendrycks et al. (2021), a dataset of programming problems. To study this, we analyze the NExT-QA benchmark Xiao et al. (2021), which consists of 8.5K questions for 1000 video segments. We evaluate this usage with the LooGLE dataset Li et al. (2023a), a collection of 776 long documents and over 6.4k questions. To understand LLM usage in the wild, we analyze the recently released Azure LLM Inference Trace Patel et al. (2024). |
| Dataset Splits | No | For each workload, we sample enough requests to fulfill the request-per-second (RPS) needs and GPU setup (e.g., a larger GPU or more GPUs can handle more). For experiments other than the ones using the Azure Inference Trace, we set the inter-arrival time using a Poisson distribution with a mean set to the target RPS (X-axis in most figures). We then run the experiments until a stable state is reached, and then for a significant duration. |
| Hardware Specification | Yes | We run our experiments in one of two environments: a two-server cluster with two NVIDIA A6000 GPUs, and a server with eight NVIDIA H100 GPUs. |
| Software Dependencies | Yes | We implement Preble as a standalone layer on top of slightly modified vLLM Kwon et al. (2023) and SGLang Zheng et al. (2023b), two popular open-source LLM serving systems both supporting single-GPU prefix caching. We evaluate the latest SGLang version (v0.3) Team (2024b) and the latest FlashInfer kernel (v1.6) Team (2024a). |
| Experiment Setup | Yes | We capture a recent load history on GPU_i over a time window H with a default value of 3 minutes (we test different H lengths and find the results insensitive to it). If the most heavily loaded GPU's load is more than Th_bal times higher than the lightest GPU's, it shifts load from the former to the latter until their difference is below Th_bal. Th_bal is configurable and can be deduced from profiling GPU and LLM types. Specifically, we create P (a configurable parameter) priority groups and assign a request to a priority group according to its cached token percentage. |
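The Dataset Splits row describes generating request arrivals from a Poisson process whose mean rate matches the target RPS. A minimal sketch of that arrival-time generation (the function name and seed handling are illustrative, not from the paper):

```python
import random

def poisson_arrival_times(rps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Generate arrival timestamps for a Poisson process with mean rate `rps`.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1/rps seconds, so we accumulate exponential draws.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(rps)  # mean gap = 1 / rps seconds
        times.append(t)
    return times
```

A benchmark driver would sleep until each timestamp before dispatching the corresponding request, which reproduces bursty real-world arrival patterns rather than a fixed-interval load.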
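The Experiment Setup row states the rebalancing rule: when the heaviest GPU's load exceeds Th_bal times the lightest GPU's, load is shifted until the gap closes. A hypothetical sketch of that rule, treating load as abstract numeric units (Preble itself migrates cached prefixes and requests, which this deliberately abstracts away):

```python
def rebalance(loads: dict[str, float], th_bal: float) -> dict[str, float]:
    """Shift load from the heaviest to the lightest GPU until the heaviest
    is no more than th_bal times the lightest. Illustrative only: load is
    modeled as a divisible scalar, and we split the difference evenly."""
    loads = dict(loads)  # do not mutate the caller's view
    while True:
        heavy = max(loads, key=loads.get)
        light = min(loads, key=loads.get)
        if loads[heavy] <= th_bal * loads[light]:
            break
        # Move half the difference, converging toward balance.
        delta = (loads[heavy] - loads[light]) / 2
        loads[heavy] -= delta
        loads[light] += delta
    return loads
```

For example, `rebalance({"gpu0": 100.0, "gpu1": 10.0}, th_bal=2.0)` equalizes the two GPUs at 55.0 each, since a single even split already satisfies the threshold.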
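The same row describes P priority groups keyed on a request's cached-token percentage. A minimal sketch of such a bucketing function (the name and the linear bucketing scheme are assumptions; the paper only specifies that assignment follows cached token percentage):

```python
def priority_group(cached_pct: float, num_groups: int) -> int:
    """Map a cached-token percentage in [0, 100] to a group index in
    [0, num_groups). Requests with higher cache reuse land in
    higher-numbered groups, which a scheduler could serve first."""
    cached_pct = max(0.0, min(100.0, cached_pct))  # clamp out-of-range inputs
    group = int(cached_pct / 100.0 * num_groups)
    return min(group, num_groups - 1)  # 100% maps into the top group
```

With P = 4, a request with no prefix cached falls in group 0 and a fully cached prefix falls in group 3, so the local scheduler can cheaply order requests by expected prefill savings.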