Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Authors: Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency. |
| Researcher Affiliation | Collaboration | 1University of California, San Diego, 2Gensee AI Inc. |
| Pseudocode | Yes | Algorithm 1: E2 Global Scheduling Algorithm; Algorithm 2: GPU Load Cost Calculation; Algorithm 3: E2 Local Scheduling Algorithm |
| Open Source Code | Yes | Preble is publicly available at https://github.com/WukLab/preble. |
| Open Datasets | Yes | We evaluate the Toolbench Guo et al. (2024) dataset, which consists of more than 210k queries that call over 16k unique tools. The workload we utilize is sourced from the ALFWorld Shridhar et al. (2021) dataset and has 7.5k requests. We study the APPS competitive programming dataset Hendrycks et al. (2021), a dataset of programming problems. To study this, we analyze the NExT-QA benchmark Xiao et al. (2021), which consists of 8.5K questions for 1000 video segments. We evaluate this usage with the LooGLE dataset Li et al. (2023a), a collection of 776 long documents and over 6.4k questions. To understand LLM usage in the wild, we analyze the recently released Azure LLM Inference Trace Patel et al. (2024). |
| Dataset Splits | No | For each workload, we sample enough requests to fulfill the request-per-second (RPS) needs and GPU setup (e.g., a larger GPU or more GPUs can handle more). For experiments other than the ones using the Azure Inference Trace, we set the inter-arrival time using a Poisson distribution with a mean set to the target RPS (X-axis in most figures). We then run the experiments until a stable state is reached, and then for a significant duration. |
| Hardware Specification | Yes | We run our experiments in one of two environments: a two-server cluster with two NVIDIA A6000 GPUs, and a server with eight NVIDIA H100 GPUs. |
| Software Dependencies | Yes | We implement Preble as a standalone layer on top of slightly modified vLLM Kwon et al. (2023) and SGLang Zheng et al. (2023b), two popular open-source LLM serving systems both supporting single-GPU prefix caching. We evaluate the latest SGLang version (v0.3) Team (2024b) and the latest FlashInfer kernel (v1.6) Team (2024a). |
| Experiment Setup | Yes | We capture a recent load history on GPU_i over a time window H with a default value of 3 minutes (we test different H lengths and find the results insensitive to it). If the most heavily loaded GPU's load is more than Th_bal times higher than the lightest GPU's, it shifts load from the former to the latter until their difference is below Th_bal. Th_bal is configurable and can be deduced from profiling GPU and LLM types. Specifically, we create P (a configurable parameter) priority groups and assign a request to a priority group according to its cached token percentage. |
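The Dataset Splits row describes generating request arrivals from a Poisson process whose mean rate matches the target RPS. A minimal sketch of that arrival-time generation (the function name and seed handling are illustrative, not from the paper):

```python
import random

def poisson_arrival_times(rps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Generate arrival timestamps for a Poisson process with mean rate `rps`.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1/rps seconds, so we accumulate exponential draws.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(rps)  # mean gap = 1 / rps seconds
        times.append(t)
    return times
```

A benchmark driver would sleep until each timestamp before dispatching the corresponding request, which reproduces bursty real-world arrival patterns rather than a fixed-interval load.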
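The Experiment Setup row states the rebalancing rule: when the heaviest GPU's load exceeds Th_bal times the lightest GPU's, load is shifted until the gap closes. A hypothetical sketch of that rule, treating load as abstract numeric units (Preble itself migrates cached prefixes and requests, which this deliberately abstracts away):

```python
def rebalance(loads: dict[str, float], th_bal: float) -> dict[str, float]:
    """Shift load from the heaviest to the lightest GPU until the heaviest
    is no more than th_bal times the lightest. Illustrative only: load is
    modeled as a divisible scalar, and we split the difference evenly."""
    loads = dict(loads)  # do not mutate the caller's view
    while True:
        heavy = max(loads, key=loads.get)
        light = min(loads, key=loads.get)
        if loads[heavy] <= th_bal * loads[light]:
            break
        # Move half the difference, converging toward balance.
        delta = (loads[heavy] - loads[light]) / 2
        loads[heavy] -= delta
        loads[light] += delta
    return loads
```

For example, `rebalance({"gpu0": 100.0, "gpu1": 10.0}, th_bal=2.0)` equalizes the two GPUs at 55.0 each, since a single even split already satisfies the threshold.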
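The same row describes P priority groups keyed on a request's cached-token percentage. A minimal sketch of such a bucketing function (the name and the linear bucketing scheme are assumptions; the paper only specifies that assignment follows cached token percentage):

```python
def priority_group(cached_pct: float, num_groups: int) -> int:
    """Map a cached-token percentage in [0, 100] to a group index in
    [0, num_groups). Requests with higher cache reuse land in
    higher-numbered groups, which a scheduler could serve first."""
    cached_pct = max(0.0, min(100.0, cached_pct))  # clamp out-of-range inputs
    group = int(cached_pct / 100.0 * num_groups)
    return min(group, num_groups - 1)  # 100% maps into the top group
```

With P = 4, a request with no prefix cached falls in group 0 and a fully cached prefix falls in group 3, so the local scheduler can cheaply order requests by expected prefill savings.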