Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU

Authors: Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Evaluations demonstrate that Rec OS improves online service performance, reducing latency by up to 68%." "Our experiments on multiple RMs show that Rec OS can effectively improve the inference latency." "Our algorithm can significantly reduce inference latency under high concurrency." "Compared to current service frameworks, Rec OS can achieve an inference latency improvement of up to 68%."
Researcher Affiliation | Collaboration | 1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; 2. Meituan, Beijing, China; 3. Pengcheng Laboratory, Shenzhen, China.
Pseudocode | No | The paper describes its algorithms and methods in prose (e.g., in Section 3 Methodology and Section 3.4 Multistream Scheduler) but does not include any clearly labeled pseudocode or algorithm blocks.
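Since the paper presents its multistream scheduler only in prose, a minimal sketch may help convey the core idea of inter-operator scheduling: independent operators in a model's dataflow graph are dispatched to separate execution streams so they can run concurrently, while dependent operators wait for their producers. The sketch below simulates this with a Python thread pool standing in for GPU streams; the operator graph, names, and stream count are invented for illustration and are not taken from the paper.

```python
import concurrent.futures as cf

# Toy operator graph for one recommendation-model inference:
# each operator maps to the list of operators whose outputs it
# consumes. The graph is hypothetical, for illustration only.
DAG = {
    "embed_user": [],
    "embed_item": [],
    "mlp_user":   ["embed_user"],
    "mlp_item":   ["embed_item"],
    "concat":     ["mlp_user", "mlp_item"],
    "top_mlp":    ["concat"],
}

def topo_order(dag):
    """Return operators in dependency order (producers first)."""
    seen, out = set(), []
    def visit(op):
        if op not in seen:
            seen.add(op)
            for dep in dag[op]:
                visit(dep)
            out.append(op)
    for op in dag:
        visit(op)
    return out

def run_graph(dag, num_streams=2):
    """Dispatch each operator to a worker pool (a stand-in for GPU
    streams) as soon as all of its producers have finished, so that
    independent branches execute concurrently."""
    futures, started = {}, []
    with cf.ThreadPoolExecutor(max_workers=num_streams) as pool:
        def run_op(op):
            for dep in dag[op]:      # block until all inputs are ready
                futures[dep].result()
            started.append(op)       # record actual dispatch order
            return op
        for op in topo_order(dag):   # submit producers before consumers
            futures[op] = pool.submit(run_op, op)
        for f in futures.values():
            f.result()
    return started

print(run_graph(DAG))
```

Because operators are submitted in topological order, a consumer is never dequeued before its producers, so the pool cannot deadlock; the two embedding/MLP branches are free to overlap on different workers.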
Open Source Code | No | The paper states: "We implemented Rec OS on TS [Olston et al., 2017a], an open-source machine learning service system designed for production environments." This refers to the base system TS being open source, not Rec OS itself; no statement or link for the source code of Rec OS is provided.
Open Datasets | No | "We evaluated four representative recommendation models from in-house production: W&D [Cheng et al., 2016], DIN [Zhou et al., 2018b], DLRM [Naumov et al., 2019], and BST [Chen et al., 2019]." The evaluation data comes from in-house production; no public dataset names, links, or access information are provided.
Dataset Splits | No | The paper focuses on inference performance under varying concurrency and online traffic conditions, not on model training, and provides no information about training/validation/test dataset splits.
Hardware Specification | Yes | "We deployed our system on a server equipped with an Intel(R) Xeon(R) Platinum 8352Y CPU and an Nvidia A30 GPU (24 GB HBM2 and 56 SMs available), matching our production environment."
Software Dependencies | Yes | "All code was compiled using GCC and nvcc with the -O3 option. We used CUDA driver version 525 and CUDA Toolkit 12.0."
Experiment Setup | Yes | "Experiments were conducted at five levels of concurrency: 1, 8, 15, 22, and 30. The clients keep sending queries after receiving the response of the previous queries from the server." "Besides, we also simulated online traffic to test the performance, focusing on three main types: pulse-type, bimodal, and unimodal traffic distributions."
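The closed-loop client behavior described in the experiment setup (each client issues its next query only after receiving the response to the previous one) can be sketched as follows. The server is mocked with a fixed service time, and the concurrency levels match those reported in the paper; all function names, query counts, and timings are illustrative assumptions, not details from the paper.

```python
import threading
import time

CONCURRENCY_LEVELS = [1, 8, 15, 22, 30]  # levels used in the paper

def mock_infer(service_time_s=0.001):
    """Stand-in for one server-side inference call (assumption:
    the real clients would issue an RPC to the serving framework)."""
    time.sleep(service_time_s)

def closed_loop_client(num_queries, latencies):
    """Send the next query only after the previous response arrives."""
    for _ in range(num_queries):
        t0 = time.perf_counter()
        mock_infer()
        latencies.append(time.perf_counter() - t0)

def run_load(concurrency, num_queries=20):
    """Run `concurrency` closed-loop clients and collect latencies."""
    latencies = []
    threads = [
        threading.Thread(target=closed_loop_client,
                         args=(num_queries, latencies))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

for level in CONCURRENCY_LEVELS:
    lats = run_load(level)
    print(f"concurrency={level:2d}  "
          f"avg latency={sum(lats) / len(lats) * 1e3:.2f} ms")
```

Under this closed-loop model the offered load is bounded by the concurrency level, which is why latency under high concurrency is the interesting regime; open-loop traffic shapes (pulse-type, bimodal, unimodal) would instead vary the query arrival rate over time.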