Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU
Authors: Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations demonstrate that RecOS improves online service performance, reducing latency by up to 68%. Our experiments on multiple RMs show that RecOS can effectively reduce inference latency. Our algorithm can significantly reduce inference latency under high concurrency. Compared to current serving frameworks, RecOS can achieve an inference-latency improvement of up to 68%. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; 2 Meituan, Beijing, China; 3 Pengcheng Laboratory, Shenzhen, China |
| Pseudocode | No | The paper describes algorithms and methods in prose (e.g., in Section 3 Methodology, 3.4 Multistream Scheduler) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We implemented RecOS on TS [Olston et al., 2017a], an open-source machine learning service system designed for production environments.' This refers to the base system TS being open source, not RecOS itself. No explicit statement or link for the source code of RecOS is provided. |
| Open Datasets | No | We evaluated four representative recommendation models from in-house production: W&D [Cheng et al., 2016], DIN [Zhou et al., 2018b], DLRM [Naumov et al., 2019], and BST [Chen et al., 2019]. The datasets used for evaluation are implied to come from 'in-house production'; no public dataset names, links, or access information are provided. |
| Dataset Splits | No | The paper focuses on inference performance under varying concurrency and online traffic conditions, rather than model training. It does not provide information about specific training/test/validation dataset splits. |
| Hardware Specification | Yes | We deployed our system on a server equipped with an Intel(R) Xeon(R) Platinum 8352Y CPU and an Nvidia A30 GPU (24 GB HBM2 and 56 SMs available), matching our production environment. |
| Software Dependencies | Yes | All code was compiled using GCC and nvcc with the -O3 option. We used CUDA driver version 525 and CUDA Toolkit 12.0. |
| Experiment Setup | Yes | Experiments were conducted at five levels of concurrency: 1, 8, 15, 22, and 30. Clients operate in a closed loop, sending the next query only after receiving the server's response to the previous one. In addition, online traffic was simulated to test performance under three main traffic distributions: pulse-type, bimodal, and unimodal. |
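The closed-loop client setup described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's benchmarking harness: `fake_infer` is a hypothetical stand-in for a call to the inference server, and the per-query delay is an arbitrary placeholder.

```python
import threading
import time

def fake_infer(query_id):
    # Hypothetical stand-in for one round trip to the RM inference server.
    time.sleep(0.001)
    return query_id * 2

def closed_loop_client(n_queries, results):
    # Closed-loop behavior: issue the next query only after the
    # response to the previous query has been received.
    for i in range(n_queries):
        results.append(fake_infer(i))

def run_benchmark(concurrency, n_queries_per_client=5):
    # One thread per client; the concurrency level is the number of
    # clients simultaneously driving the server.
    results = []
    threads = [
        threading.Thread(target=closed_loop_client,
                         args=(n_queries_per_client, results))
        for _ in range(concurrency)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return len(results), elapsed

if __name__ == "__main__":
    # The paper's five concurrency levels.
    for level in (1, 8, 15, 22, 30):
        done, secs = run_benchmark(level)
        print(f"concurrency={level}: {done} queries in {secs:.3f}s")
```

Under this closed-loop model, offered load scales with the number of clients rather than with a fixed request rate, which is why latency degradation at high concurrency is the quantity of interest.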