Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU

Authors: Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Evaluations demonstrate that Rec OS improves online service performance, reducing latency by up to 68%." "Our experiments on multiple RMs show that Rec OS can effectively improve the inference latency." "Our algorithm can significantly reduce inference latency under high concurrency." "Compared to current service frameworks, Rec OS can achieve an inference latency improvement of up to 68%."
Researcher Affiliation | Collaboration | 1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; 2. Meituan, Beijing, China; 3. Pengcheng Laboratory, Shenzhen, China.
Pseudocode | No | The paper describes its algorithms and methods in prose (e.g., in Section 3 Methodology and Section 3.4 Multistream Scheduler) but does not include any clearly labeled pseudocode or algorithm blocks.
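Since the paper presents its multistream scheduler only in prose, a minimal sketch may help convey the core idea of inter-operator scheduling: independent operators in a model's dataflow graph are dispatched to separate execution streams so they can run concurrently, while dependent operators wait for their producers. The sketch below simulates this with a Python thread pool standing in for GPU streams; the operator graph, names, and stream count are invented for illustration and are not taken from the paper.

```python
import concurrent.futures as cf

# Toy operator graph for one recommendation-model inference:
# each operator maps to the list of operators whose outputs it
# consumes. The graph is hypothetical, for illustration only.
DAG = {
    "embed_user": [],
    "embed_item": [],
    "mlp_user":   ["embed_user"],
    "mlp_item":   ["embed_item"],
    "concat":     ["mlp_user", "mlp_item"],
    "top_mlp":    ["concat"],
}

def topo_order(dag):
    """Return operators in dependency order (producers first)."""
    seen, out = set(), []
    def visit(op):
        if op not in seen:
            seen.add(op)
            for dep in dag[op]:
                visit(dep)
            out.append(op)
    for op in dag:
        visit(op)
    return out

def run_graph(dag, num_streams=2):
    """Dispatch each operator to a worker pool (a stand-in for GPU
    streams) as soon as all of its producers have finished, so that
    independent branches execute concurrently."""
    futures, started = {}, []
    with cf.ThreadPoolExecutor(max_workers=num_streams) as pool:
        def run_op(op):
            for dep in dag[op]:      # block until all inputs are ready
                futures[dep].result()
            started.append(op)       # record actual dispatch order
            return op
        for op in topo_order(dag):   # submit producers before consumers
            futures[op] = pool.submit(run_op, op)
        for f in futures.values():
            f.result()
    return started

print(run_graph(DAG))
```

Because operators are submitted in topological order, a consumer is never dequeued before its producers, so the pool cannot deadlock; the two embedding/MLP branches are free to overlap on different workers.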
Open Source Code | No | The paper states: "We implemented Rec OS on TS [Olston et al., 2017a], an open-source machine learning service system designed for production environments." This refers to the base system TS being open source, not Rec OS itself; no statement or link for the source code of Rec OS is provided.
Open Datasets | No | "We evaluated four representative recommendation models from in-house production: W&D [Cheng et al., 2016], DIN [Zhou et al., 2018b], DLRM [Naumov et al., 2019], and BST [Chen et al., 2019]." The evaluation data comes from in-house production; no public dataset names, links, or access information are provided.
Dataset Splits | No | The paper focuses on inference performance under varying concurrency and online traffic conditions, not on model training, and provides no information about training/validation/test dataset splits.
Hardware Specification | Yes | "We deployed our system on a server equipped with an Intel(R) Xeon(R) Platinum 8352Y CPU and an Nvidia A30 GPU (24 GB HBM2 and 56 SMs available), matching our production environment."
Software Dependencies | Yes | "All code was compiled using GCC and nvcc with the -O3 option. We used CUDA driver version 525 and CUDA Toolkit 12.0."
Experiment Setup | Yes | "Experiments were conducted at five levels of concurrency: 1, 8, 15, 22, and 30. The clients keep sending queries after receiving the response of the previous queries from the server." "Besides, we also simulated online traffic to test the performance, focusing on three main types: pulse-type, bimodal, and unimodal traffic distributions."
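The closed-loop client behavior described in the experiment setup (each client issues its next query only after receiving the response to the previous one) can be sketched as follows. The server is mocked with a fixed service time, and the concurrency levels match those reported in the paper; all function names, query counts, and timings are illustrative assumptions, not details from the paper.

```python
import threading
import time

CONCURRENCY_LEVELS = [1, 8, 15, 22, 30]  # levels used in the paper

def mock_infer(service_time_s=0.001):
    """Stand-in for one server-side inference call (assumption:
    the real clients would issue an RPC to the serving framework)."""
    time.sleep(service_time_s)

def closed_loop_client(num_queries, latencies):
    """Send the next query only after the previous response arrives."""
    for _ in range(num_queries):
        t0 = time.perf_counter()
        mock_infer()
        latencies.append(time.perf_counter() - t0)

def run_load(concurrency, num_queries=20):
    """Run `concurrency` closed-loop clients and collect latencies."""
    latencies = []
    threads = [
        threading.Thread(target=closed_loop_client,
                         args=(num_queries, latencies))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

for level in CONCURRENCY_LEVELS:
    lats = run_load(level)
    print(f"concurrency={level:2d}  "
          f"avg latency={sum(lats) / len(lats) * 1e3:.2f} ms")
```

Under this closed-loop model the offered load is bounded by the concurrency level, which is why latency under high concurrency is the interesting regime; open-loop traffic shapes (pulse-type, bimodal, unimodal) would instead vary the query arrival rate over time.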