SysBench: Can LLMs Follow System Messages?
Authors: Yanzhao Qin, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, Yujing Qiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, Bin Cui
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive evaluation across various existing LLMs, measuring their ability to follow specified constraints given in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. |
| Researcher Affiliation | Collaboration | 1School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University 2Baichuan Inc. 3Center for Machine Learning Research, Peking University 4Institute of Computational Social Science, Peking University (Qingdao) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described through text and mathematical formulas. |
| Open Source Code | Yes | Source code is available at https://github.com/PKU-Baichuan-MLSystemLab/SysBench. |
| Open Datasets | No | The paper describes the creation of a new dataset for the SysBench benchmark, stating: 'We construct a high-quality dataset focusing on system message following evaluation, which includes 500 system messages, each corresponding to 5 turns of user conversations, covering a variety of application scenarios.' However, no explicit link, DOI, or repository for the dataset itself is provided; the given link is for 'Source code'. |
| Dataset Splits | Yes | The data are categorized into aligned and misaligned instructions, as well as multi-turn dependent and multi-turn parallel dialogues, providing more perspectives for analyzing model performance. As shown in Table 1, the SysBench dataset includes a total of 500 system messages, each with 5 rounds of user conversations. The counts across multi-turn conversation categories are: Parallel sessions (144), Dependent sessions (356), Aligned instructions (1951), Misaligned instructions (549). |
| Hardware Specification | No | The paper mentions evaluating various LLMs but does not provide any specific hardware details (like GPU/CPU models or cloud instance types) used for conducting these evaluations or running inference for the open-source models. |
| Software Dependencies | No | The paper mentions using specific LLMs (e.g., GPT-4o as a verifier) and setting inference parameters, but it does not specify any ancillary software dependencies such as programming languages, libraries, or frameworks with their version numbers. |
| Experiment Setup | Yes | We select GPT-4o as the model-based verifier in the verification stage due to its demonstrated superior quality-price ratio, and set temperature to 0 to ensure deterministic output. During the generation stage, we maintain all inference parameters at their default settings across all scenarios. |
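The split counts quoted in the Dataset Splits row can be checked for internal consistency: the two session categories should sum to the 500 system messages, and the two instruction categories to 500 × 5 = 2500 user turns. A minimal sketch:

```python
# Sanity check on the SysBench split counts reported in the table:
# session-level splits must sum to the 500 system messages, and
# instruction-level splits to 500 messages x 5 user turns each.
system_messages = 500
turns_per_message = 5

sessions = {"parallel": 144, "dependent": 356}
instructions = {"aligned": 1951, "misaligned": 549}

assert sum(sessions.values()) == system_messages
assert sum(instructions.values()) == system_messages * turns_per_message
print(sum(sessions.values()), sum(instructions.values()))  # 500 2500
```

Both totals line up, which supports the "each system message has 5 turns" construction described in the paper.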
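The Experiment Setup row fixes only two verifier-side parameters: the judge model (GPT-4o) and temperature 0 for deterministic output. A hedged sketch of how such a judge request might be assembled follows; the prompt wording and the request shape are assumptions for illustration, not the paper's actual verifier prompt.

```python
# Sketch of a verification-stage request, assuming a chat-style judge API.
# Only the model name ("gpt-4o") and temperature (0) come from the paper;
# the prompt text and field layout below are illustrative assumptions.

def build_verifier_request(system_message: str, user_turn: str,
                           model_response: str) -> dict:
    """Assemble a deterministic judge request for one conversation turn."""
    judge_prompt = (
        "Given the system message and user turn below, decide whether the "
        "model response satisfies every constraint in the system message.\n\n"
        f"System message:\n{system_message}\n\n"
        f"User turn:\n{user_turn}\n\n"
        f"Model response:\n{model_response}\n"
    )
    return {
        "model": "gpt-4o",   # verifier model selected in the paper
        "temperature": 0,    # temperature 0 => deterministic judging
        "messages": [{"role": "user", "content": judge_prompt}],
    }

req = build_verifier_request("Reply only in French.", "Hello!", "Bonjour !")
print(req["model"], req["temperature"])  # gpt-4o 0
```

Pinning temperature to 0 is what makes the model-based verification reproducible across runs, which matters for a benchmark whose scores depend on an LLM judge.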