SysBench: Can LLMs Follow System Messages?
Authors: Yanzhao Qin, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, Yujing Qiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, Bin Cui
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive evaluation across various existing LLMs, measuring their ability to follow specified constraints given in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. |
| Researcher Affiliation | Collaboration | 1School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University 2Baichuan Inc. 3Center for Machine Learning Research, Peking University 4Institute of Computational Social Science, Peking University (Qingdao) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described through text and mathematical formulas. |
| Open Source Code | Yes | Source code is available at https://github.com/PKU-Baichuan-MLSystemLab/SysBench. |
| Open Datasets | No | The paper describes the creation of a new dataset for the SysBench benchmark, stating: 'We construct a high-quality dataset focusing on system message following evaluation, which includes 500 system messages, each corresponding to 5 turns of user conversations, covering a variety of application scenarios.' However, no explicit link, DOI, or repository for the dataset itself is provided; the given link is for 'Source code'. |
| Dataset Splits | Yes | The data are categorized into aligned and misaligned instructions, as well as multi-turn dependent and multi-turn parallel dialogues, providing more perspectives for analyzing model performance. As shown in Table 1, the SysBench dataset includes a total of 500 system messages, each with 5 rounds of user conversations. The counts across multi-turn conversation categories are: Parallel sessions (144), Dependent sessions (356), Aligned instructions (1951), Misaligned instructions (549). |
| Hardware Specification | No | The paper mentions evaluating various LLMs but does not provide any specific hardware details (like GPU/CPU models or cloud instance types) used for conducting these evaluations or running inference for the open-source models. |
| Software Dependencies | No | The paper mentions using specific LLMs (e.g., GPT-4o as a verifier) and setting inference parameters, but it does not specify any ancillary software dependencies such as programming languages, libraries, or frameworks with their version numbers. |
| Experiment Setup | Yes | We select GPT-4o as the model-based verifier in the verification stage due to its demonstrated superior quality-price ratio, and set temperature to 0 to ensure deterministic output. During the generation stage, we maintain all inference parameters at their default settings across all scenarios. |
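The split counts quoted in the Dataset Splits row can be checked for internal consistency: the two session categories should sum to the 500 system messages, and the two instruction categories to 500 × 5 = 2500 user turns. A minimal sketch:

```python
# Sanity check on the SysBench split counts reported in the table:
# session-level splits must sum to the 500 system messages, and
# instruction-level splits to 500 messages x 5 user turns each.
system_messages = 500
turns_per_message = 5

sessions = {"parallel": 144, "dependent": 356}
instructions = {"aligned": 1951, "misaligned": 549}

assert sum(sessions.values()) == system_messages
assert sum(instructions.values()) == system_messages * turns_per_message
print(sum(sessions.values()), sum(instructions.values()))  # 500 2500
```

Both totals line up, which supports the "each system message has 5 turns" construction described in the paper.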
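The Experiment Setup row fixes only two verifier-side parameters: the judge model (GPT-4o) and temperature 0 for deterministic output. A hedged sketch of how such a judge request might be assembled follows; the prompt wording and the request shape are assumptions for illustration, not the paper's actual verifier prompt.

```python
# Sketch of a verification-stage request, assuming a chat-style judge API.
# Only the model name ("gpt-4o") and temperature (0) come from the paper;
# the prompt text and field layout below are illustrative assumptions.

def build_verifier_request(system_message: str, user_turn: str,
                           model_response: str) -> dict:
    """Assemble a deterministic judge request for one conversation turn."""
    judge_prompt = (
        "Given the system message and user turn below, decide whether the "
        "model response satisfies every constraint in the system message.\n\n"
        f"System message:\n{system_message}\n\n"
        f"User turn:\n{user_turn}\n\n"
        f"Model response:\n{model_response}\n"
    )
    return {
        "model": "gpt-4o",   # verifier model selected in the paper
        "temperature": 0,    # temperature 0 => deterministic judging
        "messages": [{"role": "user", "content": judge_prompt}],
    }

req = build_verifier_request("Reply only in French.", "Hello!", "Bonjour !")
print(req["model"], req["temperature"])  # gpt-4o 0
```

Pinning temperature to 0 is what makes the model-based verification reproducible across runs, which matters for a benchmark whose scores depend on an LLM judge.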