Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Nonetheless, we show that even a null model that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks.
Researcher Affiliation Collaboration Xiaosen Zheng 1,2, Tianyu Pang 1, Chao Du 1, Qian Liu 1, Jing Jiang 2,3, Min Lin 1. 1 Sea AI Lab, Singapore; 2 Singapore Management University; 3 Australian National University. EMAIL; EMAIL
Pseudocode Yes Pseudo-code for null models:

class NullModel:
    def __init__(self, const_str):
        # no trainable parameters
        self.output = const_str

    def generate(self, instruct):
        # irrelevant to instructions
        return self.output
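The paper's pseudo-code reduces to a few lines of runnable Python. Below is a self-contained sketch with a placeholder constant string (the paper's actual structured cheating response is not reproduced here):

```python
class NullModel:
    """Null model with no trainable parameters: it returns the same
    constant string regardless of the input instruction."""

    def __init__(self, const_str):
        self.output = const_str  # the fixed (cheating) response

    def generate(self, instruct):
        # the instruction is ignored entirely
        return self.output


# The response never depends on the instruction: both calls below
# return the identical placeholder string.
model = NullModel("<placeholder constant response>")
r1 = model.generate("Write a poem about the sea.")
r2 = model.generate("Explain quicksort.")
```

Because `generate` ignores its argument, a single crafted constant string is submitted for every benchmark instruction, which is what makes the attack transferable to unseen (private) instructions.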
Open Source Code Yes The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
Open Datasets Yes Our experiments utilize the official evaluation templates associated with different LLM-based evaluations unless stated otherwise. We evaluate our cheating method on Alpaca Eval 2.0 (Li et al., 2023c; Dubois et al., 2024), Arena-Hard-Auto (Li et al., 2024b), and MT-Bench (Zheng et al., 2023) as detailed in Table 1.
Dataset Splits Yes For RS, we set the number of training instructions N to 10, 8, and 4, and the number of optimization steps T to 384, 96, and 64 for AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, respectively. The full templates and structured responses for Arena-Hard-Auto and MT-Bench are presented in Figures 10 and 11. The effectiveness of our structured response: as mentioned in Section 3, we employ a structured response to facilitate the cheating, which provides a good initial point and can reduce the optimization cost. To further demonstrate the effectiveness of our structured cheating response, we evaluate log p(winner = NullModel) on a sampled subset of the AlpacaEval 2.0 test instructions using different null responses.
Hardware Specification Yes All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers.
Software Dependencies Yes The targeted auto-annotators include both open-source and closed-source LLMs: Llama-3-8B-Instruct, Llama-3-70B-Instruct (Meta, 2024; Touvron et al., 2023), and GPT-4-1106-Preview (OpenAI, 2023). Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage. All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers.
Experiment Setup Yes Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage. For RS, we set the number of training instructions N to 10, 8, and 4, and the number of optimization steps T to 384, 96, and 64 for AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, respectively.
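The RS (random search) setup quoted above can be sketched as a greedy accept-if-better loop: perturb a candidate constant response, rescore it over the N training instructions, and keep the perturbation only when the total score improves. This is a minimal illustration, not the paper's implementation: `judge_score` is a hypothetical stand-in for the real auto-annotator's log p(winner = NullModel), and the character-level mutation is a simplification of whatever edit operations the actual search uses.

```python
import random

# Per-benchmark RS hyperparameters quoted in the paper:
# benchmark -> (training instructions N, optimization steps T)
RS_CONFIG = {
    "AlpacaEval 2.0": (10, 384),
    "Arena-Hard-Auto": (8, 96),
    "MT-Bench": (4, 64),
}


def random_search(init_response, instructions, judge_score, steps, seed=0):
    """Greedy random search over a constant response.

    judge_score(instruction, response) is a hypothetical scoring
    callback standing in for the LLM judge; higher is better.
    """
    rng = random.Random(seed)
    best = list(init_response)
    best_score = sum(judge_score(q, "".join(best)) for q in instructions)
    vocab = "abcdefghijklmnopqrstuvwxyz "  # toy mutation alphabet
    for _ in range(steps):
        cand = list(best)
        # mutate a single random position of the candidate response
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        score = sum(judge_score(q, "".join(cand)) for q in instructions)
        if score > best_score:  # accept only improvements
            best, best_score = cand, score
    return "".join(best), best_score
```

For example, running `random_search` with the MT-Bench budget `RS_CONFIG["MT-Bench"] == (4, 64)` would use 4 training instructions and 64 optimization steps; the accepted-improvements-only rule guarantees the final score is never below the initial one.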