Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Nonetheless, we show that even a null model that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks.
Researcher Affiliation Collaboration Xiaosen Zheng 1,2, Tianyu Pang 1, Chao Du 1, Qian Liu 1, Jing Jiang 2,3, Min Lin 1. 1 Sea AI Lab, Singapore; 2 Singapore Management University; 3 Australian National University. EMAIL; EMAIL
Pseudocode Yes Pseudo-code for null models:

class NullModel:
    def __init__(self, const_str):
        # no trainable parameters
        self.output = const_str

    def generate(self, instruct):
        # irrelevant to instructions
        return self.output
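The paper's pseudo-code reduces to a few lines of runnable Python. Below is a self-contained sketch with a placeholder constant string (the paper's actual structured cheating response is not reproduced here):

```python
class NullModel:
    """Null model with no trainable parameters: it returns the same
    constant string regardless of the input instruction."""

    def __init__(self, const_str):
        self.output = const_str  # the fixed (cheating) response

    def generate(self, instruct):
        # the instruction is ignored entirely
        return self.output


# The response never depends on the instruction: both calls below
# return the identical placeholder string.
model = NullModel("<placeholder constant response>")
r1 = model.generate("Write a poem about the sea.")
r2 = model.generate("Explain quicksort.")
```

Because `generate` ignores its argument, a single crafted constant string is submitted for every benchmark instruction, which is what makes the attack transferable to unseen (private) instructions.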
Open Source Code Yes The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
Open Datasets Yes Our experiments utilize the official evaluation templates associated with different LLM-based evaluations unless stated otherwise. We evaluate our cheating method on Alpaca Eval 2.0 (Li et al., 2023c; Dubois et al., 2024), Arena-Hard-Auto (Li et al., 2024b), and MT-Bench (Zheng et al., 2023) as detailed in Table 1.
Dataset Splits Yes For RS, we set the number of training instructions N to 10, 8, and 4, and the number of optimization steps T to 384, 96, and 64 for AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, respectively. The full templates and structured responses for Arena-Hard-Auto and MT-Bench are presented in Figures 10 and 11. The effectiveness of our structured response: as mentioned in Section 3, we employ a structured response to facilitate the cheating, which provides a good initial point and can reduce the optimization cost. To further demonstrate the effectiveness of our structured cheating response, we evaluate log p(winner = NullModel) on a sampled subset of the AlpacaEval 2.0 test instructions using different null responses.
Hardware Specification Yes All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers.
Software Dependencies Yes The targeted auto-annotators include both open-source and closed-source LLMs: Llama-3-8B-Instruct, Llama-3-70B-Instruct (Meta, 2024; Touvron et al., 2023), and GPT-4-1106-Preview (OpenAI, 2023). Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage. All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers.
Experiment Setup Yes Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage. For RS, we set the number of training instructions N to 10, 8, and 4, and the number of optimization steps T to 384, 96, and 64 for AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, respectively.
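The RS (random search) setup quoted above can be sketched as a greedy accept-if-better loop: perturb a candidate constant response, rescore it over the N training instructions, and keep the perturbation only when the total score improves. This is a minimal illustration, not the paper's implementation: `judge_score` is a hypothetical stand-in for the real auto-annotator's log p(winner = NullModel), and the character-level mutation is a simplification of whatever edit operations the actual search uses.

```python
import random

# Per-benchmark RS hyperparameters quoted in the paper:
# benchmark -> (training instructions N, optimization steps T)
RS_CONFIG = {
    "AlpacaEval 2.0": (10, 384),
    "Arena-Hard-Auto": (8, 96),
    "MT-Bench": (4, 64),
}


def random_search(init_response, instructions, judge_score, steps, seed=0):
    """Greedy random search over a constant response.

    judge_score(instruction, response) is a hypothetical scoring
    callback standing in for the LLM judge; higher is better.
    """
    rng = random.Random(seed)
    best = list(init_response)
    best_score = sum(judge_score(q, "".join(best)) for q in instructions)
    vocab = "abcdefghijklmnopqrstuvwxyz "  # toy mutation alphabet
    for _ in range(steps):
        cand = list(best)
        # mutate a single random position of the candidate response
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        score = sum(judge_score(q, "".join(cand)) for q in instructions)
        if score > best_score:  # accept only improvements
            best, best_score = cand, score
    return "".join(best), best_score
```

For example, running `random_search` with the MT-Bench budget `RS_CONFIG["MT-Bench"] == (4, 64)` would use 4 training instructions and 64 optimization steps; the accepted-improvements-only rule guarantees the final score is never below the initial one.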