Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Nonetheless, we show that even a null model that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on Alpaca Eval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of Alpaca Eval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. |
| Researcher Affiliation | Collaboration | Xiaosen Zheng 1,2, Tianyu Pang 1, Chao Du1, Qian Liu1, Jing Jiang 2,3, Min Lin1 1Sea AI Lab, Singapore 2Singapore Management University 3Australian National University EMAIL; EMAIL |
| Pseudocode | Yes | Pseudo-code for Null Models class Null Model(): def __init__(self, const_str): # no trainable parameters self.output = const_str def generate(self, instruct): # irrelevant to instructions return self.output |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks. |
| Open Datasets | Yes | Our experiments utilize the official evaluation templates associated with different LLM-based evaluations unless stated otherwise. We evaluate our cheating method on Alpaca Eval 2.0 (Li et al., 2023c; Dubois et al., 2024), Arena-Hard-Auto (Li et al., 2024b), and MT-Bench (Zheng et al., 2023) as detailed in Table 1. |
| Dataset Splits | Yes | For RS, we set the number of training instructions N as 10, 8, and 4, the number of optimization steps T as 384, 96 and 64 for Alpaca Eval 2.0, Arena-Hard-Auto and MT-Bench, respectively. The full templates and structured responses for Arena-Hard-Auto and MT-Bench are presented in Figures 10 and 11. The effectiveness of our structured response. As mentioned in Section 3, we employ a structured response to facilitate the cheating, which provides a good initial point and could reduce the optimization cost. To further demonstrate the effectiveness of our structured cheating response, we evaluate log p(winner = Null Model) on a sampled subset of the Alpaca Eval 2.0 test instructions using different null responses. |
| Hardware Specification | Yes | All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers. |
| Software Dependencies | Yes | The targeted auto-annotators include both open-source and closed-source LLMs: Llama-3-8B-Instruct, Llama-3-70B-Instruct (Meta, 2024; Touvron et al., 2023), and GPT-4-1106-Preview (Open AI, 2023). Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage.2 All experiments were conducted on 8 NVIDIA A100 (40G) GPUs within a few hours using vLLM as the inference engine, and the tokenization template was sourced from Hugging Face tokenizers. |
| Experiment Setup | Yes | Each LLM uses its default generation configuration with a temperature setting of 0.0. For Llama-3 auto-annotators, we use 4-bit quantized versions to reduce GPU memory usage. For RS, we set the number of training instructions N as 10, 8, and 4, the number of optimization steps T as 384, 96 and 64 for Alpaca Eval 2.0, Arena-Hard-Auto and MT-Bench, respectively. |