LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench... We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis... In this section, first we describe our experimental setup and present full results for 40 LLMs on all 18 tasks of LiveBench. Next, we give an empirical comparison of LiveBench to existing prominent LLM benchmarks, and finally, we present ablation studies.
Researcher Affiliation | Collaboration | Colin White 1, Samuel Dooley 1, Manley Roberts 1, Arka Pal 1, Benjamin Feuer 2, Siddhartha Jain 3, Ravid Shwartz-Ziv 2, Neel Jain 4, Khalid Saifullah 4, Sreemanti Dey 1, Shubh-Agrawal 1, Sandeep Singh Sandha 1, Siddartha Naidu 1, Chinmay Hegde 2, Yann LeCun 2, Tom Goldstein 4, Willie Neiswanger 5, Micah Goldblum 6 — 1 Abacus.AI, 2 NYU, 3 Nvidia, 4 UMD, 5 USC, 6 Columbia
Pseudocode | No | The paper describes the methodology for creating and evaluating the LiveBench benchmark in narrative text and tables. It does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | Yes | We release all questions, code, and model answers. Our codebase is available at https://github.com/livebench/livebench, and our leaderboard is available at https://livebench.ai. Our work is fully reproducible: we open-source the leaderboard, all questions, all code to run API and open-source models, all model outputs for 40 models, and all code to score the models. In other words, every part of the project is available publicly: https://livebench.ai/.
Open Datasets | Yes | LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval... Data Analysis: three tasks using recent datasets from Kaggle and Socrata... Coding: code generation questions from recent Leetcode and AtCoder questions (via LiveCodeBench (Jain et al., 2024))... Instruction Following: four tasks to paraphrase, simplify, summarize, or generate stories about recent news articles from The Guardian (Guardian Media Group, 1821)... Language Comprehension: a typo-fixing task... from recent arXiv abstracts... a movie synopsis unscrambling task for recent movies on IMDb and Wikipedia
Dataset Splits | No | The paper introduces a benchmark for evaluating pre-existing LLMs and describes the questions and tasks within this benchmark. It does not describe training a model by the authors themselves, nor does it provide dataset splits (training/test/validation) for any data used in such a training process within the paper's experiments. The 'experiments' consist of evaluating various models on the LiveBench benchmark tasks, which itself acts as a test set.
Hardware Specification | No | The paper states: "We run all open-source models with bfloat16." This specifies a data type for running models but does not provide details about the specific hardware (e.g., GPU models, CPU types, or memory) used for these operations.
Software Dependencies | Yes | All models run with their respective templates from our updated version of FastChat (Zheng et al., 2024)... We judge the correctness of answers as in the EleutherAI Eval Harness (Gao et al., 2021) using Sympy (Meurer et al., 2017)... We use the popular library Pandas to perform all of our conversions to and from text strings... we instead apply a fuzzy-match using difflib (Team, 2008)
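As a rough illustration of the difflib-based fuzzy matching the paper mentions, a scorer might look like the sketch below. This is only a plausible reading of the quoted sentence: the function name, the normalization, and the 0.9 threshold are assumptions, not details taken from the LiveBench codebase.

```python
import difflib


def fuzzy_match(model_answer: str, reference: str, threshold: float = 0.9) -> bool:
    """Accept a free-form answer when its similarity to the reference,
    as measured by difflib's SequenceMatcher ratio, clears a threshold."""
    ratio = difflib.SequenceMatcher(
        None,
        model_answer.strip().lower(),
        reference.strip().lower(),
    ).ratio()
    return ratio >= threshold
```

Under this sketch, an answer that differs from the reference only in capitalization or surrounding whitespace still passes, while an unrelated string is rejected.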
Experiment Setup | Yes | For all models and tasks, we perform single-turn evaluation with temperature 0, unless otherwise noted in the model card. All models run with their respective templates from our updated version of FastChat (Zheng et al., 2024). We run all open-source models with bfloat16. When running new models, we take care to set up its hyperparameters and chat template as in the model's example code, and we also double check the outputs to make sure that the inference, as well as our automated parsing functions, are working correctly and fairly.
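The decoding settings quoted in this row can be sketched as configuration for a Hugging Face-style `generate()` call. The keyword names below follow transformers conventions and the token budget is an assumption; the paper itself only states temperature 0, single-turn evaluation, and bfloat16.

```python
# Temperature 0 corresponds to deterministic greedy decoding, so the
# usual way to express it is to disable sampling entirely.
GENERATION_KWARGS = {
    "do_sample": False,      # temperature 0 => always pick the argmax token
    "max_new_tokens": 2048,  # assumed budget; the paper does not state one
}

# Open-source models are loaded with bfloat16 weights.
MODEL_KWARGS = {
    "torch_dtype": "bfloat16",
}
```

With `do_sample=False`, repeated runs on the same prompt produce identical outputs, which is what makes the temperature-0 setup reproducible across evaluations.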