Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Authors: David Salinas, Omar Swelam, Frank Hutter
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which allows us to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at this repository: https://github.com/geoalgo/judgetuning |
| Researcher Affiliation | Academia | 1University of Freiburg 2ELLIS Institute Tübingen. Correspondence to: David Salinas <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes like multi-fidelity multi-objective optimization and non-dominated sort, but it does not present these as structured pseudocode or algorithm blocks. It explains the concepts in prose. |
| Open Source Code | Yes | The code to reproduce our experiments is available at this repository: https://github.com/geoalgo/judgetuning |
| Open Datasets | Yes | We consider two datasets that contain prompts, model completions and judge annotations for a grid of prompts and model pairs. The first one is Alpaca-Eval, which contains completions from 47 models on 805 prompts (Li et al., 2023). The second one is Arena-Hard, which contains the completions on 500 instructions for 57 models (Li et al., 2024). In both cases, we select the 26 models that also appear in Chatbot Arena in order to be able to compute how well judge configurations approximate human judgement. We use the LMSys dataset (Chiang et al., 2024), which contains 51734 battles and allows us to measure the human-agreement of a given judge configuration. |
| Dataset Splits | Yes | This gives 6548 instructions, which we split randomly into 3548 validation instructions and 3000 test instructions. All model selection (e.g. non-dominated sort) is done only using validation instructions, and only the best models on the validation set are evaluated on the test set. |
| Hardware Specification | Yes | To run inference with open models, we host the models locally using VLLM on L40 GPUs for models up to 32B parameters and on H100 GPUs for models with more than 70B parameters. Given that we use two clusters with two different job queues, we favored using synchronous successive halving rather than an asynchronous approach such as (Schmucker et al., 2021). We submitted all the 4480 configurations of the first fidelity with 400 instructions, then applied non-dominated sort and submitted the top 1200 configurations with 1200 instructions before finally submitting the top 400 configurations with 3548 instructions. |
| Software Dependencies | No | The paper mentions using VLLM for hosting models, but it does not specify a version number for VLLM or any other key software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | For the LLM model, we search among 7 open-weight options: Llama3.1 (8B and 70B), Qwen2.5 (7B, 27B and 70B) and Gemma2 (9B, 27B). All models with more than 9B parameters are quantized with half-precision. We also search over the LLM temperature in [0.0, 0.01, 0.1, 1.0] and whether to average predictions over the two possible completion orders or to use just a single order. We now describe how we parametrize different prompt options and we illustrate one such option in Fig. 3. Output format. When prompting an LLM judge, we must be able to parse its output into a preference. We consider the following options, where the judge outputs: best-model-identifier: a letter corresponding to the best assistant as in Li et al. (2023), likert: a Likert scale ..., pair: a score for both assistants in [0-10] ..., preference: a score in [0, 1] ..., multi: the average score for 5 criteria .... Provide answer or other information. ... confidence: its confidence on its preference, answer: its own answer to the instruction ..., explanation: its explanation on the given preference ... |
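The dataset-splits row describes a plain random split of 6548 instructions into 3548 for validation and 3000 for test. A minimal sketch of such a split follows; the function name and fixed seed are illustrative assumptions, not the authors' code.

```python
import random

def split_instructions(instructions, n_valid=3548, seed=0):
    """Randomly split instructions into validation and test sets.

    Mirrors the paper's split of 6548 instructions into 3548 validation
    and 3000 test instructions. Seed and signature are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(instructions)
    rng.shuffle(shuffled)
    return shuffled[:n_valid], shuffled[n_valid:]

valid_ids, test_ids = split_instructions(range(6548))
```

Fixing the seed makes the split reproducible across runs, which matters here because all model selection is performed on the validation half only.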
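The hardware row outlines the search procedure: synchronous successive halving where, at each fidelity, survivors are chosen by non-dominated sort over accuracy and cost (4480 configs at 400 instructions, top 1200 at 1200, top 400 at 3548). The sketch below shows one way to implement that loop, assuming a simple front-peeling selection; the exact selection details in the paper may differ.

```python
def dominates(a, b):
    """True if score a Pareto-dominates b; scores are (accuracy, cost),
    with higher accuracy and lower cost preferred."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def top_k_non_dominated(scored, k):
    """Keep k entries by peeling successive Pareto fronts.
    `scored` is a list of (config, (accuracy, cost)) pairs."""
    remaining = list(scored)
    selected = []
    while remaining and len(selected) < k:
        # Current front: entries not dominated by any other remaining entry.
        front = [x for x in remaining
                 if not any(dominates(y[1], x[1]) for y in remaining if y is not x)]
        selected.extend(front)
        remaining = [x for x in remaining if x not in front]
    return selected[:k]

def successive_halving(configs, evaluate, schedule):
    """Synchronous successive halving over increasing fidelities.
    `schedule` is a list of (n_instructions, keep_k) stages;
    `evaluate(config, n)` returns (accuracy, cost) on n instructions."""
    survivors = list(configs)
    for n_instructions, keep_k in schedule:
        scored = [(c, evaluate(c, n_instructions)) for c in survivors]
        survivors = [c for c, _ in top_k_non_dominated(scored, keep_k)]
    return survivors

# Stage sizes from the paper; `evaluate` would run the actual judge.
PAPER_SCHEDULE = [(400, 1200), (1200, 400), (3548, 400)]
```

Synchronous halving (all configs of a stage finish before the next stage starts) suits the two-cluster setup the paper describes, since no cross-queue coordination is needed mid-stage.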
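The experiment-setup row lists the judge hyperparameters: model, temperature, order averaging, output format, and extra prompt fields. A hedged sketch of that grid as a cross-product follows; the dictionary keys and model identifiers are paraphrases (not the authors' exact names), and the prompt options shown are not exhaustive, so this toy grid yields fewer combinations than the 4480 configurations the paper reports.

```python
import itertools
from math import prod

# Illustrative encoding of the search space; names are assumptions.
SEARCH_SPACE = {
    "model": ["llama3.1-8b", "llama3.1-70b", "qwen2.5-7b", "qwen2.5-27b",
              "qwen2.5-70b", "gemma2-9b", "gemma2-27b"],
    "temperature": [0.0, 0.01, 0.1, 1.0],
    "average_both_orders": [True, False],
    "output_format": ["best-model-identifier", "likert", "pair",
                      "preference", "multi"],
    "provide_confidence": [True, False],
    "provide_answer": [True, False],
    "provide_explanation": [True, False],
}

def all_configs(space):
    """Enumerate every configuration in the cross-product of the grid."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

n_configs = prod(len(v) for v in SEARCH_SPACE.values())
```

Enumerating the full product up front is what makes the synchronous first fidelity possible: every configuration is cheap to generate, and only evaluation (judging instructions) costs compute.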