AutoEval Done Right: Using Synthetic Data for Model Evaluation

Authors: Pierre Boyeau, Anastasios Nikolas Angelopoulos, Tianle Li, Nir Yosef, Jitendra Malik, Michael I. Jordan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied the described methodology to evaluate computer vision models. We considered five trained computer vision models (ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152) optimized over the training set of ImageNet and sourced from PyTorch (Paszke et al., 2019). We considered the task of estimating their accuracy on the validation set of ImageNet in a low-data regime, using a subset of labeled data points. The mean-squared error of our estimates of the model accuracies improved over the classical baseline (Figure 1a). Both PPI and PPI++ had lower mean-squared errors than the baseline, no matter the size of the labeled set. We also used AutoEval to rank regression models, and more specifically, protein fitness prediction models. We observed that the BT coefficients were better estimated by PPI++ than by the classical approach, hinting that the point estimates of AutoEval are more accurate (Figure 4a).
Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA; 2Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel; 3Inria, École Normale Supérieure, Paris, France.
Pseudocode | Yes | A. Code snippets: This section provides code snippets to produce confidence intervals and point estimates for model accuracy and pairwise comparisons with the existing Python package ppi_py (Angelopoulos et al., 2023a). Snippet 1: Python code to produce CIs and point estimates for model accuracy. The variable meanings are explained in the code comments. Snippet 2: Python code to produce CIs for the Bradley-Terry coefficients (without multiplicity correction). The variable meanings are explained in the code comments.
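The paper's snippets themselves are not reproduced in this report. As an illustration, the prediction-powered mean estimate behind them can be sketched in a few lines of pure Python: estimate the mean of the model's predictions on the large unlabeled set, then correct it with the average prediction error ("rectifier") measured on the small labeled set. This is a from-scratch sketch mirroring the interface of `ppi_py`'s mean estimator, not the package itself; the toy data below are hypothetical.

```python
import math
import statistics

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.1):
    """Prediction-powered point estimate and normal-approximation CI for a mean
    (e.g. model accuracy from 0/1 correctness indicators).

    y_labeled:      true values on the small labeled set
    yhat_labeled:   predictions on the same labeled points
    yhat_unlabeled: predictions on the large unlabeled set
    """
    n, N = len(y_labeled), len(yhat_unlabeled)
    # Rectifier: average bias of the predictions, measured where labels exist
    rectifier = [y - f for y, f in zip(y_labeled, yhat_labeled)]
    theta = statistics.fmean(yhat_unlabeled) + statistics.fmean(rectifier)
    # Combine sampling noise from both the labeled and unlabeled parts
    se = math.sqrt(statistics.pvariance(rectifier) / n
                   + statistics.pvariance(yhat_unlabeled) / N)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return theta, (theta - z * se, theta + z * se)

# Toy usage with hypothetical correctness indicators
y = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]      # ground-truth correctness, 10 labeled points
yhat = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # predicted correctness on the same points
yhat_unl = [1] * 70 + [0] * 30          # predicted correctness, 100 unlabeled points
theta, (lo, hi) = ppi_mean_ci(y, yhat, yhat_unl)  # theta = 0.7 + 0.1 = 0.8
```

The rectifier term is what distinguishes PPI from naively trusting the unlabeled predictions: if the predictor is systematically optimistic, the labeled set pulls the estimate back.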
Open Source Code | Yes | All code used to reproduce this work is available as supplementary materials on OpenReview. We refer the reader to Supplement C for details on the experimental setup. Tools to apply the described methodology for model evaluation are available as a Python package at https://github.com/aangelopoulos/ppi_py. Code to reproduce the experiments is available at https://github.com/PierreBoyeau/autoeval.
Open Datasets | Yes | We considered the task of estimating their accuracy on the validation set of ImageNet in a low-data regime, using a subset of labeled data points. We applied AutoEval on ProteinGym (Notin et al., 2023), which gathers several assays containing both experimental fitness measurements, used as ground-truth labels, and predicted fitness scores from various fitness predictive models. We evaluated our approach on the Chatbot Arena project (Chiang et al., 2024). We first extracted 16K observations from the Chatbot Arena dataset, in which a total of 20 recent LLMs were compared (Table S1).
Dataset Splits | Yes | To reflect a low-data regime, we randomly sampled a small number n of observations to be used as labeled data points available for these approaches. The rest of the observations in the validation data were used as unlabeled data points for PPI and PPI++. The obtained metrics are averaged across 250 random splits of the validation data into labeled and unlabeled data. We first extracted 16K observations from the Chatbot Arena dataset, in which a total of 20 recent LLMs were compared (Table S1). We focused on scenarios where only a few of the 16K human preferences were available.
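The split-and-average protocol described above is straightforward to sketch: sample n points as the labeled set, treat the remainder as unlabeled, and average an estimator over repeated random splits. The function names below are illustrative, not from the paper's code.

```python
import random
import statistics

def split_labeled_unlabeled(records, n_labeled, seed):
    """One random split into a small labeled set (size n_labeled) and the rest."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:n_labeled], shuffled[n_labeled:]

def averaged_estimate(records, n_labeled, estimator, n_splits=250):
    """Average an estimator over n_splits random splits, as in the paper's protocol."""
    estimates = [estimator(*split_labeled_unlabeled(records, n_labeled, seed))
                 for seed in range(n_splits)]
    return statistics.fmean(estimates)

# Toy usage: the labeled fraction of a 100-record dataset with n = 10
records = list(range(100))
frac = averaged_estimate(records, 10,
                         lambda lab, unl: len(lab) / (len(lab) + len(unl)))
```

Seeding each split with its index keeps the 250 splits reproducible while still varying which points land in the labeled set.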
Hardware Specification | Yes | All AutoEval experiments were run on a workstation with a 12th-generation Intel(R) Core(TM) i9-12900KF, 128GB of RAM, and on a compute cluster relying on CPU nodes with four cores. This experiment was run on a workstation with an Nvidia RTX 3090 GPU, 128GB RAM, and an i9-12900KF CPU.
Software Dependencies | No | The paper mentions using the Python package ppi_py and JAX but does not specify their version numbers. Therefore, it does not provide a reproducible description of ancillary software with specific version numbers.
Experiment Setup | Yes | In all experiments, we randomly split the data into labeled and unlabeled sets 250 times, and computed all point estimates in the main text and in this supplementary material as the average estimate over these splits. To rank models with the different estimators, we computed 90% confidence intervals for the different approaches after Bonferroni correction. Models with overlapping confidence intervals were assigned the same rank.
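The overlapping-interval ranking rule stated above can be sketched as follows. One natural tie-handling convention (an assumption here, since the paper's report does not spell it out) is to rank a model below another only when the other interval lies strictly above it, so models with overlapping CIs share a rank.

```python
def assign_ranks(cis):
    """Assign ranks from simultaneous confidence intervals.

    cis: dict mapping model name -> (lower, upper), e.g. 90% CIs where each
    interval is computed at level alpha / m after Bonferroni correction over
    m comparisons. A model is ranked below another only when the other's lower
    bound exceeds its upper bound; overlapping intervals share a rank.
    """
    return {name: 1 + sum(1 for other, (olo, _) in cis.items()
                          if other != name and olo > hi)
            for name, (lo, hi) in cis.items()}

# Toy usage with hypothetical intervals: A is separated, B and C overlap
cis = {"A": (0.80, 0.90), "B": (0.60, 0.79), "C": (0.55, 0.70)}
ranks = assign_ranks(cis)
```

With these intervals, A gets rank 1 while B and C both get rank 2, since only A's lower bound clears their upper bounds.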