QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions

Authors: Zhun Deng, Thomas P Zollo, Benjamin Eyre, Amogh Inamdar, David Madras, Richard Zemel

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now evaluate the empirical performance of QuEst across two categories of tasks: (1) research settings where expensive experimental data is combined with predictions from an ML model, and (2) LLM auto-evaluation settings where a large, more expensive LLM supplies a small number of high-quality labels (treated as observed), while a cheaper model provides predictions (treated as imputed) for the majority of data. In each experiment, we have both observed and imputed labels for all instances, allowing us to gauge the estimation error and interval coverage of different methods with respect to the true quantity.
Researcher Affiliation | Collaboration | ¹UNC at Chapel Hill, ²Columbia University, ³Google DeepMind. Correspondence to: Zhun Deng <EMAIL>.
Pseudocode | No | The paper describes the QuEst framework and its methods using mathematical formulations, theorems, and derivations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | No | Our code will be made public upon release of this paper.
Open Datasets | Yes | We perform our initial experiment using data from 3 publicly available datasets. PovertyMap is a dataset containing satellite imagery and socioeconomic data, used for estimating poverty metrics and wealth distribution across regions, typically in developing nations (Yeh et al., 2020; Koh et al., 2021). In Gene Expression, the goal is to predict the level of gene expression caused by some regulatory DNA (Vaishnav et al., 2022; Angelopoulos et al., 2023). Using the OpinionQA dataset... For our adversarial prompts, we use 20,000 prompts randomly sampled from the red-teaming split of the Anthropic RLHF dataset (Ganguli et al., 2022). We derive news articles from the XSum dataset (Narayan et al., 2018).
Dataset Splits | Yes | We run 2000 trials each with varying numbers of randomly sampled observed data (100, 200, 500, 1000) combined with 2000 randomly sampled imputed data... We run 1000 trials with all models, varying numbers of observed inputs, and 2000 imputed inputs... We examine a low label setting where only 100 observed examples and 10,000 imputed examples are available.
Hardware Specification | No | All experiments using Llama or other open-source models were run at Columbia University.
Software Dependencies | No | Our 8 candidate LLMs used to produce responses are taken from the Hugging Face model repository: Phi-3-mini-4k-instruct, Mistral-7B-Instruct-v0.2, Mistral-7B-Instruct-v0.3, Llama-2-7b-chat-hf, Meta-Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, gemma-2-9b-it, Qwen2-7B-Instruct. This lists models, which are software, but not the ancillary software dependencies (e.g., Python, PyTorch, the Hugging Face Transformers library) with versions that would be needed to run these models and replicate the experiments.
Experiment Setup | Yes | Responses are generated with a temperature of 0.75, using the system prompt: You are a helpful assistant. Answer the question as fully as possible... We score the outputs from each model for toxicity on a scale from 0 (least toxic) to 1 (most toxic)... We use 20 in-context examples using other question/answer pairs from the same person... Our prompt template for the relevance metric is presented below, with the templates for other metrics differing in their descriptions of the metric and evaluation steps.
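The observed/imputed protocol quoted above (a few expensive labels plus many cheap model predictions, resampled over many trials) can be sketched as follows. This is a minimal illustration of the prediction-powered quantile idea (Angelopoulos et al., 2023) that QuEst builds on, not the paper's exact estimator; the toy population, the upward prediction bias of 0.3, and the function name `pp_quantile` are all assumptions for illustration.

```python
import random

def pp_quantile(q, y_obs, f_obs, f_imp):
    """Quantile estimate combining many imputed predictions with a
    bias correction ("rectifier") learned from the few observed labels.
    Sketch of the general prediction-powered idea, not QuEst itself."""
    grid = sorted(set(y_obs) | set(f_obs) | set(f_imp))
    def cdf(t):
        imputed = sum(v <= t for v in f_imp) / len(f_imp)
        # rectifier: mean gap between label CDF and prediction CDF on observed pairs
        rectifier = sum((y <= t) - (f <= t) for y, f in zip(y_obs, f_obs)) / len(y_obs)
        return imputed + rectifier
    # invert the rectified CDF at level q
    for t in grid:
        if cdf(t) >= q:
            return t
    return grid[-1]

# toy population: model predictions are biased upward by 0.3
random.seed(0)
pop_y = [random.gauss(0.0, 1.0) for _ in range(20000)]
pop_f = [y + 0.3 for y in pop_y]
true_median = sorted(pop_y)[len(pop_y) // 2]

# one trial mirroring the paper's protocol: 100 observed, 2000 imputed
idx = random.sample(range(len(pop_y)), 2100)
obs, imp = idx[:100], idx[100:]
est = pp_quantile(0.5,
                  [pop_y[i] for i in obs],   # observed labels
                  [pop_f[i] for i in obs],   # predictions on observed points
                  [pop_f[i] for i in imp])   # predictions on imputed points
naive = sorted(pop_f[i] for i in imp)[1000]  # median of imputed predictions alone
```

Repeating this trial 2000 times with fresh samples, as the paper does, yields the estimation-error and interval-coverage statistics the table refers to; in this toy setup the rectified estimate tracks the true median, while the naive imputed-only quantile inherits the prediction bias.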