Consistency Checks for Language Model Forecasters

Authors: Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground-truth Brier scores (which are only known in the future). We evaluate a range of forecasters on the datasets described above, for both consistency and ground-truth Brier score.
Researcher Affiliation | Academia | Daniel Paleka (ETH Zurich); Abhimanyu Pallavi Sudhir* (University of Warwick); Alejandro Alvarez (Independent); Vineeth Bhat (IIIT Hyderabad); Adam Shen (Columbia University); Evan Wang (Cornell University); Florian Tramèr (ETH Zurich)
Pseudocode | Yes | Algorithm 1: Arbitrage Forecaster algorithm F^C
Open Source Code | Yes | We release the full code and the datasets used in the paper. Code: https://github.com/dpaleka/consistency-forecasting
Open Datasets | Yes | We release the full code and the datasets used in the paper. Datasets: https://huggingface.co/datasets/dpaleka/ccflmf
Dataset Splits | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints.
Hardware Specification | No | The paper does not describe the specific hardware used to run its experiments (GPU models, CPU models, or cloud instance types). It discusses LLMs but not the underlying hardware they run on.
Software Dependencies | Yes | We use the Instructor library (Liu, 2024) to make the output conform to a specific Pydantic model that has a prob field forced to be a float between 0 and 1. ... Liu (2024). Instructor: Structured LLM Outputs, May 2024. URL https://github.com/jxnl/instructor. Version 1.4.1.
Experiment Setup | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints. ... "You are an informed and well-calibrated forecaster. I need you to give me your best probability estimate for the following sentence or question resolving YES. Your answer should be a float between 0 and 1, with nothing else in your response. Question: {question}"
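The evaluation pipeline described above (instantiate consistency checks from base questions, elicit predictions, measure consistency) can be illustrated with a minimal negation check: a coherent forecaster's probabilities for a question Q and its negation should sum to 1, and the deviation can be measured immediately, before the question resolves. The violation metric below is an illustrative assumption; the paper's actual metrics are arbitrage-based.

```python
# Sketch of one instantiated consistency check (negation).
# The violation metric here is an assumption for illustration,
# not the paper's exact arbitrage-based metric.

def negation_violation(p_q: float, p_not_q: float) -> float:
    """Deviation of P(Q) + P(not Q) from 1; 0 means perfectly consistent."""
    return abs(p_q + p_not_q - 1.0)
```

Note that this score requires no ground-truth resolution, which is what makes consistency an "instantaneous" proxy for the (future) Brier score.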
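The Software Dependencies entry describes forcing the LLM's output (via Instructor) into a Pydantic model whose prob field must be a float between 0 and 1. A self-contained stand-in for that constraint, using a plain dataclass instead of Pydantic so it runs without third-party packages, might look like:

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    """Stand-in for the paper's Pydantic model with a constrained prob field."""
    prob: float

    def __post_init__(self) -> None:
        # Mirrors a Pydantic Field(ge=0, le=1) constraint on prob.
        if not 0.0 <= self.prob <= 1.0:
            raise ValueError(f"prob must be in [0, 1], got {self.prob}")
```

In the paper's actual setup, Instructor handles retrying the LLM call until the output validates against the Pydantic schema; the dataclass above only reproduces the range check.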
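The elicitation prompt quoted in the Experiment Setup entry expects a bare float as the reply. A sketch of formatting that prompt for a question and parsing the reply follows; the prompt template is quoted from the paper, while the parsing-and-clamping helper is an assumption for illustration.

```python
# Prompt template quoted verbatim from the paper's experiment setup.
ELICITATION_PROMPT = (
    "You are an informed and well-calibrated forecaster. I need you to "
    "give me your best probability estimate for the following sentence "
    "or question resolving YES. Your answer should be a float between "
    "0 and 1, with nothing else in your response. Question: {question}"
)

def parse_probability(raw_reply: str) -> float:
    """Parse the model's raw reply and clamp it into [0, 1] (assumed behavior)."""
    p = float(raw_reply.strip())
    return min(max(p, 0.0), 1.0)
```

Usage: `ELICITATION_PROMPT.format(question="Will X happen by 2026?")` produces the full prompt, and `parse_probability` turns a compliant reply such as `"0.73"` into a probability.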