Consistency Checks for Language Model Forecasters

Authors: Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground-truth Brier scores (which are only known in the future). We evaluate a range of forecasters on the datasets described above, for both consistency and ground-truth Brier score.
Researcher Affiliation | Academia | Daniel Paleka (ETH Zurich); Abhimanyu Pallavi Sudhir* (University of Warwick); Alejandro Alvarez (Independent); Vineeth Bhat (IIIT Hyderabad); Adam Shen (Columbia University); Evan Wang (Cornell University); Florian Tramèr (ETH Zurich)
Pseudocode | Yes | Algorithm 1: Arbitrage Forecaster algorithm F^C
Open Source Code | Yes | We release the full code and the datasets used in the paper. Code: https://github.com/dpaleka/consistency-forecasting
Open Datasets | Yes | We release the full code and the datasets used in the paper. Datasets: https://huggingface.co/datasets/dpaleka/ccflmf
Dataset Splits | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints.
Hardware Specification | No | The paper does not describe the specific hardware used to run its experiments (GPU models, CPU models, or cloud instance types). It discusses LLMs but not the underlying hardware they run on.
Software Dependencies | Yes | We use the Instructor library (Liu, 2024) to make the output conform to a specific Pydantic model that has a prob field forced to be a float between 0 and 1. ... Liu (2024). Instructor: Structured LLM Outputs, May 2024. URL https://github.com/jxnl/instructor. Version 1.4.1.
Experiment Setup | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints. ... "You are an informed and well-calibrated forecaster. I need you to give me your best probability estimate for the following sentence or question resolving YES. Your answer should be a float between 0 and 1, with nothing else in your response. Question: {question}"
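The evaluation pipeline described above (instantiate consistency checks from base questions, elicit predictions, measure consistency) can be illustrated with a minimal negation check: a coherent forecaster's probabilities for a question Q and its negation should sum to 1, and the deviation can be measured immediately, before the question resolves. The violation metric below is an illustrative assumption; the paper's actual metrics are arbitrage-based.

```python
# Sketch of one instantiated consistency check (negation).
# The violation metric here is an assumption for illustration,
# not the paper's exact arbitrage-based metric.

def negation_violation(p_q: float, p_not_q: float) -> float:
    """Deviation of P(Q) + P(not Q) from 1; 0 means perfectly consistent."""
    return abs(p_q + p_not_q - 1.0)
```

Note that this score requires no ground-truth resolution, which is what makes consistency an "instantaneous" proxy for the (future) Brier score.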
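The Software Dependencies entry describes forcing the LLM's output (via Instructor) into a Pydantic model whose prob field must be a float between 0 and 1. A self-contained stand-in for that constraint, using a plain dataclass instead of Pydantic so it runs without third-party packages, might look like:

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    """Stand-in for the paper's Pydantic model with a constrained prob field."""
    prob: float

    def __post_init__(self) -> None:
        # Mirrors a Pydantic Field(ge=0, le=1) constraint on prob.
        if not 0.0 <= self.prob <= 1.0:
            raise ValueError(f"prob must be in [0, 1], got {self.prob}")
```

In the paper's actual setup, Instructor handles retrying the LLM call until the output validates against the Pydantic schema; the dataclass above only reproduces the range check.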
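The elicitation prompt quoted in the Experiment Setup entry expects a bare float as the reply. A sketch of formatting that prompt for a question and parsing the reply follows; the prompt template is quoted from the paper, while the parsing-and-clamping helper is an assumption for illustration.

```python
# Prompt template quoted verbatim from the paper's experiment setup.
ELICITATION_PROMPT = (
    "You are an informed and well-calibrated forecaster. I need you to "
    "give me your best probability estimate for the following sentence "
    "or question resolving YES. Your answer should be a float between "
    "0 and 1, with nothing else in your response. Question: {question}"
)

def parse_probability(raw_reply: str) -> float:
    """Parse the model's raw reply and clamp it into [0, 1] (assumed behavior)."""
    p = float(raw_reply.strip())
    return min(max(p, 0.0), 1.0)
```

Usage: `ELICITATION_PROMPT.format(question="Will X happen by 2026?")` produces the full prompt, and `parse_probability` turns a compliant reply such as `"0.73"` into a probability.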