Consistency Checks for Language Model Forecasters
Authors: Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramer
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the forecaster's predictions, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground-truth Brier scores (which are only known in the future). We evaluate a range of forecasters on the datasets described above, for both consistency and ground-truth Brier score. |
| Researcher Affiliation | Academia | Daniel Paleka, ETH Zurich; Abhimanyu Pallavi Sudhir, University of Warwick; Alejandro Alvarez, Independent; Vineeth Bhat, IIIT Hyderabad; Adam Shen, Columbia University; Evan Wang, Cornell University; Florian Tramer, ETH Zurich |
| Pseudocode | Yes | Algorithm 1 Arbitrage Forecaster algorithm: F^C |
| Open Source Code | Yes | We release the full code and the datasets used in the paper. Code: https://github.com/dpaleka/consistency-forecasting |
| Open Datasets | Yes | We release the full code and the datasets used in the paper. Datasets: https://huggingface.co/datasets/dpaleka/ccflmf |
| Dataset Splits | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU models, or detailed cloud instance types. It mentions LLMs but not the underlying hardware for their operation. |
| Software Dependencies | Yes | We use the Instructor library Liu (2024) to make the output conform to a specific Pydantic model that has a prob field forced to be a float between 0 and 1. ... Liu (2024). Instructor: Structured LLM Outputs, May 2024. URL https://github.com/jxnl/instructor. Version 1.4.1. |
| Experiment Setup | Yes | We run each of these forecasters on 5000 tuples in total (for each of the 10 checks, we use 200 tuples from scraped questions and 300 from News API questions), except for o1-preview, which we test on 50 tuples per check only due to cost constraints. ... You are an informed and well-calibrated forecaster. I need you to give me your best probability estimate for the following sentence or question resolving YES. Your answer should be a float between 0 and 1, with nothing else in your response. Question: {question} |
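The consistency checks described under Research Type can be illustrated with the simplest possible case, a negation check: a coherent forecaster's probabilities for a question and its negation should sum to 1. The helper below is a hypothetical sketch of such a violation metric, not the paper's exact scoring formula:

```python
def negation_violation(p_q: float, p_not_q: float) -> float:
    """Absolute deviation from the coherence constraint P(Q) + P(not Q) = 1.

    Illustrative sketch only; the paper defines a family of checks
    (negation being the simplest) and its own aggregate metrics.
    """
    return abs(p_q + p_not_q - 1.0)

# A perfectly consistent forecaster incurs zero violation.
assert negation_violation(0.75, 0.25) == 0.0
```

A forecaster can score well on one-off questions yet still be exploitable; checks like this one probe coherence across related questions without needing to wait for ground-truth resolution.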
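The elicitation pipeline described under Software Dependencies and Experiment Setup (a Pydantic model with a `prob` field constrained to [0, 1], filled from the quoted prompt) can be sketched as follows. The class and helper names are illustrative; in the paper's code the constraint is enforced via the Instructor library rather than by parsing the raw response manually:

```python
from pydantic import BaseModel, Field

class Forecast(BaseModel):
    # Single field forced to be a float in [0, 1], per the paper's description.
    prob: float = Field(ge=0.0, le=1.0)

# Prompt template quoted from the paper's Experiment Setup.
PROMPT_TEMPLATE = (
    "You are an informed and well-calibrated forecaster. I need you to give me "
    "your best probability estimate for the following sentence or question "
    "resolving YES. Your answer should be a float between 0 and 1, with "
    "nothing else in your response.\n"
    "Question: {question}"
)

def parse_forecast(raw: str) -> Forecast:
    # Hypothetical helper: validate a raw model response against the schema.
    # Pydantic raises a ValidationError if the value lies outside [0, 1].
    return Forecast(prob=float(raw.strip()))
```

Validating into a constrained schema means out-of-range or malformed responses fail loudly at elicitation time instead of silently corrupting the downstream consistency metrics.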