ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Authors: Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our initial results indicate that state-of-the-art models, such as Claude-3.5 Sonnet and GPT-4 Turbo, perform only roughly as well as a simple median of forecasts from a survey of humans with no (or minimal) forecasting experience... In Table 2, we show that superforecasters achieve an overall mean Brier score of 0.096, significantly outperforming both the general public (Brier = 0.121, p < 0.001) and the top LLM performer on the 200-item subset (Claude 3.5 Sonnet: Brier = 0.122, p < 0.001).
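The comparison above rests on the Brier score, the mean squared error between probabilistic forecasts and binary outcomes (lower is better). A minimal sketch of the metric, written for illustration and not taken from the ForecastBench codebase:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and outcomes.

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1.
    0 is a perfect score; always forecasting 0.5 scores 0.25.
    """
    assert len(forecasts) == len(outcomes) and forecasts
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# An unconditional 0.5 forecast scores 0.25 regardless of how questions resolve.
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
```

On this scale, the gap between superforecasters (0.096) and the best LLM (0.122) is substantial: it is roughly a quarter of the distance from the LLM to a coin-flip forecaster.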
Researcher Affiliation Collaboration Ezra Karger, Forecasting Research Institute & Federal Reserve Bank of Chicago, EMAIL; Houtan Bastani, Forecasting Research Institute, EMAIL; Chen Yueh-Han, New York University, EMAIL; Zachary Jacobs, Forecasting Research Institute, EMAIL; Danny Halawi, University of California, Berkeley, EMAIL; Fred Zhang, University of California, Berkeley, EMAIL; Philip E. Tetlock, Forecasting Research Institute & University of Pennsylvania, EMAIL
Pseudocode Yes Figure 4: Zero-shot Prompt from Halawi et al. (2024). Figure 5: Scratchpad Prompt modified from Halawi et al. (2024). Figure 6: Superforecaster prompt 1. Figure 7: Superforecaster prompt 2. Figure 8: Superforecaster prompt 3. Figure 9: Combination prompt that includes information about both market and non-market questions. The instructions are truncated and can be supplemented with any of the prompts shown above. Figure 10: Question validation prompt.
Open Source Code Yes One reason we've open-sourced our code (link in Appendix A) is to allow for independent verification of our results. See Appendix I for reproducing the human forecast sets, Appendix J for reproducing the LLM forecast sets, and Appendix K for resolving the forecasts and creating the leaderboard. Codebase The code underlying our automated system runs on Google Cloud Platform and is available at github.com/forecastingresearch/forecastbench under the MIT license.
Open Datasets Yes As a key output of ForecastBench, we generate four datasets that grow over time. Our datasets, distributed under the CC BY-SA 4.0 license, are available at www.forecastbench.org/datasets.html. Historical updates to the resolution datasets and leaderboards are available at github.com/forecastingresearch/forecastbench-datasets. Bi-weekly question sets are also released via this repository. The repository is mirrored to Hugging Face and available at huggingface.co/datasets/forecastingresearch/forecastbench-datasets.
Dataset Splits Yes LLM question set We release a set of 1,000 forecast questions for LLMs every other Sunday at midnight UTC. We sample an equal number of questions from each source to ensure representativeness. Within each source, we then uniformly sample questions across all question categories, aiming for an equal distribution from each category within each source. This ensures that models cannot be overfit to a specific type of question or topic. Human question set The human question set comprises 200 forecast questions sampled directly from the LLM question set.
Hardware Specification No The paper states that the automated system runs on Google Cloud Platform but does not specify any particular hardware components like GPUs, CPUs, or specific cloud instance types.
Software Dependencies Yes We evaluate 17 LLMs on our initial benchmark: GPT-3.5-Turbo-Instruct (Brown et al., 2020), GPT-4 (OpenAI, 2023), GPT-4o, Llama-2-70B (Touvron et al., 2023), Llama-3-8B, Llama-3-70B, Mistral-7B (Jiang et al., 2023), Mixtral-8x7B (Jiang et al., 2024a), Mixtral-8x22B, Mistral-Large, Qwen1.5-110B-Chat (Bai et al., 2023), Claude-2.1 (Anthropic, 2023), Claude-3-Haiku, Claude-3.5-Sonnet, Claude-3-Opus (Anthropic, 2024), Gemini 1.5 Flash, and Gemini 1.5 Pro (Gemini Team, 2023). SEARCH_QUERY_MODEL_NAME: The name of the model used to generate search queries. We use gpt-4-1106-preview. SUMMARIZATION_MODEL_NAME: The name of the model used for summarizing articles. We use gpt-3.5-turbo-1106. RANKING_MODEL_NAME: The name of the model used for ranking articles. We use gpt-3.5-turbo-1106.
Experiment Setup Yes LLM Parameters. We set the temperature to 0 and the max output token length to 2000. For the zero-shot setting, we set the maximum output token length to 50 since we only request probabilistic forecasts. For the scratchpad prompt, we increase the maximum output token length to 1300 as it requires reasoning in addition to probabilistic forecasts. We initially considered a limit of 3000 tokens, but after observing that the longest response was around 1250 tokens, we settled on 1300 as the maximum token length. In both cases, the model temperature is set to 0 to ensure stable outputs. Information Retrieval Hyperparameters. The hyperparameters were selected following the results in Section E.1 of Halawi et al. (2024), in which they used a greedy search approach to identify the optimal hyperparameters. We display the hyperparameters below: NUM_SEARCH_QUERY_KEYWORDS: The number of keywords used in the search query. For our system, this is set to 6. MAX_WORDS_NEWSCATCHER: The maximum number of words allowed in search queries for the News Catcher API. This is set to 5.
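The per-prompt-type parameters described above can be collected into a small configuration sketch. Constant and function names here are illustrative, not taken from the ForecastBench codebase; only the numeric values (temperature 0, 50 vs. 1300 max tokens, 6 keywords, 5 words) come from the evidence quoted in this row.

```python
# Sampling parameters per prompt type, as reported in the paper.
TEMPERATURE = 0               # deterministic(-ish) outputs in both settings
ZERO_SHOT_MAX_TOKENS = 50     # bare probabilistic forecast only
SCRATCHPAD_MAX_TOKENS = 1300  # reasoning plus a probabilistic forecast

# Information-retrieval hyperparameters, following Halawi et al. (2024), E.1.
SEARCH_HYPERPARAMS = {
    "NUM_SEARCH_QUERY_KEYWORDS": 6,  # keywords per search query
    "MAX_WORDS_NEWSCATCHER": 5,      # query word cap for the News Catcher API
}

def request_params(prompt_type):
    """Hypothetical helper: sampling parameters for one completion request."""
    max_tokens = (
        SCRATCHPAD_MAX_TOKENS if prompt_type == "scratchpad"
        else ZERO_SHOT_MAX_TOKENS
    )
    return {"temperature": TEMPERATURE, "max_tokens": max_tokens}

print(request_params("scratchpad"))  # {'temperature': 0, 'max_tokens': 1300}
```

A dict like this could be splatted into whatever client API is in use (e.g. a chat-completions call), keeping the zero-shot and scratchpad settings from drifting apart.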