DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate this question, we present DISCOVERYBENCH, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DISCOVERYBENCH and find that even the best system scores only 25%.
Researcher Affiliation | Collaboration | Bodhisattwa Prasad Majumder (α), Harshit Surana (αβ), Dhruv Agarwal (γ), Bhavana Dalvi Mishra (α), Abhijeetsingh Meena (β), Aryan Prakhar (β), Tirth Vora (β), Tushar Khot (α), Ashish Sabharwal (α), Peter Clark (α). α: Allen Institute for AI; β: Open Locus; γ: University of Massachusetts Amherst.
Pseudocode | No | The paper describes its method in prose and provides an illustrative workflow diagram (Figure 1), but it contains no structured pseudocode or formal algorithm blocks.
Open Source Code | Yes | Website: https://github.com/allenai/discoverybench
Open Datasets | Yes | Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. The benchmark is released at: https://github.com/allenai/discoverybench. The licenses permitting the re-distribution of datasets collected from past work are provided in Appendix B and Table 3.
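The task structure described above (a dataset, its metadata, and a natural-language goal) can be sketched as a small loader. This is a minimal illustration, not the benchmark's actual schema; the field names (`dataset`, `metadata`, `goal`) and the example record are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class DiscoveryTask:
    """One DiscoveryBench-style task: a dataset, its metadata, and a goal."""
    dataset_path: str
    metadata: dict
    goal: str

def load_task(record: dict) -> DiscoveryTask:
    # Build a task from a raw record; metadata may be absent.
    return DiscoveryTask(
        dataset_path=record["dataset"],
        metadata=record.get("metadata", {}),
        goal=record["goal"],
    )

task = load_task({
    "dataset": "data/sociology/example.csv",          # hypothetical path
    "metadata": {"columns": ["year", "income"]},      # hypothetical metadata
    "goal": "How does median income vary with year?",
})
```

Keeping the goal as free-form natural language, separate from the tabular data and its metadata, mirrors how the benchmark hands an agent everything it needs to attempt a discovery workflow.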
Dataset Splits | Yes | Table 2 shows the diversity of tasks in both the train and test splits of DB-REAL. For the synthetic benchmark, 903 tasks are generated over 48 diverse domains and assigned to development and test sets using an 80/20 split, with each task additionally tagged with a difficulty level from 1-4. The DB-REAL train split contains 14 metadata files and 25 queries; the test split contains 144 metadata files and 239 queries.
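The 80/20 development/test assignment mentioned above can be sketched as a seeded shuffle-and-cut. This is an illustrative sketch, not the authors' splitting code; the task records and difficulty tagging here are assumed for the example.

```python
import random

def split_tasks(tasks, dev_frac=0.8, seed=0):
    """Shuffle tasks deterministically, then cut into dev/test splits."""
    rng = random.Random(seed)
    shuffled = tasks[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_frac)
    return shuffled[:cut], shuffled[cut:]

# 903 hypothetical synthetic tasks, each tagged with a difficulty from 1-4.
tasks = [{"id": i, "difficulty": 1 + i % 4} for i in range(903)]
dev, test = split_tasks(tasks)   # 722 dev tasks, 181 test tasks
```

Seeding the shuffle makes the split reproducible, which matters when a benchmark's development and test sets must stay fixed across releases.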
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models or memory amounts) for running its experiments. It only refers to the use of external LLM APIs (the OpenAI API for GPT-based models and the Together API for Llama-3) without specifying the local hardware environment.
Software Dependencies | No | The paper mentions using the OpenAI API and Together API for LLM access (GPT-4o, GPT-4, Llama-3-70B) and refers to 'pandas expressions' in the context of synthetic data generation. However, it provides no version numbers for these or other ancillary software libraries used in the experimental setup, nor a comprehensive list of software dependencies with versions.
Experiment Setup | No | The paper reports evaluations of LLM-based reasoning methods on DISCOVERYBENCH tasks, including comparisons across agents and LLMs, but the main text does not provide specific hyperparameter values, training configurations, or system-level settings for its own experimental setup.