DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate this question, we present DISCOVERYBENCH, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DISCOVERYBENCH and find that even the best system scores only 25%. |
| Researcher Affiliation | Collaboration | Bodhisattwa Prasad Majumder (α), Harshit Surana (αβ), Dhruv Agarwal (γ), Bhavana Dalvi Mishra (α), Abhijeetsingh Meena (β), Aryan Prakhar (β), Tirth Vora (β), Tushar Khot (α), Ashish Sabharwal (α), Peter Clark (α). α: Allen Institute for AI; β: Open Locus; γ: University of Massachusetts Amherst |
| Pseudocode | No | The paper describes steps in regular paragraph text and provides an illustrative workflow diagram in Figure 1, but does not contain structured pseudocode or algorithm blocks with formal steps. |
| Open Source Code | Yes | Website: https://github.com/allenai/discoverybench |
| Open Datasets | Yes | Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. The benchmark is released at: https://github.com/allenai/discoverybench. The licenses permitting the re-distribution of datasets collected from past work are provided in Appendix B and Table 3. |
| Dataset Splits | Yes | Table 2 shows the diversity of tasks in both the train and test splits for DB-REAL. We generate 903 tasks over 48 diverse domains and assign them to development and test sets using an 80/20 split, where each task is additionally tagged with a difficulty level from 1-4. The train split contains 14 metadata files and 25 queries; the test split contains 144 metadata files and 239 queries. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It only refers to the use of external LLM APIs (OpenAI API for GPT-based models and Together API for Llama3) without specifying the local hardware environment. |
| Software Dependencies | No | The paper mentions using the OpenAI API and Together API for LLM models (GPT-4o, GPT-4p, Llama-3-70B) and refers to 'pandas expression' in the context of synthetic data generation. However, it does not provide specific version numbers for these or other ancillary software libraries/solvers used in its experimental setup, nor a comprehensive list of software dependencies with versions. |
| Experiment Setup | No | The paper discusses various experimental evaluations of LLM-based reasoning methods and their performance on DISCOVERYBENCH tasks, including comparisons across different agents and LLMs. However, it does not provide specific hyperparameter values, training configurations, or system-level settings for its own experimental setup in the main text. |
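The Dataset Splits row above reports an 80/20 development/test assignment over 903 synthetic tasks. The paper does not specify how the assignment was implemented, so the following is only a minimal sketch of one reproducible way to do it: deterministic hash-based bucketing by task ID. The `assign_split` function and the `task-XXXX` ID format are hypothetical, not taken from the DiscoveryBench repository.

```python
import hashlib

def assign_split(task_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a task to 'dev' or 'test' by hashing
    its ID, so the split is stable across runs and machines."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in 0..99
    return "test" if bucket < test_fraction * 100 else "dev"

# Example: split 903 hypothetical task IDs roughly 80/20.
tasks = [f"task-{i:04d}" for i in range(903)]
splits = [assign_split(t) for t in tasks]
n_test = splits.count("test")  # roughly 20% of 903
```

Hashing by ID (rather than random shuffling with a seed) keeps a task's split assignment fixed even if tasks are later added or removed, which matters for a benchmark that may grow over time.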