Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents
Authors: Nolan Koblischke, Hyunseok Jang, Kristen Menou, Mohamad Ali-Dib
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e., with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. Reference solutions for each task are provided to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities. |
| Researcher Affiliation | Academia | Nolan Koblischke¹, Hyunseok Jang¹, Kristen Menou¹, Mohamad Ali-Dib². ¹University of Toronto; ²New York University Abu Dhabi. Correspondence to: Nolan Koblischke <EMAIL>, Kristen Menou <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes, particularly for the simulation environment and the baseline agent, but does not present them in structured pseudocode or algorithm blocks. Appendix F shows 'Prompt Templates', which are inputs to the agent, not algorithms for the methodology itself. |
| Open Source Code | Yes | Gravity-Bench-v1 is available at https://github.com/NolanKoblischke/GravityBench and https://huggingface.co/datasets/GravityBench/GravityBench. |
| Open Datasets | Yes | Gravity-Bench-v1 is available at https://github.com/NolanKoblischke/GravityBench and https://huggingface.co/datasets/GravityBench/GravityBench. |
| Dataset Splits | No | The paper describes different observation protocols ('full-obs' and 'budget-obs') and evaluation against baselines ('expert-ref-100'), but it does not specify explicit training/test/validation dataset splits typically used for training machine learning models. The benchmark provides tasks and simulations for agents to interact with, rather than predefined dataset splits for model training by the authors. |
| Hardware Specification | No | The paper evaluates various AI models (e.g., GPT-4o, Claude 3.5 Sonnet) but does not provide specific hardware details (like GPU models, CPU types, or memory) used by the authors to run their simulations or to evaluate these models. |
| Software Dependencies | No | The paper mentions several software tools and libraries: "All simulations are implemented using Rebound (Rein & Liu, 2012; Tamayo et al., 2020)... For most problems we use WHFast (Rein & Tamayo, 2015)... The agent can use our observe tool and a Python interpreter adapted from Langchain (Chase, 2022) with access to packages like numpy (Harris et al., 2020), scipy (Virtanen et al., 2020) and pandas (pandas development team, 2020)". While it cites the papers for these libraries, it does not provide specific version numbers for these software components (e.g., "Rebound vX.Y" or "Langchain vZ.W"). |
| Experiment Setup | Yes | The core design principle behind our benchmark is the concept of a rigorously-simulated, partially-observable environment. All simulations are implemented using Rebound (Rein & Liu, 2012; Tamayo et al., 2020)... Our standard Rebound simulation takes as input the stellar binary parameters (point masses, 3D positions, and 3D momenta), and any additional forces present. Rebound then solves Newton's gravity equations forward in time, typically for 10 orbits... The integration timestep is conservatively chosen to be one-five-thousandth of the system's orbital period. For problems where forces other than gravity are present, or the gravitational law has been modified, WHFast is not adequate. We then use IAS15, an adaptive-timestep 15th-order integrator where errors are kept below machine precision. We design a baseline agent around a ReAct-style scaffold (Yao et al., 2023). The baseline agent prompts are shown in Appendix F. |
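The Experiment Setup row quotes the paper's timestep rule: the integrator step is one-five-thousandth of the binary's orbital period. The sketch below illustrates that rule in plain Python via Kepler's third law; it is not the benchmark's actual code (which uses Rebound's WHFast/IAS15 integrators), and all numeric parameter values are illustrative assumptions.

```python
import math

# Gravitational constant in SI units, m^3 kg^-1 s^-2.
G = 6.674e-11

def orbital_period(a, m1, m2):
    """Kepler's third law: period of a two-body orbit with semi-major axis a (m)."""
    return 2.0 * math.pi * math.sqrt(a**3 / (G * (m1 + m2)))

def conservative_timestep(a, m1, m2, steps_per_orbit=5000):
    """Timestep of P/5000, mirroring the paper's conservative choice for WHFast."""
    return orbital_period(a, m1, m2) / steps_per_orbit

# Illustrative binary: two solar-mass stars separated by 1 AU
# (hypothetical values, not taken from the benchmark).
m_sun = 1.989e30   # kg
au = 1.496e11      # m
P = orbital_period(au, m_sun, m_sun)
dt = conservative_timestep(au, m_sun, m_sun)
print(f"orbital period: {P:.3e} s, timestep: {dt:.3e} s")
```

In Rebound itself, the equivalent would be setting `sim.dt` to this value before integrating with `sim.integrator = "whfast"`; for the modified-gravity tasks the paper instead relies on IAS15, which adapts its own step size.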