reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel

Authors: Carlota Parés Morlans, Michelle Yi, Claire Chen, Sarah A Wu, Rika Antonova, Tobias Gerstenberg, Jeannette Bohg

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark.
Researcher Affiliation	Academia	(1) Department of Computer Science, Stanford University, CA, USA; (2) Department of Psychology, Stanford University, CA, USA; (3) Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
Pseudocode	Yes	Algorithm 1 Causal-PIK
Open Source Code	No	The paper does not contain an explicit statement about the release of their own source code or a link to a code repository for the methodology described.
Open Datasets	Yes	We focus on the Virtual Tools (Allen et al., 2020) and PHYRE (Bakhtin et al., 2019) benchmarks
Dataset Splits	Yes	For PHYRE, for each of the 10-fold splits from Bakhtin et al., we train a model exclusively on the fold's training set, ensuring that Causal-PIK is tested on previously unseen puzzles. For the PHYRE benchmark, we train 10 separate dynamics models, one per fold. Each model is trained on 20 out of the 25 puzzles assigned to the training set for that fold.
Hardware Specification	No	The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies	Yes	We adapted the PHYRE-1B benchmark into a suite of online games using Planck.js (Shakiba, 2017), a JavaScript rewrite of the Box2D physics engine used in PHYRE (Bakhtin et al., 2019).
Experiment Setup	Yes	To initialize the GP for both Virtual Tools and PHYRE, we use n_initial = 9 initial data points. First, we use a Sobol sequence generator to sample a set of n_candidate = 500 candidate actions. Then, we evaluate the acquisition function at each of these n_candidate actions. Adopting the intuitive physics procedure proposed by Allen et al., we approximate the outcome of the n_best = 5 candidate actions with the highest acquisition function values using a probabilistic simulation of the task. We set n_pred to 20, which usually captures one collision but not the full roll-out.
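
The candidate-selection step quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: a Halton sequence stands in for the Sobol generator so the example needs only the standard library, the action space is taken to be a three-dimensional unit cube, and `toy_acquisition` is a hypothetical placeholder for the GP-based acquisition function. Only the values 500 and 5 come from the quoted setup.

```python
import heapq
import math

N_CANDIDATE = 500  # candidate actions sampled per iteration (from the quoted setup)
N_BEST = 5         # top candidates forwarded to the probabilistic simulation


def halton(index, base):
    """Return the index-th (1-based) van der Corput value in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def sample_candidates(n, bases=(2, 3, 5)):
    """Low-discrepancy points in the unit cube (Halton, standing in for Sobol)."""
    return [tuple(halton(i, b) for b in bases) for i in range(1, n + 1)]


def toy_acquisition(action):
    """Hypothetical acquisition function: highest at the centre of the cube."""
    return -math.dist(action, (0.5, 0.5, 0.5))


def select_best(candidates, acquisition, k):
    """Keep the k candidates with the highest acquisition values."""
    return heapq.nlargest(k, candidates, key=acquisition)


candidates = sample_candidates(N_CANDIDATE)
best_actions = select_best(candidates, toy_acquisition, N_BEST)
```

In a full pipeline, each entry of `best_actions` would then be passed to the noisy simulator described in the row above. The sketch stops before that step because the simulator's interface is not specified here.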