Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel

Authors: Carlota Parés Morlans, Michelle Yi, Claire Chen, Sarah A Wu, Rika Antonova, Tobias Gerstenberg, Jeannette Bohg

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark.
Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, CA, USA; (2) Department of Psychology, Stanford University, CA, USA; (3) Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
Pseudocode | Yes | Algorithm 1: Causal-PIK
Open Source Code | No | The paper does not contain an explicit statement about the release of its own source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We focus on the Virtual Tools (Allen et al., 2020) and PHYRE (Bakhtin et al., 2019) benchmarks.
Dataset Splits | Yes | For PHYRE, for each of the 10-fold splits from Bakhtin et al., we train a model exclusively on the fold's training set, ensuring that Causal-PIK is tested on previously unseen puzzles. For the PHYRE benchmark, we train 10 separate dynamics models, one per fold. Each model is trained on 20 out of the 25 puzzles assigned to the training set for that fold.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | Yes | We adapted the PHYRE-1B benchmark into a suite of online games using Planck.js (Shakiba, 2017), a JavaScript rewrite of the Box2D physics engine used in PHYRE (Bakhtin et al., 2019).
Experiment Setup | Yes | To initialize the GP for both Virtual Tools and PHYRE, we use n_initial = 9 initial data points. First, we use a Sobol sequence generator to sample a set of n_candidate = 500 candidate actions. Then, we evaluate the acquisition function at each of these n_candidate actions. Adopting the intuitive physics procedure proposed by Allen et al., we approximate the outcome of the n_best = 5 candidate actions with the highest acquisition function values using a probabilistic simulation of the task. We set n_pred to 20, which usually captures one collision but not the full roll-out.
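
The candidate-ranking step in the experiment setup can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything beyond the two constants is an assumption. Actions are encoded as a hypothetical 3-tuple in the unit cube, `toy_acquisition` is a made-up placeholder for the GP-based acquisition function, and uniform pseudo-random sampling stands in for the Sobol sequence generator so that the example needs only the standard library. The probabilistic simulation applied to the selected actions is not shown.

```python
import heapq
import random
from typing import Callable, List, Tuple

# Hypothetical action encoding (e.g. x, y, and a tool or size parameter in [0, 1]).
Action = Tuple[float, float, float]

N_CANDIDATE = 500  # candidate actions sampled per step (from the setup above)
N_BEST = 5         # top-scoring candidates passed on to the simulation


def sample_candidates(n: int, rng: random.Random) -> List[Action]:
    """Sample n candidate actions uniformly in the unit cube.

    Assumption: this uniform sampler is a stand-in for the Sobol
    sequence generator described in the setup.
    """
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]


def toy_acquisition(action: Action) -> float:
    """Made-up acquisition score: higher when closer to an arbitrary target."""
    target = (0.5, 0.5, 0.5)
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def select_best(
    candidates: List[Action],
    acquisition: Callable[[Action], float],
    n_best: int,
) -> List[Action]:
    """Return the n_best candidates with the highest acquisition values,
    ordered from highest to lowest score."""
    return heapq.nlargest(n_best, candidates, key=acquisition)


rng = random.Random(0)
candidates = sample_candidates(N_CANDIDATE, rng)
best = select_best(candidates, toy_acquisition, N_BEST)
```

Passing the acquisition function as an argument keeps the ranking step independent of how scores are produced, so a GP-based score could replace the toy one without changing `select_best`.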