Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel
Authors: Carlota Parés Morlans, Michelle Yi, Claire Chen, Sarah A Wu, Rika Antonova, Tobias Gerstenberg, Jeannette Bohg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, CA, USA; (2) Department of Psychology, Stanford University, CA, USA; (3) Department of Computer Science and Technology, University of Cambridge, Cambridge, UK. |
| Pseudocode | Yes | Algorithm 1 Causal-PIK |
| Open Source Code | No | The paper does not contain an explicit statement about the release of their own source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We focus on the Virtual Tools (Allen et al., 2020) and PHYRE (Bakhtin et al., 2019) benchmarks |
| Dataset Splits | Yes | For PHYRE, for each of the 10-fold splits from Bakhtin et al., we train a model exclusively on the fold's training set, ensuring that Causal-PIK is tested on previously unseen puzzles. For the PHYRE benchmark, we train 10 separate dynamics models, one per fold. Each model is trained on 20 out of the 25 puzzles assigned to the training set for that fold. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | Yes | We adapted the PHYRE-1B benchmark into a suite of online games using Planck.js (Shakiba, 2017), a JavaScript rewrite of the Box2D physics engine used in PHYRE (Bakhtin et al., 2019). |
| Experiment Setup | Yes | To initialize the GP for both Virtual Tools and PHYRE, we use $n_{\text{initial}} = 9$ initial data points. First, we use a Sobol sequence generator to sample a set of $n_{\text{candidate}} = 500$ candidate actions. Then, we evaluate the acquisition function at each of these $n_{\text{candidate}}$ actions. Adopting the intuitive physics procedure proposed by Allen et al., we approximate the outcome of the $n_{\text{best}} = 5$ candidate actions with the highest acquisition function values using a probabilistic simulation of the task. We set $n_{\text{pred}}$ to 20, which usually captures one collision but not the full roll-out. |
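
The candidate-selection step quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' implementation (the paper releases no code). It assumes actions are normalized to the unit cube, uses SciPy's `qmc.Sobol` as the sequence generator, and takes the acquisition function as a caller-supplied placeholder. The function name `select_top_candidates` is hypothetical. The follow-up probabilistic simulation of the selected actions is not shown.

```python
import warnings

import numpy as np
from scipy.stats import qmc

N_CANDIDATE = 500  # number of Sobol candidates, as quoted in the table
N_BEST = 5         # number of top-scoring candidates kept, as quoted in the table


def select_top_candidates(acquisition, dim, n_candidate=N_CANDIDATE,
                          n_best=N_BEST, seed=0):
    """Sample Sobol candidates in [0, 1)^dim and keep the n_best highest-scoring.

    `acquisition` is a hypothetical callable mapping one action (1-D array)
    to a scalar score; the unit-cube action space is an assumption.
    """
    sampler = qmc.Sobol(d=dim, scramble=True, seed=seed)
    with warnings.catch_warnings():
        # SciPy warns when the sample count is not a power of two (500 is not);
        # the points are still valid, only the balance guarantee is weakened.
        warnings.simplefilter("ignore", UserWarning)
        candidates = sampler.random(n_candidate)  # shape (n_candidate, dim)

    # Score every candidate, then sort indices by descending score.
    scores = np.array([acquisition(a) for a in candidates], dtype=float)
    top_idx = np.argsort(scores)[::-1][:n_best]
    return candidates[top_idx], scores[top_idx]


if __name__ == "__main__":
    # Toy acquisition: prefer actions close to the centre of the cube.
    toy_acquisition = lambda a: -float(np.sum((a - 0.5) ** 2))
    best_actions, best_scores = select_top_candidates(toy_acquisition, dim=3)
    print(best_actions.shape, best_scores)
```

In a full pipeline the returned actions would be the ones passed on to the simulation step. Drawing a power-of-two number of candidates (for example 512) would avoid the SciPy warning, but 500 is kept here to match the quoted setting.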