Benchmarks for Physical Reasoning AI
Authors: Andrew Melnik, Robin Schiewer, Moritz Lange, Andrei Ioan Muresanu, Mozhgan Saeidi, Animesh Garg, Helge Ritter
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Physical reasoning is a crucial aspect in the development of general AI systems, given that human learning starts with interacting with the physical world before progressing to more complex concepts. Although researchers have studied and assessed the physical reasoning of AI approaches through various specific benchmarks, there is no comprehensive approach to evaluating and measuring progress. Therefore, we aim to offer an overview of existing benchmarks and their solution approaches and propose a unified perspective for measuring the physical reasoning capacity of AI systems. We select benchmarks that are designed to test algorithmic performance in physical reasoning tasks. |
| Researcher Affiliation | Academia | Andrew Melnik, Center for Cognitive Interaction Technology, University of Bielefeld, Germany; Robin Schiewer, Institute for Neural Computation, Department of Computer Science, Ruhr University Bochum, Germany; Andrei Muresanu, University of Waterloo, Canada / Vector Institute, Canada; Mozhgan Saeidi, Department of Computer Science, University of Toronto, Canada / Vector Institute, Canada / Department of Computer Science, Stanford University, USA; Animesh Garg, University of Toronto, Canada / Vector Institute, Canada / Georgia Institute of Technology, USA; Helge Ritter, Center for Cognitive Interaction Technology, University of Bielefeld, Germany |
| Pseudocode | No | The paper is a survey and analysis of existing benchmarks and their solution approaches. It describes methods from other works but does not present its own structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to an "Awesome list" (https://github.com/ndrwmlnk/Awesome-Benchmarks-for-Physical-Reasoning-AI), which is a curated resource for benchmarks, not the source code for the methodology or analysis presented in this paper. It states: "We have established a dynamic list of physical reasoning benchmarks that can be continuously enhanced through the submission of pull requests for new benchmarks." There is no explicit statement or link for code related to the survey's own methodology. |
| Open Datasets | Yes | In this survey, we discuss 16 datasets and benchmarks (see Tables 1 and 2) to train and evaluate the physical reasoning capabilities of AI agents. The benchmarks we examine here involve a range of physical variables which are central to physical interactions amongst material objects, such as size, position, velocity, direction of movement, force and contact, mass, acceleration and gravity, and, in some cases, even electrical charge. The observability of these variables is strongly affected by the perceptual modalities (e.g. vision, touch) that are available to an agent. |
| Dataset Splits | No | This paper is a survey and review of existing benchmarks. It discusses how *other* papers define their dataset splits (e.g., for CLEVRER: "CLEVRER contains 10,000 training videos, 5,000 validation videos, and 5,000 test videos"), but it does not perform its own experiments that would require defining specific dataset splits for its own methodology. |
| Hardware Specification | No | This paper is a survey and review of existing benchmarks. It discusses hardware used in *other* works when describing solution approaches (e.g., mentioning specific GPU models for benchmarks), but it does not specify any hardware used for its own methodology (the survey and analysis presented in the paper). |
| Software Dependencies | No | This paper is a survey and review of existing benchmarks. It discusses software dependencies used in *other* works (e.g., mentioning "R3D (Tran et al., 2018)" or "Mask R-CNN (He et al., 2017)" for solution approaches to benchmarks), but it does not specify any software dependencies with version numbers for its own methodology. |
| Experiment Setup | No | This paper is a survey and review of existing benchmarks. It discusses experimental setups and hyperparameters from *other* works when describing solution approaches for various benchmarks, but it does not provide specific experimental setup details, hyperparameters, or system-level training settings for its own methodology. |