Benchmarks for Physical Reasoning AI
Authors: Andrew Melnik, Robin Schiewer, Moritz Lange, Andrei Ioan Muresanu, Mozhgan Saeidi, Animesh Garg, Helge Ritter
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Physical reasoning is a crucial aspect in the development of general AI systems, given that human learning starts with interacting with the physical world before progressing to more complex concepts. Although researchers have studied and assessed the physical reasoning of AI approaches through various specific benchmarks, there is no comprehensive approach to evaluating and measuring progress. Therefore, we aim to offer an overview of existing benchmarks and their solution approaches and propose a unified perspective for measuring the physical reasoning capacity of AI systems. We select benchmarks that are designed to test algorithmic performance in physical reasoning tasks. |
| Researcher Affiliation | Academia | Andrew Melnik, Center for Cognitive Interaction Technology, University of Bielefeld, Germany; Robin Schiewer, Institute for Neural Computation, Department of Computer Science, Ruhr University Bochum, Germany; Andrei Muresanu, University of Waterloo, Canada / Vector Institute, Canada; Mozhgan Saeidi, Department of Computer Science, University of Toronto, Canada / Vector Institute, Canada / Department of Computer Science, Stanford University, USA; Animesh Garg, University of Toronto, Canada / Vector Institute, Canada / Georgia Institute of Technology, USA; Helge Ritter, Center for Cognitive Interaction Technology, University of Bielefeld, Germany |
| Pseudocode | No | The paper is a survey and analysis of existing benchmarks and their solution approaches. It describes methods from other works but does not present its own structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to an "Awesome list" (https://github.com/ndrwmlnk/Awesome-Benchmarks-for-Physical-Reasoning-AI), which is a curated resource for benchmarks, not the source code for the methodology or analysis presented in this paper. It states: "We have established a dynamic list of physical reasoning benchmarks that can be continuously enhanced through the submission of pull requests for new benchmarks." There is no explicit statement or link for code related to the survey's own methodology. |
| Open Datasets | Yes | In this survey, we discuss 16 datasets and benchmarks (see Tables 1 and 2) to train and evaluate the physical reasoning capabilities of AI agents. The benchmarks we examine here involve a range of physical variables which are central to physical interactions amongst material objects, such as size, position, velocity, direction of movement, force and contact, mass, acceleration and gravity, and, in some cases, even electrical charge. The observability of these variables is strongly affected by the perceptual modalities (e.g. vision, touch) that are available to an agent. |
| Dataset Splits | No | This paper is a survey and review of existing benchmarks. It discusses how *other* papers define their dataset splits (e.g., for CLEVRER: "CLEVRER contains 10,000 training videos, 5,000 validation videos, and 5,000 test videos"), but it does not perform its own experiments that would require defining specific dataset splits for its own methodology. |
| Hardware Specification | No | This paper is a survey and review of existing benchmarks. It discusses hardware used in *other* works when describing solution approaches (e.g., mentioning specific GPU models for benchmarks), but it does not specify any hardware used for its own methodology (the survey and analysis presented in the paper). |
| Software Dependencies | No | This paper is a survey and review of existing benchmarks. It discusses software dependencies used in *other* works (e.g., mentioning "R3D (Tran et al., 2018)" or "Mask R-CNN (He et al., 2017)" for solution approaches to benchmarks), but it does not specify any software dependencies with version numbers for its own methodology. |
| Experiment Setup | No | This paper is a survey and review of existing benchmarks. It discusses experimental setups and hyperparameters from *other* works when describing solution approaches for various benchmarks, but it does not provide specific experimental setup details, hyperparameters, or system-level training settings for its own methodology. |