Benchmarking Progress to Infant-Level Physical Reasoning in AI
Authors: Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, Aniruddha Kembhavi
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ten neural-network architectures developed for video understanding on tasks designed to test these models' ability to reason about three essential physical principles... We find strikingly consistent results across 60 experiments with multiple systems, training regimes, and evaluation metrics |
| Researcher Affiliation | Collaboration | Luca Weihs¹, Amanda Rose Yuile², Renée Baillargeon², Cynthia L. Fisher², Gary Marcus³, Roozbeh Mottaghi¹, Aniruddha Kembhavi¹ (¹Allen Institute for AI; ²University of Illinois at Urbana-Champaign; ³New York University) EMAIL EMAIL {gfmarcus}@gmail.com |
| Pseudocode | No | The paper describes experimental designs and procedures in descriptive text (e.g., 'Secondary Object Familiarization', 'Primary Object Familiarization', 'Continuity Test') but does not include any blocks explicitly labeled as 'Pseudocode' or 'Algorithm', nor are the steps formatted like code. |
| Open Source Code | Yes | Data, Code, and Videos available at: https://allenai.org/project/inflevel... Data and code will be made publicly available. |
| Open Datasets | Yes | We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel)... InfLevel-Lab contains 5,700 videos... InfLevel-Sim: 75,000 videos collected using the robotic-arm-enabled agent within the AI2-THOR environment... We consider a large collection of video-understanding and embodied-AI models... trained on popular large-scale datasets (e.g., HowTo100M, Kinetics400, SSv2, etc.)... Appendix G.1 lists the Charades, HowTo100M, Kinetics400, Something-Something v2, and Epic Kitchens100 datasets with their licenses. |
| Dataset Splits | Yes | For fair comparison across models, we create a fixed query set we call JointTrain of approximately 25k videos comprised of roughly 5k videos from each of the training splits of the Charades (Ch), HowTo100M (HT100M), Kinetics400 (K400), Something-Something v2 (SSv2), and Epic Kitchens100 (EK100) datasets... Using the popular SlowFast 8×8 R50 architecture... pretrained on Kinetics400... we generate representations on 16k videos in the Kinetics400 validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions the 'PyTorchVideo model zoo', the 'AllenAct framework', and the 'PyTorch library' but does not specify their version numbers or the versions of other key software dependencies used in the experiments. |
| Experiment Setup | No | The paper describes the design of the InfLevel benchmark, the evaluation methodology, and the baseline models used. While it mentions aspects such as OOD metrics and model architectures, it lacks specific hyperparameters (e.g., learning rates, batch sizes, epochs) for the training of models (including the Conv2GRU models trained by the authors) and other concrete system-level training configurations. |
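The "Dataset Splits" row describes building a fixed ~25k-video JointTrain query set by sampling roughly 5k videos from each of five training splits. A minimal sketch of that construction, with placeholder video IDs standing in for the real Charades, HowTo100M, Kinetics400, SSv2, and EK100 splits (the function name and seeding scheme are assumptions for illustration, not the paper's code):

```python
import random

def build_joint_train(splits, per_dataset=5000, seed=0):
    """Sample `per_dataset` videos from each split to form a fixed query set.

    A fixed seed makes the sampled query set reproducible across runs,
    which is the point of pre-committing to a single JointTrain set.
    """
    rng = random.Random(seed)
    query_set = []
    # Iterate in sorted order so the result does not depend on dict ordering.
    for name, videos in sorted(splits.items()):
        k = min(per_dataset, len(videos))
        query_set.extend(rng.sample(videos, k))
    return query_set

# Toy usage with 10 placeholder video IDs per dataset, sampling 5 from each:
splits = {d: [f"{d}_{i}" for i in range(10)]
          for d in ["Ch", "HT100M", "K400", "SSv2", "EK100"]}
joint = build_joint_train(splits, per_dataset=5)
print(len(joint))  # 5 datasets x 5 videos each
```

With the paper's real splits and `per_dataset=5000`, the same procedure yields the ~25k-video set described above.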