Benchmarking Progress to Infant-Level Physical Reasoning in AI
Authors: Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, Aniruddha Kembhavi
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ten neural-network architectures developed for video understanding on tasks designed to test these models' ability to reason about three essential physical principles... We find strikingly consistent results across 60 experiments with multiple systems, training regimes, and evaluation metrics |
| Researcher Affiliation | Collaboration | Luca Weihs¹, Amanda Rose Yuile², Renée Baillargeon², Cynthia L. Fisher², Gary Marcus³, Roozbeh Mottaghi¹, Aniruddha Kembhavi¹ (¹Allen Institute for AI; ²University of Illinois at Urbana-Champaign; ³New York University) EMAIL EMAIL {gfmarcus}@gmail.com |
| Pseudocode | No | The paper describes experimental designs and procedures in descriptive text (e.g., 'Secondary Object Familiarization', 'Primary Object Familiarization', 'Continuity Test') but does not include any blocks explicitly labeled as 'Pseudocode' or 'Algorithm', nor are the steps formatted like code. |
| Open Source Code | Yes | Data, Code, and Videos available at: https://allenai.org/project/inflevel... Data and code will be made publicly available. |
| Open Datasets | Yes | We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel)... InfLevel-Lab contains 5,700 videos... InfLevel-Sim: 75,000 videos collected using the robotic-arm-enabled agent within the AI2-THOR environment... We consider a large collection of video-understanding and embodied-AI models... trained on popular large-scale datasets (e.g., HowTo100M, Kinetics400, SSv2, etc.)... Appendix G.1 lists the Charades, HowTo100M, Kinetics400, Something-Something v2, and Epic Kitchens100 datasets with their licenses. |
| Dataset Splits | Yes | For fair comparison across models, we create a fixed query set we call JointTrain of approximately 25k videos comprised of roughly 5k videos from each of the training splits of the Charades (Ch), HowTo100M (HT100M), Kinetics400 (K400), Something-Something v2 (SSv2), and Epic Kitchens100 (EK100) datasets... Using the popular SlowFast 8×8 R50 architecture... pretrained on Kinetics400... we generate representations on 16k videos in the Kinetics400 validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions the 'PyTorchVideo model zoo', the 'AllenAct framework', and the 'PyTorch library' but does not specify their version numbers or the versions of other key software dependencies used in the experiments. |
| Experiment Setup | No | The paper describes the design of the InfLevel benchmark, the evaluation methodology, and the baseline models used. While it mentions aspects such as OOD metrics and model architectures, it lacks specific hyperparameters (e.g., learning rates, batch sizes, epochs) for the training of models (including the Conv2GRU models trained by the authors) and other concrete system-level training configurations. |
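The "Dataset Splits" row describes building a fixed ~25k-video JointTrain query set by sampling roughly 5k videos from each of five training splits. A minimal sketch of that construction, with placeholder video IDs standing in for the real Charades, HowTo100M, Kinetics400, SSv2, and EK100 splits (the function name and seeding scheme are assumptions for illustration, not the paper's code):

```python
import random

def build_joint_train(splits, per_dataset=5000, seed=0):
    """Sample `per_dataset` videos from each split to form a fixed query set.

    A fixed seed makes the sampled query set reproducible across runs,
    which is the point of pre-committing to a single JointTrain set.
    """
    rng = random.Random(seed)
    query_set = []
    # Iterate in sorted order so the result does not depend on dict ordering.
    for name, videos in sorted(splits.items()):
        k = min(per_dataset, len(videos))
        query_set.extend(rng.sample(videos, k))
    return query_set

# Toy usage with 10 placeholder video IDs per dataset, sampling 5 from each:
splits = {d: [f"{d}_{i}" for i in range(10)]
          for d in ["Ch", "HT100M", "K400", "SSv2", "EK100"]}
joint = build_joint_train(splits, per_dataset=5)
print(len(joint))  # 5 datasets x 5 videos each
```

With the paper's real splits and `per_dataset=5000`, the same procedure yields the ~25k-video set described above.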