A comparison between humans and AI at recognizing objects in unusual poses

Authors: Netta Ollikka, Amro Kamal Mohamed Abbas, Andrea Perin, Markku Kilpeläinen, Stéphane Deny

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we compare human subjects with state-of-the-art deep networks for vision and state-of-the-art large vision-language models at recognizing objects in various poses. We collected a dataset of objects viewed in different poses (upright and rotated out-of-plane), to test the ability of humans to recognize these objects, and compare this ability to state-of-the-art deep networks (Figure 1).
Researcher Affiliation | Academia | Netta Ollikka (EMAIL), Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Amro Abbas (EMAIL), The African Institute for Mathematical Sciences, Mbour-Thies, Senegal; Andrea Perin (EMAIL), Department of Computer Science, Aalto University, Espoo, Finland; Markku Kilpeläinen (EMAIL), Department of Psychology and Logopedics, University of Helsinki, Finland; Stéphane Deny (EMAIL), Department of Neuroscience and Biomedical Engineering and Department of Computer Science, Aalto University, Espoo, Finland
Pseudocode | No | The paper describes methods and procedures in narrative text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All code and data is available at https://github.com/BRAIN-Aalto/unusual_poses.
Open Datasets | Yes | All code and data is available at https://github.com/BRAIN-Aalto/unusual_poses. We chose 51 different object categories from the ImageNet classes (see Appendix D for the list of objects)
Dataset Splits | Yes | Each observer performed 49 trials, in which the image was in one of three types of poses: upright in 17 trials, rotated-correct (correctly classified by EfficientNet, see Dataset collection 2.1) in 17 trials, and rotated-incorrect (incorrectly classified by EfficientNet) in 15 trials.
Hardware Specification | No | The paper mentions a "22.5″ VIEWPixx display" used for human experiments, but does not provide specific details about the GPUs, CPUs, or other computational hardware used for running the machine tests or model evaluations.
Software Dependencies | No | The paper mentions using the "MATLAB Psychophysics Toolbox" and models from the "PyTorch Image Models library (timm)", the "Hugging Face Transformers library", and "Torch Hub". However, no specific version numbers are provided for these software components.
Experiment Setup | Yes | For the large vision-language models (all models excluding SigLIP), the experiment was conducted via the API. Each model was shown the 147 images and provided with the following prompt (see examples in Appendix A): What's in this image? A. [label 1] B. [label 2] Choose either A or B and answer in one or two words. For pure vision networks, the choice was made by looking at the highest activation of the softmax output layer for these two labels.
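The forced-choice rule for pure vision networks described in the row above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes we already have a model's softmax output vector and the class indices of the two candidate labels (the function name, vector size, and indices here are all hypothetical).

```python
def forced_choice(softmax_output, idx_a, idx_b):
    """Two-alternative forced choice: answer 'A' if the softmax
    activation at label A's class index exceeds label B's, else 'B'."""
    return "A" if softmax_output[idx_a] > softmax_output[idx_b] else "B"

# Toy example with a fake 5-way softmax vector (illustrative values/indices).
probs = [0.05, 0.60, 0.10, 0.20, 0.05]
print(forced_choice(probs, 1, 3))  # -> A  (0.60 > 0.20)
```

In practice the softmax vector would come from a 1000-way ImageNet classifier, and only the two entries corresponding to the candidate labels are compared; all other classes are ignored.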