Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Authors: Antonia Wüst, Tim Tobiasch, Lukas Helff, Inga Ibs, Wolfgang Stammer, Devendra Singh Dhami, Constantin A. Rothkopf, Kristian Kersting
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition. |
| Researcher Affiliation | Academia | 1AIML Lab, TU Darmstadt 2Hessian Center for AI (hessian.ai) 3Institute of Psychology, TU Darmstadt 4Centre for Cognitive Science, TU Darmstadt 5Uncertainty in AI Group, TU Eindhoven 6German Center for AI (DFKI). |
| Pseudocode | No | The paper describes methods in narrative text and refers to prompts used, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/ml-research/bongard-in-wonderland. |
| Open Datasets | Yes | For our evaluations, we considered the 100 original Bongard problems of (Bongard & Hawkins, 1970). We used the dataset variation of (Depeweg et al., 2024), which contains high-resolution images of the original diagrams. |
| Dataset Splits | No | The paper describes tasks for evaluating models on Bongard Problems and a human study, but does not specify explicit training, validation, or test dataset splits for machine learning models. For the human study it states, "In each session, participants were asked to solve 33 BPs. The BPs were shown in the original order (#2 to #100). BP#1 was used in the instructions." |
| Hardware Specification | Yes | LLaVA-OneVision 72B (Li et al., 2025) and InternVL 2.5 78B (Chen et al., 2024). For simplicity, in the following the models are referred to as o1, GPT-4o, Claude, Gemini 2.0 and 1.5, Qwen2VL, LLaVA-OneVision, and InternVL 2.5 respectively. The specific models and their configurations are given in Suppl. A.2. ... Table 2: Details on the models used in the evaluations. ... LLaVA-OneVision ... Devices 3 GPUs (NVIDIA A100-SXM4-80GB) ... Qwen2VL ... Devices 3 GPUs (NVIDIA A100-SXM4-80GB) ... InternVL 2.5 ... Devices 4 GPUs (NVIDIA A100-SXM4-80GB) |
| Software Dependencies | No | Table 2 lists frameworks such as 'openai (API)', 'anthropic (API)', 'google (API)', 'transformers', and 'lmdeploy'. However, it does not provide specific version numbers for 'transformers' or 'lmdeploy', nor does it specify client-side versions for the API frameworks that would typically be required for exact reproduction. |
| Experiment Setup | Yes | Model Setup. For all evaluations we tasked our selection of VLMs with solving each BP three times. ... Table 2: Details on the models used in the evaluations. ... o1 o1-2024-12-17 Default, max tokens=2048 ... GPT-4o gpt-4o-2024-08-06 Default, max tokens=2048 ... Claude claude-3-5-sonnet-20241022 Default, max tokens=2048 ... Gemini 2.0 gemini-2.0-flash-exp Default, max tokens=2048 ... Gemini 1.5 gemini-1.5-pro Default, max tokens=2048 ... GPT-4o (LLM-Judge) gpt-4o-2024-08-06 Default, temperature=0.0, max tokens=2048 |
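The evaluation setup quoted above (three attempts per Bongard problem, a shared max-token budget of 2048, and a greedy GPT-4o judge) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the model IDs and decoding settings are taken from Table 2, but the prompt text and the `build_requests` helper are assumptions.

```python
# Sketch of the per-model evaluation configuration reported in Table 2.
# Model IDs and decoding settings come from the paper; prompt text and
# helper names are illustrative assumptions, not the authors' code.

MODEL_CONFIGS = {
    "o1": {"model": "o1-2024-12-17", "max_tokens": 2048},
    "GPT-4o": {"model": "gpt-4o-2024-08-06", "max_tokens": 2048},
    "Claude": {"model": "claude-3-5-sonnet-20241022", "max_tokens": 2048},
    "Gemini 2.0": {"model": "gemini-2.0-flash-exp", "max_tokens": 2048},
    "Gemini 1.5": {"model": "gemini-1.5-pro", "max_tokens": 2048},
    # The LLM-as-judge run pins temperature to 0.0 for deterministic scoring.
    "GPT-4o (LLM-Judge)": {
        "model": "gpt-4o-2024-08-06",
        "max_tokens": 2048,
        "temperature": 0.0,
    },
}

N_RUNS = 3  # each Bongard problem is attempted three times per model


def build_requests(bp_id: int, model_name: str) -> list[dict]:
    """Return the N_RUNS chat-style request payloads for one Bongard problem."""
    cfg = MODEL_CONFIGS[model_name]
    prompt = f"Solve Bongard problem #{bp_id}."  # placeholder prompt text
    return [
        {**cfg, "messages": [{"role": "user", "content": prompt}]}
        for _ in range(N_RUNS)
    ]
```

In practice each payload would be sent to the corresponding provider API (openai, anthropic, google) or a locally served open model; the judge configuration differs from the solver runs only in its fixed temperature.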