Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Authors: Antonia Wüst, Tim Tobiasch, Lukas Helff, Inga Ibs, Wolfgang Stammer, Devendra Singh Dhami, Constantin A. Rothkopf, Kristian Kersting
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition. |
| Researcher Affiliation | Academia | 1AIML Lab, TU Darmstadt 2Hessian Center for AI (hessian.ai) 3Institute of Psychology, TU Darmstadt 4Centre for Cognitive Science, TU Darmstadt 5Uncertainty in AI Group, TU Eindhoven 6German Center for AI (DFKI). |
| Pseudocode | No | The paper describes methods in narrative text and refers to prompts used, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/ml-research/bongard-in-wonderland. |
| Open Datasets | Yes | For our evaluations, we considered the 100 original Bongard problems of (Bongard & Hawkins, 1970). We used the dataset variation of (Depeweg et al., 2024), which contains high-resolution images of the original diagrams. |
| Dataset Splits | No | The paper describes tasks for evaluating models on Bongard Problems and a human study, but does not specify explicit training, validation, or test dataset splits for machine learning models. For the human study it states, "In each session, participants were asked to solve 33 BPs. The BPs were shown in the original order (#2 to #100). BP#1 was used in the instructions." |
| Hardware Specification | Yes | LLaVA-OneVision 72B (Li et al., 2025) and InternVL 2.5 78B (Chen et al., 2024). For simplicity, in the following the models are referred to as o1, GPT-4o, Claude, Gemini 2.0 and 1.5, Qwen2VL, LLaVA-OneVision, and InternVL 2.5 respectively. The specific models and their configurations are given in Suppl. A.2. ... Table 2: Details on the models used in the evaluations. ... LLaVA-OneVision ... Devices 3 GPUs (NVIDIA A100-SXM4-80GB) ... Qwen2VL ... Devices 3 GPUs (NVIDIA A100-SXM4-80GB) ... InternVL 2.5 ... Devices 4 GPUs (NVIDIA A100-SXM4-80GB) |
| Software Dependencies | No | Table 2 lists frameworks such as 'openai (API)', 'anthropic (API)', 'google (API)', 'transformers', and 'lmdeploy'. However, it does not provide specific version numbers for 'transformers' or 'lmdeploy', nor does it specify client-side versions for the API frameworks that would typically be required for exact reproduction. |
| Experiment Setup | Yes | Model Setup. For all evaluations we tasked our selection of VLMs with solving each BP three times. ... Table 2: Details on the models used in the evaluations. ... o1 o1-2024-12-17 Default, max tokens=2048 ... GPT-4o gpt-4o-2024-08-06 Default, max tokens=2048 ... Claude claude-3-5-sonnet-20241022 Default, max tokens=2048 ... Gemini 2.0 gemini-2.0-flash-exp Default, max tokens=2048 ... Gemini 1.5 gemini-1.5-pro Default, max tokens=2048 ... GPT-4o (LLM-Judge) gpt-4o-2024-08-06 Default, temperature=0.0, max tokens=2048 |
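The evaluation setup quoted above (three attempts per Bongard problem, a shared max-token budget of 2048, and a greedy GPT-4o judge) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the model IDs and decoding settings are taken from Table 2, but the prompt text and the `build_requests` helper are assumptions.

```python
# Sketch of the per-model evaluation configuration reported in Table 2.
# Model IDs and decoding settings come from the paper; prompt text and
# helper names are illustrative assumptions, not the authors' code.

MODEL_CONFIGS = {
    "o1": {"model": "o1-2024-12-17", "max_tokens": 2048},
    "GPT-4o": {"model": "gpt-4o-2024-08-06", "max_tokens": 2048},
    "Claude": {"model": "claude-3-5-sonnet-20241022", "max_tokens": 2048},
    "Gemini 2.0": {"model": "gemini-2.0-flash-exp", "max_tokens": 2048},
    "Gemini 1.5": {"model": "gemini-1.5-pro", "max_tokens": 2048},
    # The LLM-as-judge run pins temperature to 0.0 for deterministic scoring.
    "GPT-4o (LLM-Judge)": {
        "model": "gpt-4o-2024-08-06",
        "max_tokens": 2048,
        "temperature": 0.0,
    },
}

N_RUNS = 3  # each Bongard problem is attempted three times per model


def build_requests(bp_id: int, model_name: str) -> list[dict]:
    """Return the N_RUNS chat-style request payloads for one Bongard problem."""
    cfg = MODEL_CONFIGS[model_name]
    prompt = f"Solve Bongard problem #{bp_id}."  # placeholder prompt text
    return [
        {**cfg, "messages": [{"role": "user", "content": prompt}]}
        for _ in range(N_RUNS)
    ]
```

In practice each payload would be sent to the corresponding provider API (openai, anthropic, google) or a locally served open model; the judge configuration differs from the solver runs only in its fixed temperature.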