KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Authors: Eunice Yiu, Maan Qraitem, Anisa Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. |
| Researcher Affiliation | Collaboration | 1 University of California, Berkeley 2 Boston University 3 Google DeepMind 4 Toyota Technological Institute at Chicago |
| Pseudocode | No | The paper describes a three-stage experimental paradigm (Figure 3) and prompting strategies in Appendix A, but does not present structured pseudocode or algorithm blocks for its methodology. |
| Open Source Code | Yes | Benchmark (code, data, models) is available at: https://github.com/ey242/KiVA |
| Open Datasets | Yes | Benchmark (code, data, models) is available at: https://github.com/ey242/KiVA. Our dataset utilizes real-world, physically grounded objects curated from established 3D datasets of common household items (Downs et al., 2022) and toys that are familiar to human children (Stojanov et al., 2021). |
| Dataset Splits | Yes | There are 100 object transformations for each subdomain of transformation, totaling 1,400 object transformations in KiVA and 2,900 in KiVA-adults. We recruited 250 adults (21 to 40 years old) on Prolific to complete the benchmark such that every trial was annotated by 3-13 adults. We recruited 42 children (aged 3 to 5 years, mean = 4.07 years, se = 0.11 years) from early childhood centers and Children Helping Science to complete a random subset of 10 trials (2 trials per transformation domain), totaling 420 responses. |
| Hardware Specification | Yes | Open-source models ran on an A6000 48 GB single GPU for under 12 hours. |
| Software Dependencies | No | The paper mentions 'jsPsych' for human task development and 'Pillow' for code generation prompts, but does not provide specific version numbers for these or other key software components used for the experimental setup. |
| Experiment Setup | Yes | For all models, the temperature is set to 1 and the maximum token count is set to 300 (no cap for GPT-o1). We randomize each experiment over three seeds and run each trial (Figure 3) on a model three times, with the order of the test choices shuffled. |
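The randomization protocol in the Experiment Setup row (three seeds, three runs per trial, shuffled test-choice order, temperature 1, 300-token cap) can be sketched as follows. This is a minimal illustrative sketch, not the KiVA codebase: the names `GEN_CONFIG`, `shuffled_choices`, and `build_runs`, and the example answer options, are all hypothetical.

```python
import random

# Hypothetical generation settings mirroring the reported setup
# (temperature 1, max 300 tokens; the paper notes no cap for GPT-o1).
GEN_CONFIG = {"temperature": 1.0, "max_tokens": 300}

def shuffled_choices(choices, seed):
    """Return a deterministic per-seed shuffle of the test choices."""
    rng = random.Random(seed)
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled

def build_runs(trial_choices, seeds=(0, 1, 2)):
    """One run per seed for a single trial, each with its own choice order."""
    return {seed: shuffled_choices(trial_choices, seed) for seed in seeds}

# Example trial with three hypothetical answer options.
runs = build_runs(["(A) 2 dogs", "(B) 3 dogs", "(C) 4 dogs"])
```

Fixing the shuffle to the seed makes each of the three runs reproducible while still varying the position of the correct answer across runs, which guards against position bias in the models' responses.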