KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Authors: Eunice Yiu, Maan Qraitem, Anisa Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. |
| Researcher Affiliation | Collaboration | 1 University of California, Berkeley 2 Boston University 3 Google DeepMind 4 Toyota Technological Institute at Chicago |
| Pseudocode | No | The paper describes a three-stage experimental paradigm (Figure 3) and prompting strategies in Appendix A, but does not present structured pseudocode or algorithm blocks for its methodology. |
| Open Source Code | Yes | Benchmark (code, data, models) is available at: https://github.com/ey242/KiVA |
| Open Datasets | Yes | Benchmark (code, data, models) is available at: https://github.com/ey242/KiVA. Our dataset utilizes real-world, physically grounded objects curated from established 3D datasets of common household items (Downs et al., 2022) and toys that are familiar to human children (Stojanov et al., 2021). |
| Dataset Splits | Yes | There are 100 object transformations for each subdomain of transformation, totaling 1,400 object transformations in KiVA and 2,900 in KiVA-adults. We recruited 250 adults (21 to 40 years old) on Prolific to complete the benchmark such that every trial was annotated by 3-13 adults. We recruited 42 children (aged 3 to 5 years, mean = 4.07 years, se = 0.11 years) from early childhood centers and Children Helping Science to complete a random subset of 10 trials (2 trials per transformation domain), totaling 420 responses. |
| Hardware Specification | Yes | Open-source models ran on an A6000 48 GB single GPU for under 12 hours. |
| Software Dependencies | No | The paper mentions 'jsPsych' for human task development and 'Pillow' for code generation prompts, but does not provide specific version numbers for these or other key software components used for the experimental setup. |
| Experiment Setup | Yes | For all models, the temperature is set to 1 and the maximum token count is set to 300 (no cap for GPT-o1). We randomize each experiment over three seeds and run each trial (Figure 3) on a model three times, with the order of the test choices shuffled. |
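The randomization protocol in the Experiment Setup row (three seeds, three runs per trial, shuffled test-choice order, temperature 1, 300-token cap) can be sketched as follows. This is a minimal illustrative sketch, not the KiVA codebase: the names `GEN_CONFIG`, `shuffled_choices`, and `build_runs`, and the example answer options, are all hypothetical.

```python
import random

# Hypothetical generation settings mirroring the reported setup
# (temperature 1, max 300 tokens; the paper notes no cap for GPT-o1).
GEN_CONFIG = {"temperature": 1.0, "max_tokens": 300}

def shuffled_choices(choices, seed):
    """Return a deterministic per-seed shuffle of the test choices."""
    rng = random.Random(seed)
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled

def build_runs(trial_choices, seeds=(0, 1, 2)):
    """One run per seed for a single trial, each with its own choice order."""
    return {seed: shuffled_choices(trial_choices, seed) for seed in seeds}

# Example trial with three hypothetical answer options.
runs = build_runs(["(A) 2 dogs", "(B) 3 dogs", "(C) 4 dogs"])
```

Fixing the shuffle to the seed makes each of the three runs reproducible while still varying the position of the correct answer across runs, which guards against position bias in the models' responses.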