Core Knowledge Deficits in Multi-Modal Language Models
Authors: Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs. |
| Researcher Affiliation | Academia | 1University of California San Diego 2Johns Hopkins University 3Emory University 4University of North Carolina at Chapel Hill 5Stanford University 6Ben-Gurion University of the Negev 7University of Michigan 8University College London 9Carnegie Mellon University. Correspondence to: Yijiang Li <EMAIL>, Dezhi Luo <EMAIL>, Hokin Deng <EMAIL>. |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page at https://williamium3000.github.io/core-knowledge/. This is a project page, not an explicit statement of code release or a direct link to a code repository for the methodology. |
| Open Datasets | Yes | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science... Project page at https://williamium3000.github.io/core-knowledge/. |
| Dataset Splits | No | The paper introduces a benchmark called Core Cognition comprising 1,503 samples, but it does not specify any training, validation, or test dataset splits for reproduction. |
| Hardware Specification | Yes | Inference is performed on clusters equipped with 8 NVIDIA A100 80 GB GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For each k-choice question, we cyclically rotate the answer options k times, generating k versions with different option orders... We apply a Hybrid Matching mechanism. Specifically, we prioritize a rule-based template matching approach to extract answers from MLLM responses. If the template matching method fails, we turn to a model-based ensemble strategy using four advanced LLMs: Qwen2.5-72B-Instruct, Mixtral-8x7B-Instruct-v0.1, DeepSeek-R1-Distill-Llama-70B, and Llama-3.1-70B. The LLM-based result is accepted only when at least three of the four models produce consistent extractions; otherwise, the matching is deemed unsuccessful. |
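The two mechanisms quoted in the Experiment Setup row (cyclic rotation of answer options and the 3-of-4 ensemble agreement rule) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names and the example answers are hypothetical.

```python
from collections import Counter

def rotate_options(options):
    """Generate all k cyclic rotations of a k-choice option list,
    so each question is asked with k different option orders."""
    k = len(options)
    return [options[i:] + options[:i] for i in range(k)]

def ensemble_extract(extractions, min_agreement=3):
    """Accept the answer extracted by an LLM ensemble only when at
    least `min_agreement` extractors agree; otherwise return None,
    i.e. the matching is deemed unsuccessful."""
    answer, count = Counter(extractions).most_common(1)[0]
    return answer if count >= min_agreement else None

# A 3-choice question yields 3 rotated versions.
print(rotate_options(["A) cat", "B) dog", "C) fish"]))
# Four extractor outputs: 3 of 4 agree, so "B" is accepted.
print(ensemble_extract(["B", "B", "B", "C"]))  # -> B
# No 3-way agreement: the extraction is rejected.
print(ensemble_extract(["A", "B", "C", "D"]))  # -> None
```

In the paper's pipeline, `ensemble_extract` would only be invoked as a fallback after rule-based template matching fails.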