Core Knowledge Deficits in Multi-Modal Language Models

Authors: Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs."
Researcher Affiliation: Academia. (1) University of California San Diego; (2) Johns Hopkins University; (3) Emory University; (4) University of North Carolina at Chapel Hill; (5) Stanford University; (6) Ben-Gurion University of the Negev; (7) University of Michigan; (8) University College London; (9) Carnegie Mellon University. Correspondence to: Yijiang Li <EMAIL>, Dezhi Luo <EMAIL>, Hokin Deng <EMAIL>.
Pseudocode: No. The paper describes its methodology and experiments but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper points to https://williamium3000.github.io/core-knowledge/, but this is a project page only, with no explicit statement of code release and no direct link to a code repository for the methodology.
Open Datasets: Yes. "We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science..." Project page at https://williamium3000.github.io/core-knowledge/.
Dataset Splits: No. The paper introduces the Core Cognition benchmark comprising 1,503 samples, but it does not specify any training, validation, or test splits for reproduction.
Hardware Specification: Yes. "Inference is performed on clusters equipped with 8 NVIDIA A100 80 GB GPUs."
Software Dependencies: No. The paper does not provide version numbers for software dependencies such as libraries, frameworks, or programming languages.
Experiment Setup: Yes. "For each k-choice question, we cyclically rotate the answer options k times, generating k versions with different option orders... We apply a Hybrid Matching mechanism. Specifically, we prioritize a rule-based template-matching approach to extract answers from MLLM responses. If template matching fails, we turn to a model-based ensemble strategy using four advanced LLMs: Qwen2.5-72B-Instruct, Mixtral-8x7B-Instruct-v0.1, DeepSeek-R1-Distill-Llama-70B, and Llama-3.1-70B. The LLM-based result is accepted only when at least three of the four models produce consistent extractions; otherwise, the matching is deemed unsuccessful."
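The option-rotation and hybrid-matching procedure described in the experiment setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names are hypothetical, and the rule-based matcher is a deliberately simple substring check standing in for the paper's unspecified templates.

```python
from collections import Counter

def rotate_options(options):
    """Generate all k cyclic rotations of a k-choice option list,
    as used to create k versions of each question."""
    k = len(options)
    return [options[i:] + options[:i] for i in range(k)]

def template_match(response, options):
    """Hypothetical rule-based extractor: return the option whose text
    appears (uniquely) in the response, else None."""
    hits = [opt for opt in options if opt.lower() in response.lower()]
    return hits[0] if len(hits) == 1 else None

def ensemble_match(extractions):
    """Accept the ensemble extraction only when at least 3 of the
    4 extractor LLMs agree; otherwise matching is unsuccessful."""
    answer, votes = Counter(extractions).most_common(1)[0]
    return answer if votes >= 3 else None

def hybrid_match(response, options, llm_extractions):
    """Prioritize template matching; fall back to the LLM ensemble."""
    answer = template_match(response, options)
    return answer if answer is not None else ensemble_match(llm_extractions)
```

For a 3-choice question, `rotate_options(["A", "B", "C"])` yields the three orderings evaluated per item (and, across 230 models and 11 prompts, the paper's 230 × 11 = 2,530 data points per analysis); `ensemble_match(["B", "B", "B", "C"])` accepts "B", while a 2-2 split is deemed unsuccessful.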