Core Knowledge Deficits in Multi-Modal Language Models
Authors: Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs. |
| Researcher Affiliation | Academia | 1University of California San Diego 2Johns Hopkins University 3Emory University 4University of North Carolina at Chapel Hill 5Stanford University 6Ben-Gurion University of the Negev 7University of Michigan 8University College London 9Carnegie Mellon University. Correspondence to: Yijiang Li <EMAIL>, Dezhi Luo <EMAIL>, Hokin Deng <EMAIL>. |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page at https://williamium3000.github.io/core-knowledge/. This is a project page, not an explicit statement of code release or a direct link to a code repository for the methodology. |
| Open Datasets | Yes | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science... Project page at https://williamium3000.github.io/core-knowledge/. |
| Dataset Splits | No | The paper introduces a benchmark called Core Cognition comprising 1,503 samples, but it does not specify any training, validation, or test dataset splits for reproduction. |
| Hardware Specification | Yes | Inference is performed on clusters equipped with 8 NVIDIA A100 80 GB GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For each k-choice question, we cyclically rotate the answer options k times, generating k versions with different option orders... We apply a Hybrid Matching mechanism. Specifically, we prioritize a rule-based template matching approach to extract answers from MLLM responses. If the template matching method fails, we turn to a model-based ensemble strategy using four advanced LLMs: Qwen2.5-72B-Instruct, Mixtral-8x7B-Instruct-v0.1, DeepSeek-R1-Distill-Llama-70B, and Llama-3.1-70B. The LLM-based result is accepted only when at least three of the four models produce consistent extractions; otherwise, the matching is deemed unsuccessful. |
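The two mechanisms quoted in the Experiment Setup row (cyclic rotation of answer options and the 3-of-4 ensemble agreement rule) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names and the example answers are hypothetical.

```python
from collections import Counter

def rotate_options(options):
    """Generate all k cyclic rotations of a k-choice option list,
    so each question is asked with k different option orders."""
    k = len(options)
    return [options[i:] + options[:i] for i in range(k)]

def ensemble_extract(extractions, min_agreement=3):
    """Accept the answer extracted by an LLM ensemble only when at
    least `min_agreement` extractors agree; otherwise return None,
    i.e. the matching is deemed unsuccessful."""
    answer, count = Counter(extractions).most_common(1)[0]
    return answer if count >= min_agreement else None

# A 3-choice question yields 3 rotated versions.
print(rotate_options(["A) cat", "B) dog", "C) fish"]))
# Four extractor outputs: 3 of 4 agree, so "B" is accepted.
print(ensemble_extract(["B", "B", "B", "C"]))  # -> B
# No 3-way agreement: the extraction is rejected.
print(ensemble_extract(["A", "B", "C", "D"]))  # -> None
```

In the paper's pipeline, `ensemble_extract` would only be invoked as a fallback after rule-based template matching fails.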