Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability
Authors: Yujin Han, Lei Xu, Sirui Chen, Difan Zou, Chaochao Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate deep structure comprehension in mainstream LLMs across tasks, revealing widespread deep understanding that strongly correlates with accuracy (in Section 4.2). Further comparison between ADCE and AICE shows tested closed-source LLMs excel in deep comprehension, while tested open-source LLMs shift from surface to deep understanding with scale (in Section 4.4). In this section, we experimentally explore three critical questions: (1) Deep structure comprehension in LLMs: Do LLMs process questions through an understanding of the deep structure of problems? We analyze this using the proposed ADCE in Section 4.2. (2) Prerequisite of deep structure comprehension: What prerequisite enables LLMs to utilize deep structure in their responses? Insights into this question are discussed in Section 4.3. (3) Comparative influence of deep and surface structures: Which has a stronger causal effect on the outputs of LLMs: deep or surface structures? These investigations, detailed in Section 4.4, collectively address the queries raised in Section 1, assessing whether LLMs are deep thinkers or merely surface structure learners. Additionally, to further support Section 3.4, we evaluate whether ADCE assesses core semantic understanding more reliably than accuracy under spurious correlations (in Section 4.5). |
| Researcher Affiliation | Collaboration | 1Shanghai Artificial Intelligence Laboratory, 2The University of Hong Kong, 3Tongji University |
| Pseudocode | Yes | Algorithm 1 provides the detailed algorithmic steps required to estimate ADCE, which includes the following: First, we perform initial inference on the full dataset to select samples with correct answers. Then, for these correctly answered samples, we apply interventions using two strategies: Masking and Rephrasing. Finally, we conduct a second round of inference on the intervened samples and calculate ADCE based on the inference results. Algorithm 1: Approximated Direct Causal Effect (ADCE) Estimation in LLMs... Algorithm 2: Intervention Data Generation Method M... Algorithm 3: Rephrase By Agent |
| Open Source Code | No | The code for our project is available at ADCE Project. |
| Open Datasets | Yes | We employ five popular benchmarks across mathematics, logic, and commonsense knowledge. For mathematics, we consider the 2-Digit Multiplication task (bench authors, 2023) and GSM8k (Cobbe et al., 2021) for multi-step mathematical problems. Logical reasoning tasks include Word Unscrambling (bench authors, 2023), which requires unscrambling given letters to form an English word via implicit reasoning, and the binary Analytic Entailment task (bench authors, 2023) for linguistic entailment. Commonsense knowledge benchmarks include CommonsenseQA (Talmor et al., 2018), a multiple-choice task covering daily life knowledge. |
| Dataset Splits | Yes | For the test data, after sampling the training set, we apply the same sampling rules to the remaining population. We select 200 samples each from the majority and minority groups within this population. ... Specifically, we fine-tune Llama-3-8b on Analytic Entailment and compare its ADCE before and after SFT. ... We then divided these 70 sets into training and testing sets with a ratio of 6:4. Consequently, we obtained a training set of 210 samples derived from 42 original samples and a test set of 140 samples derived from the intervention on 28 original samples. |
| Hardware Specification | Yes | To fine-tune the llama-based models, we utilize the llama-recipes library and train the models on a cloud server with 2 NVIDIA Tesla A100 GPUs with 80GB of memory each. |
| Software Dependencies | No | To fine-tune the llama-based models, we utilize the llama-recipes library and train the models on a cloud server with 2 NVIDIA Tesla A100 GPUs with 80GB of memory each. We employ the LoRA (Hu et al., 2022) technique from the peft library for memory-efficient training. |
| Experiment Setup | Yes | We set the batch size to 20 and the learning rate to 0.0003 for both llama-3-8b and llama-3-70b. For other parameters, we use the default values defined in the official code from the llama-recipes library. We train the models until convergence, and both llama-3-8b and llama-3-70b converge within 200 steps. ... We set the batch size to 50, and the learning rate to 0.001 and 0.0003 for llama-3-8b and llama-3-70b, respectively. For other parameters, we use the default values defined in the official code from the llama-recipes library. We train the models until convergence. In all training cases, the models converge within 250 steps. |
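
The ADCE estimation steps quoted in the Pseudocode row (filter initially correct samples, intervene via masking or rephrasing, re-infer, aggregate) can be sketched in Python. This is a minimal toy, not the authors' implementation: the `answer_fn` "model", the `intervene_fn` intervention, and the reading of ADCE as the rate of answer changes after intervention are all assumptions made for illustration.

```python
def estimate_adce(samples, answer_fn, intervene_fn):
    """Toy sketch of Algorithm 1 (Approximated Direct Causal Effect).

    Step 1: initial inference over the full dataset; keep correct answers.
    Step 2: intervene on each kept sample (masking/rephrasing stand-in).
    Step 3: second inference on intervened samples; aggregate the effect
            as the fraction of previously correct answers that change.
    """
    correct = [s for s in samples if answer_fn(s["q"]) == s["a"]]
    if not correct:
        return 0.0
    changed = sum(answer_fn(intervene_fn(s["q"])) != s["a"] for s in correct)
    return changed / len(correct)


# Toy demo: the "model" evaluates arithmetic strings; the intervention
# alters the underlying problem, so every answer flips.
samples = [{"q": "2+3", "a": "5"}, {"q": "4+4", "a": "8"}]
answer_fn = lambda q: str(eval(q))        # stand-in for LLM inference
intervene_fn = lambda q: "1+" + q         # stand-in for a deep intervention
print(estimate_adce(samples, answer_fn, intervene_fn))  # 1.0
```

In this toy, a change rate of 1.0 means every intervention altered the model's answer, i.e. the intervened structure fully drives the output.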
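
The split arithmetic quoted in the Dataset Splits row can be checked directly. The variant count of 5 per original sample is inferred from 210/42 and 140/28, not stated explicitly in the quote:

```python
# Reconstructing the quoted 6:4 split of 70 sets (42 train / 28 test
# original samples), each yielding 5 intervened variants (inferred).
total_sets = 70
train_sets = int(total_sets * 0.6)    # 42 original samples for training
test_sets = total_sets - train_sets   # 28 original samples for testing
variants_per_sample = 5               # inferred from 210/42 = 140/28 = 5
train_samples = train_sets * variants_per_sample
test_samples = test_sets * variants_per_sample
print(train_samples, test_samples)  # 210 140
```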
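
Since the setup names the peft library and llama-recipes but omits the LoRA hyperparameters, a hedged configuration sketch may help readers reproduce the fine-tuning. Everything except the batch size, learning rate, and step budget is an assumption (marked in comments), not a value reported by the paper:

```python
# Config fragment only: a plausible LoRA setup for the reported Llama-3-8b
# fine-tuning. Requires the peft library (pip install peft).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # assumed rank (not reported)
    lora_alpha=16,                        # assumed scaling (not reported)
    lora_dropout=0.05,                    # assumed (not reported)
    target_modules=["q_proj", "v_proj"],  # common Llama choice (assumed)
    task_type="CAUSAL_LM",
)

train_args = {
    "per_device_train_batch_size": 20,  # reported batch size
    "learning_rate": 3e-4,              # reported for llama-3-8b
    "max_steps": 200,                   # convergence reported within 200 steps
}
```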