Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability
Authors: Yujin Han, Lei Xu, Sirui Chen, Difan Zou, Chaochao Lu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate deep structure comprehension in mainstream LLMs across tasks, revealing widespread deep understanding that strongly correlates with accuracy (in Section 4.2). Further comparison between ADCE and AICE shows tested closed-source LLMs excel in deep comprehension, while tested open-source LLMs shift from surface to deep understanding with scale (in Section 4.4). In this section, we experimentally explore three critical questions: (1) Deep structure comprehension in LLMs: Do LLMs process questions through an understanding of the deep structure of problems? We analyze this using the proposed ADCE in Section 4.2. (2) Prerequisite of deep structure comprehension: What prerequisite enables LLMs to utilize deep structure in their responses? Insights into this question are discussed in Section 4.3. (3) Comparative influence of deep and surface structures: Which has a stronger causal effect on the outputs of LLMs: deep or surface structures? These investigations, detailed in Section 4.4, collectively address the queries raised in Section 1, assessing whether LLMs are deep thinkers or merely surface structure learners. Additionally, to further support Section 3.4, we evaluate whether ADCE assesses core semantic understanding more reliably than accuracy under spurious correlations (in Section 4.5). |
| Researcher Affiliation | Collaboration | 1Shanghai Artificial Intelligence Laboratory, 2The University of Hong Kong, 3Tongji University |
| Pseudocode | Yes | Algorithm 1 provides the detailed algorithmic steps required to estimate ADCE, which includes the following: First, we perform initial inference on the full dataset to select samples with correct answers. Then, for these correctly answered samples, we apply interventions using two strategies: Masking and Rephrasing. Finally, we conduct a second round of inference on the intervened samples and calculate ADCE based on the inference results. Algorithm 1: Approximated Direct Causal Effect (ADCE) Estimation in LLMs... Algorithm 2: Intervention Data Generation Method M... Algorithm 3: Rephrase By Agent |
| Open Source Code | No | The code for our project is available at ADCE Project. |
| Open Datasets | Yes | We employ five popular benchmarks across mathematics, logic, and commonsense knowledge. For mathematics, we consider the 2-Digit Multiplication task (bench authors, 2023) and GSM8k (Cobbe et al., 2021) for multi-step mathematical problems. Logical reasoning tasks include Word Unscrambling (bench authors, 2023), which requires unscrambling given letters to form an English word via implicit reasoning, and the binary Analytic Entailment task (bench authors, 2023) for linguistic entailment. Commonsense knowledge benchmarks include CommonsenseQA (Talmor et al., 2018), a multiple-choice task covering daily life knowledge. |
| Dataset Splits | Yes | For the test data, after sampling the training set, we apply the same sampling rules to the remaining population. We select 200 samples each from the majority and minority groups within this population. ... Specifically, we fine-tune Llama-3-8b on Analytic Entailment and compare its ADCE before and after SFT. ... We then divided these 70 sets into training and testing sets with a ratio of 6:4. Consequently, we obtained a training set of 210 samples derived from 42 original samples and a test set of 140 samples derived from the intervention on 28 original samples. |
| Hardware Specification | Yes | To fine-tune the llama-based models, we utilize the llama-recipes library and train the models on a cloud server with 2 NVIDIA Tesla A100 GPUs with 80GB of memory each. |
| Software Dependencies | No | To fine-tune the llama-based models, we utilize the llama-recipes library and train the models on a cloud server with 2 NVIDIA Tesla A100 GPUs with 80GB of memory each. We employ the LoRA (Hu et al., 2022) technique from the peft library for memory-efficient training. |
| Experiment Setup | Yes | We set the batch size to 20 and the learning rate to 0.0003 for both llama-3-8b and llama-3-70b. For other parameters, we use the default values defined in the official code from the llama-recipes library. We train the models until convergence, and both llama-3-8b and llama-3-70b converge within 200 steps. ... We set the batch size to 50, and the learning rate to 0.001 and 0.0003 for llama-3-8b and llama-3-70b, respectively. For other parameters, we use the default values defined in the official code from the llama-recipes library. We train the models until convergence. In all training cases, the models converge within 250 steps. |
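
The ADCE estimation steps quoted in the Pseudocode row (filter initially correct samples, intervene via masking or rephrasing, re-infer, aggregate) can be sketched in Python. This is a minimal toy, not the authors' implementation: the `answer_fn` "model", the `intervene_fn` intervention, and the reading of ADCE as the rate of answer changes after intervention are all assumptions made for illustration.

```python
def estimate_adce(samples, answer_fn, intervene_fn):
    """Toy sketch of Algorithm 1 (Approximated Direct Causal Effect).

    Step 1: initial inference over the full dataset; keep correct answers.
    Step 2: intervene on each kept sample (masking/rephrasing stand-in).
    Step 3: second inference on intervened samples; aggregate the effect
            as the fraction of previously correct answers that change.
    """
    correct = [s for s in samples if answer_fn(s["q"]) == s["a"]]
    if not correct:
        return 0.0
    changed = sum(answer_fn(intervene_fn(s["q"])) != s["a"] for s in correct)
    return changed / len(correct)


# Toy demo: the "model" evaluates arithmetic strings; the intervention
# alters the underlying problem, so every answer flips.
samples = [{"q": "2+3", "a": "5"}, {"q": "4+4", "a": "8"}]
answer_fn = lambda q: str(eval(q))        # stand-in for LLM inference
intervene_fn = lambda q: "1+" + q         # stand-in for a deep intervention
print(estimate_adce(samples, answer_fn, intervene_fn))  # 1.0
```

In this toy, a change rate of 1.0 means every intervention altered the model's answer, i.e. the intervened structure fully drives the output.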
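
The split arithmetic quoted in the Dataset Splits row can be checked directly. The variant count of 5 per original sample is inferred from 210/42 and 140/28, not stated explicitly in the quote:

```python
# Reconstructing the quoted 6:4 split of 70 sets (42 train / 28 test
# original samples), each yielding 5 intervened variants (inferred).
total_sets = 70
train_sets = int(total_sets * 0.6)    # 42 original samples for training
test_sets = total_sets - train_sets   # 28 original samples for testing
variants_per_sample = 5               # inferred from 210/42 = 140/28 = 5
train_samples = train_sets * variants_per_sample
test_samples = test_sets * variants_per_sample
print(train_samples, test_samples)  # 210 140
```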
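
Since the setup names the peft library and llama-recipes but omits the LoRA hyperparameters, a hedged configuration sketch may help readers reproduce the fine-tuning. Everything except the batch size, learning rate, and step budget is an assumption (marked in comments), not a value reported by the paper:

```python
# Config fragment only: a plausible LoRA setup for the reported Llama-3-8b
# fine-tuning. Requires the peft library (pip install peft).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # assumed rank (not reported)
    lora_alpha=16,                        # assumed scaling (not reported)
    lora_dropout=0.05,                    # assumed (not reported)
    target_modules=["q_proj", "v_proj"],  # common Llama choice (assumed)
    task_type="CAUSAL_LM",
)

train_args = {
    "per_device_train_batch_size": 20,  # reported batch size
    "learning_rate": 3e-4,              # reported for llama-3-8b
    "max_steps": 200,                   # convergence reported within 200 steps
}
```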