Nuance Matters: Probing Epistemic Consistency in Causal Reasoning
Authors: Shaobo Cui, Junyou Li, Luca Mouchel, Yiyang Feng, Boi Faltings
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical studies on 21 high-profile LLMs, including GPT-4, Claude 3, and LLaMA3-70B, we find evidence that current models struggle to maintain epistemic consistency in identifying the polarity and intensity of intermediates in causal reasoning. |
| Researcher Affiliation | Academia | Shaobo Cui¹, Junyou Li², Luca Mouchel¹, Yiyang Feng¹, Boi Faltings¹; ¹EPFL, Switzerland; ²University of Waterloo, Canada. shaobo.cui@epfl.ch, EMAIL, luca.mouchel@epfl.ch, yiyang.feng@epfl.ch, boi.faltings@epfl.ch |
| Pseudocode | No | The paper describes methods and metrics using mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/cui-shaobo/causal-consistency |
| Open Datasets | Yes | To ensure the defeasibility of causal pairs, allowing models to generate intermediates with varying polarity and intensity, we utilize the test dataset of ε-CAUSAL (Cui et al. 2024b) as our foundational dataset, which comprises 1,970 defeasible cause-effect pairs. |
| Dataset Splits | Yes | We utilize the test dataset of ε-CAUSAL (Cui et al. 2024b) as our foundational dataset, which comprises 1,970 defeasible cause-effect pairs. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only generally acknowledges IT and financial support. |
| Software Dependencies | No | The paper mentions several software libraries in its references (e.g., NumPy, Matplotlib, PyTorch, Transformers, NLTK, Accelerate) but does not provide specific version numbers for these or any other ancillary software dependencies used in their experimental setup. |
| Experiment Setup | Yes | (i) Intermediate generation: the prompt for generating these fine-grained intermediates is presented in Figure 6; (ii) Intermediate ranking: from these generated intermediates, we use the same LLM to rank the intermediates to identify their polarities (supporting or defeating) and intensity. The prompt for ranking these fine-grained intermediates is presented in Figure 7. |
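The two-stage setup quoted above (intermediate generation, then ranking by the same LLM) can be sketched as follows. This is a minimal illustration, not the authors' code: `call_llm`, the prompt templates, the line-based parsing, and the `consistency` measure are all hypothetical stand-ins for the paper's actual prompts (Figures 6 and 7) and metrics.

```python
from typing import Callable, List

# Hypothetical prompt templates, standing in for the paper's Figures 6 and 7.
GEN_PROMPT = ("Given the cause '{cause}' and effect '{effect}', generate {n} "
              "intermediates ranging from strongly supporting to strongly "
              "defeating, one per line.")
RANK_PROMPT = ("Rank these intermediates for the causal pair ('{cause}', "
               "'{effect}') from most supporting to most defeating, one per "
               "line:\n{intermediates}")


def generate_intermediates(call_llm: Callable[[str], str], cause: str,
                           effect: str, n: int = 4) -> List[str]:
    """Stage (i): ask the LLM for n intermediates of varying polarity/intensity."""
    reply = call_llm(GEN_PROMPT.format(cause=cause, effect=effect, n=n))
    return [line.strip() for line in reply.splitlines() if line.strip()]


def rank_intermediates(call_llm: Callable[[str], str], cause: str, effect: str,
                       intermediates: List[str]) -> List[str]:
    """Stage (ii): ask the same LLM to order the intermediates it produced."""
    reply = call_llm(RANK_PROMPT.format(
        cause=cause, effect=effect, intermediates="\n".join(intermediates)))
    return [line.strip() for line in reply.splitlines() if line.strip()]


def consistency(generated: List[str], ranked: List[str]) -> float:
    """Illustrative check only: fraction of positions where the model's ranking
    agrees with the generation order (the paper's metrics are more involved)."""
    matches = sum(g == r for g, r in zip(generated, ranked))
    return matches / max(len(generated), 1)
```

`call_llm` is passed in as a plain `str -> str` callable so the sketch stays model-agnostic; any of the 21 evaluated LLMs could be wrapped behind this interface.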