Nuance Matters: Probing Epistemic Consistency in Causal Reasoning
Authors: Shaobo Cui, Junyou Li, Luca Mouchel, Yiyang Feng, Boi Faltings
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical studies on 21 high-profile LLMs, including GPT-4, Claude 3, and LLaMA3-70B, we find evidence that current models struggle to maintain epistemic consistency in identifying the polarity and intensity of intermediates in causal reasoning. |
| Researcher Affiliation | Academia | Shaobo Cui¹, Junyou Li², Luca Mouchel¹, Yiyang Feng¹, Boi Faltings¹; ¹EPFL, Switzerland; ²University of Waterloo, Canada. shaobo.cui@epfl.ch, EMAIL, luca.mouchel@epfl.ch, yiyang.feng@epfl.ch, boi.faltings@epfl.ch |
| Pseudocode | No | The paper describes methods and metrics using mathematical formulations and textual descriptions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/cui-shaobo/causal-consistency |
| Open Datasets | Yes | To ensure the defeasibility of causal pairs, allowing models to generate intermediates with varying polarity and intensity, we utilize the test dataset of ε-CAUSAL (Cui et al. 2024b) as our foundational dataset, which comprises 1,970 defeasible cause-effect pairs. |
| Dataset Splits | Yes | We utilize the test dataset of ε-CAUSAL (Cui et al. 2024b) as our foundational dataset, which comprises 1,970 defeasible cause-effect pairs. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only generally acknowledges IT and financial support. |
| Software Dependencies | No | The paper mentions several software libraries in its references (e.g., NumPy, Matplotlib, PyTorch, Transformers, NLTK, Accelerate) but does not provide specific version numbers for these or any other ancillary software dependencies used in their experimental setup. |
| Experiment Setup | Yes | (i) Intermediate generation: the prompt for generating these fine-grained intermediates is presented in Figure 6; (ii) Intermediate ranking: from these generated intermediates, we use the same LLM to rank the intermediates to identify their polarities (supporting or defeating) and intensity. The prompt for ranking these fine-grained intermediates is presented in Figure 7. |
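The two-stage setup quoted above (intermediate generation, then ranking by the same LLM) can be sketched as follows. This is a minimal illustration, not the authors' code: `call_llm`, the prompt templates, the line-based parsing, and the `consistency` measure are all hypothetical stand-ins for the paper's actual prompts (Figures 6 and 7) and metrics.

```python
from typing import Callable, List

# Hypothetical prompt templates, standing in for the paper's Figures 6 and 7.
GEN_PROMPT = ("Given the cause '{cause}' and effect '{effect}', generate {n} "
              "intermediates ranging from strongly supporting to strongly "
              "defeating, one per line.")
RANK_PROMPT = ("Rank these intermediates for the causal pair ('{cause}', "
               "'{effect}') from most supporting to most defeating, one per "
               "line:\n{intermediates}")


def generate_intermediates(call_llm: Callable[[str], str], cause: str,
                           effect: str, n: int = 4) -> List[str]:
    """Stage (i): ask the LLM for n intermediates of varying polarity/intensity."""
    reply = call_llm(GEN_PROMPT.format(cause=cause, effect=effect, n=n))
    return [line.strip() for line in reply.splitlines() if line.strip()]


def rank_intermediates(call_llm: Callable[[str], str], cause: str, effect: str,
                       intermediates: List[str]) -> List[str]:
    """Stage (ii): ask the same LLM to order the intermediates it produced."""
    reply = call_llm(RANK_PROMPT.format(
        cause=cause, effect=effect, intermediates="\n".join(intermediates)))
    return [line.strip() for line in reply.splitlines() if line.strip()]


def consistency(generated: List[str], ranked: List[str]) -> float:
    """Illustrative check only: fraction of positions where the model's ranking
    agrees with the generation order (the paper's metrics are more involved)."""
    matches = sum(g == r for g, r in zip(generated, ranked))
    return matches / max(len(generated), 1)
```

`call_llm` is passed in as a plain `str -> str` callable so the sketch stays model-agnostic; any of the 21 evaluated LLMs could be wrapped behind this interface.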