Compositional Causal Reasoning Evaluation in Language Models

Authors: Jacqueline R. M. A. Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V. Nori, Javier Gonzalez

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We deploy Alg. 1 to evaluate CCR in seven LM architectures, with and without chain-of-thought (CoT) prompting. Even on a simple CCR problem, our framework revealed taxonomically distinct patterns of inconsistent and invalid reasoning, ranging from II to VC."
Researcher Affiliation | Collaboration | Cornell Tech; Harvard University; Microsoft Research Cambridge. Correspondence to: Jacqueline Maasch <EMAIL>.
Pseudocode | Yes | "Section 5 suggests one possible procedure for assessing compositional consistency in causal reasoning (Alg. 1), based on compositional properties of the ATE and PNS in graphs with cutpoints."
Open Source Code | Yes | "To facilitate future work, we provide open-source code for randomly generating qualitative and quantitative CCR tasks of scalable graphical complexity." Project page: https://jmaasch.github.io/ccr/
Open Datasets | No | The paper generates tasks and samples data with the provided code rather than releasing a static, pre-existing dataset. For example, Appendix F (Automated Task Generator) states: "Generate factual and counterfactual text prompts corresponding to the generated SCM and chosen theme. Variables are assigned random human names. ... Sample observational and interventional data from the SCM."
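As a rough illustration of the sampling step quoted above, the sketch below draws observational and interventional data from a toy two-variable binary SCM. This is hypothetical code, not the authors' generator: the real task generator builds random graphs of scalable complexity and themed text prompts, whereas here the graph is a fixed X -> Y chain and all names are ours.

```python
import numpy as np

def sample_scm(n, p=0.7, do_x=None, seed=None):
    """Sample n draws from a toy binary SCM (illustrative only).

    Exogenous variables U_X, U_Y ~ Bernoulli(p). Structural equations:
    X := U_X (overridden by the intervention do(X) = do_x if given),
    Y := X AND U_Y. Returns boolean arrays (x, y).
    """
    rng = np.random.default_rng(seed)
    u_x = rng.random(n) < p
    u_y = rng.random(n) < p
    x = u_x if do_x is None else np.full(n, bool(do_x))
    y = x & u_y  # child is a deterministic function of parent + noise
    return x, y

# Observational vs. interventional samples, as in the quoted pipeline.
x_obs, y_obs = sample_scm(1000, seed=1)
x_int, y_int = sample_scm(1000, do_x=1, seed=1)
```

Under do(X) = 1 every unit receives the treatment, so differences between `y_obs` and `y_int` reflect the interventional rather than observational distribution.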
Dataset Splits | No | The paper describes generating samples and subsampling to compute PNS estimates for evaluation, rather than fixed training/validation/test splits of a static dataset: "For each cause-effect pair, we sampled 1000 sets of exogenous variable values. ... 1000 factual responses and 1000 counterfactual responses were randomly subsampled (one per set of five replicate responses). The subsample of factual and counterfactual responses was then used to compute the PNS."
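The PNS computation from subsampled paired responses can be sketched as a simple Monte Carlo estimator. This is our illustrative reading, not the authors' code: for a binary cause-effect pair, PNS = P(Y_x = 1, Y_{x'} = 0), estimated as the fraction of paired factual/counterfactual responses where the effect occurs factually but not counterfactually.

```python
import numpy as np

def estimate_pns(factual, counterfactual):
    """Estimate PNS = P(Y_x = 1, Y_{x'} = 0) from paired responses.

    Index i of each array corresponds to one subsampled pair generated
    from the same set of exogenous variable values (names are ours).
    """
    f = np.asarray(factual, dtype=bool)
    cf = np.asarray(counterfactual, dtype=bool)
    if f.shape != cf.shape:
        raise ValueError("factual/counterfactual responses must be paired")
    return float(np.mean(f & ~cf))

# Four illustrative pairs: the effect occurs factually but not
# counterfactually in 2 of 4, giving a PNS estimate of 0.5.
print(estimate_pns([1, 1, 0, 1], [0, 1, 0, 0]))  # -> 0.5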
Hardware Specification | Yes | "We used a single A100 GPU for all experiments." (from the Model Inference section)
Software Dependencies | No | The paper names the specific language models used (e.g., Llama 2, Llama 3, Phi-3-Mini, GPT-4o) and states that Hugging Face hyperparameters were used, but it does not pin versions for core software components such as Python, PyTorch, or other supporting libraries.
Experiment Setup | Yes | "All {T} in the context prompt take a value of 7, such that all exogenous variables are drawn from Bernoulli distributions parameterized by p = 0.7. ... Reasoning was considered externally valid or internally consistent for a quantity if 90% of the 1000 estimates had RAE ≤ 0.1 (threshold chosen prior to analysis)."
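The quoted decision rule can be expressed as a short check. This is a sketch of the stated criterion only; the function and parameter names (`is_valid`, `rae_tol`, `min_frac`) are ours, and we assume a nonzero ground-truth value so the relative error is well defined.

```python
import numpy as np

def is_valid(estimates, ground_truth, rae_tol=0.1, min_frac=0.9):
    """Apply the paper's stated criterion: reasoning counts as externally
    valid / internally consistent for a quantity if at least `min_frac`
    of the estimates have relative absolute error (RAE) <= `rae_tol`.

    RAE = |estimate - ground_truth| / |ground_truth|; assumes
    ground_truth != 0.
    """
    est = np.asarray(estimates, dtype=float)
    rae = np.abs(est - ground_truth) / abs(ground_truth)
    return bool(np.mean(rae <= rae_tol) >= min_frac)

# 95% of these estimates hit the ground truth exactly, so the
# 90%-within-0.1 criterion is satisfied.
print(is_valid([1.0] * 95 + [1.5] * 5, ground_truth=1.0))  # -> True
```

With 1000 estimates per quantity, as in the paper, the same call applies unchanged; only the input arrays grow.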