Compositional Causal Reasoning Evaluation in Language Models

Authors: Jacqueline R. M. A. Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V. Nori, Javier Gonzalez

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We deploy Alg. 1 to evaluate CCR in seven LM architectures, with and without chain-of-thought (CoT) prompting. Even on a simple CCR problem, our framework revealed taxonomically distinct patterns of inconsistent and invalid reasoning, ranging from II to VC."
Researcher Affiliation | Collaboration | Cornell Tech; Harvard University; Microsoft Research Cambridge. Correspondence to: Jacqueline Maasch <EMAIL>.
Pseudocode | Yes | "Section 5 suggests one possible procedure for assessing compositional consistency in causal reasoning (Alg. 1), based on compositional properties of the ATE and PNS in graphs with cutpoints."
Open Source Code | Yes | "To facilitate future work, we provide open-source code for randomly generating qualitative and quantitative CCR tasks of scalable graphical complexity." Project page: https://jmaasch.github.io/ccr/
Open Datasets | No | The paper generates tasks and samples data with the provided code rather than releasing a static, pre-existing dataset. For example, Appendix F (Automated Task Generator) states: "Generate factual and counterfactual text prompts corresponding to the generated SCM and chosen theme. Variables are assigned random human names. ... Sample observational and interventional data from the SCM."
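As a rough illustration of the sampling step quoted above, the sketch below draws observational and interventional data from a toy two-variable binary SCM. This is hypothetical code, not the authors' generator: the real task generator builds random graphs of scalable complexity and themed text prompts, whereas here the graph is a fixed X -> Y chain and all names are ours.

```python
import numpy as np

def sample_scm(n, p=0.7, do_x=None, seed=None):
    """Sample n draws from a toy binary SCM (illustrative only).

    Exogenous variables U_X, U_Y ~ Bernoulli(p). Structural equations:
    X := U_X (overridden by the intervention do(X) = do_x if given),
    Y := X AND U_Y. Returns boolean arrays (x, y).
    """
    rng = np.random.default_rng(seed)
    u_x = rng.random(n) < p
    u_y = rng.random(n) < p
    x = u_x if do_x is None else np.full(n, bool(do_x))
    y = x & u_y  # child is a deterministic function of parent + noise
    return x, y

# Observational vs. interventional samples, as in the quoted pipeline.
x_obs, y_obs = sample_scm(1000, seed=1)
x_int, y_int = sample_scm(1000, do_x=1, seed=1)
```

Under do(X) = 1 every unit receives the treatment, so differences between `y_obs` and `y_int` reflect the interventional rather than observational distribution.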
Dataset Splits | No | The paper describes generating samples and subsampling to compute PNS estimates for evaluation, rather than fixed training/validation/test splits of a static dataset: "For each cause-effect pair, we sampled 1000 sets of exogenous variable values. ... 1000 factual responses and 1000 counterfactual responses were randomly subsampled (one per set of five replicate responses). The subsample of factual and counterfactual responses was then used to compute the PNS."
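The PNS computation from subsampled paired responses can be sketched as a simple Monte Carlo estimator. This is our illustrative reading, not the authors' code: for a binary cause-effect pair, PNS = P(Y_x = 1, Y_{x'} = 0), estimated as the fraction of paired factual/counterfactual responses where the effect occurs factually but not counterfactually.

```python
import numpy as np

def estimate_pns(factual, counterfactual):
    """Estimate PNS = P(Y_x = 1, Y_{x'} = 0) from paired responses.

    Index i of each array corresponds to one subsampled pair generated
    from the same set of exogenous variable values (names are ours).
    """
    f = np.asarray(factual, dtype=bool)
    cf = np.asarray(counterfactual, dtype=bool)
    if f.shape != cf.shape:
        raise ValueError("factual/counterfactual responses must be paired")
    return float(np.mean(f & ~cf))

# Four illustrative pairs: the effect occurs factually but not
# counterfactually in 2 of 4, giving a PNS estimate of 0.5.
print(estimate_pns([1, 1, 0, 1], [0, 1, 0, 0]))  # -> 0.5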
Hardware Specification | Yes | "We used a single A100 GPU for all experiments." (from the Model Inference section)
Software Dependencies | No | The paper names the specific language models used (e.g., Llama 2, Llama 3, Phi-3-Mini, GPT-4o) and states that Hugging Face hyperparameters were used, but it does not pin versions for core software components such as Python, PyTorch, or other supporting libraries.
Experiment Setup | Yes | "All {T} in the context prompt take a value of 7, such that all exogenous variables are drawn from Bernoulli distributions parameterized by p = 0.7. ... Reasoning was considered externally valid or internally consistent for a quantity if 90% of the 1000 estimates had RAE ≤ 0.1 (threshold chosen prior to analysis)."
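The quoted decision rule can be expressed as a short check. This is a sketch of the stated criterion only; the function and parameter names (`is_valid`, `rae_tol`, `min_frac`) are ours, and we assume a nonzero ground-truth value so the relative error is well defined.

```python
import numpy as np

def is_valid(estimates, ground_truth, rae_tol=0.1, min_frac=0.9):
    """Apply the paper's stated criterion: reasoning counts as externally
    valid / internally consistent for a quantity if at least `min_frac`
    of the estimates have relative absolute error (RAE) <= `rae_tol`.

    RAE = |estimate - ground_truth| / |ground_truth|; assumes
    ground_truth != 0.
    """
    est = np.asarray(estimates, dtype=float)
    rae = np.abs(est - ground_truth) / abs(ground_truth)
    return bool(np.mean(rae <= rae_tol) >= min_frac)

# 95% of these estimates hit the ground truth exactly, so the
# 90%-within-0.1 criterion is satisfied.
print(is_valid([1.0] * 95 + [1.5] * 5, ground_truth=1.0))  # -> True
```

With 1000 estimates per quantity, as in the paper, the same call applies unchanged; only the input arrays grow.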