Causal Parrots: Large Language Models May Talk Causality But Are Not Causal

Authors: Matej Zečević, Moritz Willig, Devendra Singh Dhami, Kristian Kersting

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type: Experimental
LLM Response: "Our empirical analysis provides favoring evidence that current LLMs are even weak causal parrots. Following that, we provide an empirical analysis on the causal prowess of current LLMs and discuss the results in light of our theoretical groundwork from the beginning, posing our second contribution."
Researcher Affiliation: Academia
LLM Response: "Matej Zečević, Computer Science Department, TU Darmstadt, Germany; Moritz Willig, Computer Science Department, TU Darmstadt, Germany; Devendra Singh Dhami, Computer Science Department, TU Darmstadt, Germany, and Hessian Center for AI (hessian.AI), Germany; Kristian Kersting, Computer Science Department, TU Darmstadt, Germany, Centre for Cognitive Science, TU Darmstadt, Germany, Hessian Center for AI (hessian.AI), Germany, and German Research Center for Artificial Intelligence (DFKI), Germany"
Pseudocode: Yes
LLM Response: "Algorithm 1: Decisiveness of a (causal) graph prediction"
Open Source Code: Yes
LLM Response: "For reproduction of the empirical part, our code is publicly available: https://github.com/MoritzWillig/causalParrots/"
Open Datasets: Yes
LLM Response: "The Pile data set which was used for training OPT (Zhang et al., 2022b) comprised over 6 GiB of textual data from Wikipedia (Gao et al., 2020). We consider publicly available data sets that propose a ground truth causal graph (which depicts the data generating process). We consider six data sets: altitude (A; Mooij et al. (2016)), health (H; Zečević et al. (2021)), recovery (R; Charig et al. (1986)), driving (D; synthetic), cancer (C) and earthquake (E), both (Korb & Nicholson, 2010). ... observing a high accuracy on the Tübingen cause-effect pairs data set by Mooij et al. (2016). The data set can be found at: https://webdav.tuebingen.mpg.de/cause-effect/. ConceptNet (Speer et al., 2017) is a knowledge graph combining multiple data sources, thus containing a large range of relational information."
Dataset Splits: No
LLM Response: "The paper uses pre-trained LLMs and evaluates their performance on various tasks by querying them. It describes how queries are constructed and how few-shot examples are provided, but it does not specify traditional training, validation, and test splits that would allow direct reproduction of the data partitioning in their own experimental evaluation."
Hardware Specification: Yes
LLM Response: "The results in our paper were created on one NVIDIA A100-SXM4-80GB GPU with 80 GB of RAM, and it takes 40 GPU minutes to query the OPT model."
Software Dependencies: No
LLM Response: "The paper mentions specific large language models (GPT-3, Luminous, OPT, GPT-4) and using their APIs for some experiments, but it does not provide specific software libraries or frameworks with version numbers that were used for their experimental setup or analysis."
Experiment Setup: Yes
LLM Response: "For propositional logic we consider 20 different questions, for example 'If A causes B and B causes C, does A cause C?'. These questions are simply fed as prompts to the respective LLM. In this setup no prior world knowledge is required... We use five different query wordings (or formulations) such as 'Are X and Y causally related?' or 'Does X cause Y?' (see appendix for full list). For any answer given by the LLMs we automatically classify answers starting with 'Yes' or 'No' accordingly and manually label the remaining ones. We run a nearest-neighbour prediction (k-NN with k=1) with cosine similarity."
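The two mechanical steps in the quoted setup, rule-based Yes/No answer classification and 1-nearest-neighbour prediction under cosine similarity, can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function names (`classify_answer`, `knn_predict`) and the toy two-dimensional vectors are hypothetical, standing in for whatever sentence embeddings the authors used.

```python
import numpy as np

def classify_answer(text):
    """Auto-label answers starting with Yes/No; return None for manual labelling."""
    t = text.strip().lower()
    if t.startswith("yes"):
        return "Yes"
    if t.startswith("no"):
        return "No"
    return None  # ambiguous answer -> labelled by hand in the paper's setup

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_predict(query_vec, reference_vecs, reference_labels):
    """1-NN prediction: return the label of the most cosine-similar reference."""
    sims = [cosine_similarity(query_vec, r) for r in reference_vecs]
    return reference_labels[int(np.argmax(sims))]

# Toy example with made-up 2-D "embeddings"
refs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels = ["X causes Y", "Y causes X"]
print(classify_answer("Yes, A causes C by transitivity."))      # → Yes
print(knn_predict(np.array([0.9, 0.1]), refs, labels))          # → X causes Y
```

With k=1 the prediction is simply the single closest reference, so no voting step is needed; for k>1 one would aggregate the top-k labels by majority vote.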