Causal Prompting: Debiasing Large Language Model Prompting Based on Front-Door Adjustment

Authors: Congzhi Zhang, Linhai Zhang, Jialong Wu, Yulan He, Deyu Zhou

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (Hotpot QA (Yang et al. 2018), Mu Si Que (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). 4.3 Main Results Table 1 shows the comparison results between causal prompting and the aforementioned baselines. Expectedly, the performance of Standard ICL, Co T, and Co T-SC improves progressively, as each subsequent method is an enhanced version of its predecessor. It not only confirms the effectiveness of integrating Co T into ICL, consistent with (Brown et al. 2020; Wei et al. 2022; Zhou et al. 2022), but also validates the efficacy of employing multiple sampling and voting strategies (Wang et al. 2022). Causal Prompting consistently delivers the best results across all metrics and datasets. Our experimental results demonstrate that Causal Prompting significantly improves performance across seven NLP tasks on both open-source and closed-source LLMs.
Researcher Affiliation Academia 1School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China 2Department of Informatics, King's College London, UK 3The Alan Turing Institute, UK
Pseudocode No The paper describes methods using equations and prose, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format.
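For context on the equations the paper relies on: the front-door adjustment named in the title is the standard causal-inference identity for a mediator $Z$ lying on the path from treatment $X$ to outcome $Y$ (symbols here are generic, not the paper's notation):

```latex
P(Y \mid \mathrm{do}(X=x)) \;=\; \sum_{z} P(Z=z \mid X=x) \sum_{x'} P(Y \mid X=x', Z=z)\, P(X=x')
```

It identifies the causal effect of $X$ on $Y$ even when an unobserved confounder links them, provided $Z$ fully mediates the effect and is itself unconfounded given $X$.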
Open Source Code No The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)).
Dataset Splits Yes We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). For the NLU tasks, we use the original datasets (in-distribution, ID) and the corresponding adversarial datasets (out-of-distribution, OOD) (Wang et al. 2021) to verify the robustness of our method.
Hardware Specification No The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources with specifications).
Software Dependencies No The paper does not explicitly mention any specific software dependencies or their version numbers required to reproduce the experiments.
Experiment Setup No The paper discusses aspects of the method such as using 'n in-context demonstrations' and 'm distinct CoTs', the K-means clustering algorithm, and 'increasing the temperature parameter of LLMs', but it does not provide specific hyperparameters for training the encoder (e.g., learning rate, batch size, epochs, optimizer details) or other system-level training settings. Although it mentions a 'temperature in the contrastive learning' for the InfoNCE loss, no specific value is provided.
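To make concrete which hyperparameter is missing: the InfoNCE loss mentioned above divides similarity scores by a temperature before the softmax, so the loss is undefined without that value. A minimal single-anchor sketch (the `temperature=0.07` default is purely illustrative; the paper does not report its value):

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE loss for one anchor:
    -log( exp(s+/t) / (exp(s+/t) + sum_k exp(s-_k/t)) ),
    where s+ is the positive-pair similarity and s-_k are negatives.
    """
    num = math.exp(sim_pos / temperature)
    den = num + sum(math.exp(s / temperature) for s in sim_negs)
    return -math.log(num / den)

# A stronger positive similarity yields a smaller loss
loss = info_nce_loss(0.9, [0.1, 0.2, 0.3])
```

Lower temperatures sharpen the distribution and penalize hard negatives more heavily, which is why the omitted value matters for reproduction.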