Causal Prompting: Debiasing Large Language Model Prompting Based on Front-Door Adjustment
Authors: Congzhi Zhang, Linhai Zhang, Jialong Wu, Yulan He, Deyu Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). 4.3 Main Results: Table 1 shows the comparison results between causal prompting and the aforementioned baselines. Expectedly, the performance of Standard ICL, CoT, and CoT-SC improves progressively, as each subsequent method is an enhanced version of its predecessor. It not only confirms the effectiveness of integrating CoT into ICL, consistent with (Brown et al. 2020; Wei et al. 2022; Zhou et al. 2022), but also validates the efficacy of employing multiple sampling and voting strategies (Wang et al. 2022). Causal Prompting consistently delivers the best results across all metrics and datasets. Our experimental results demonstrate that Causal Prompting significantly improves performance across seven NLP tasks on both open-source and closed-source LLMs. |
| Researcher Affiliation | Academia | 1) School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China; 2) Department of Informatics, King's College London, UK; 3) The Alan Turing Institute, UK |
| Pseudocode | No | The paper describes methods using equations and prose, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). |
| Dataset Splits | Yes | We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). For the NLU tasks, we use the original datasets (in-distribution, ID) and the corresponding adversarial datasets (out-of-distribution, OOD) (Wang et al. 2021) to verify the robustness of our method. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources with specifications). |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers required to reproduce the experiments. |
| Experiment Setup | No | The paper discusses aspects of the method such as using 'n in-context demonstrations', 'm distinct CoTs', the K-means clustering algorithm, and 'increasing the temperature parameter of LLMs', but it does not provide specific hyperparameters for training the encoder (e.g., learning rate, batch size, epochs, optimizer details) or other system-level training settings. Although it mentions a 'temperature in the contrastive learning' for the InfoNCE loss, no specific value is provided. |
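For context on the CoT-SC baseline the report quotes (sampling multiple reasoning chains at raised temperature and majority-voting their final answers), a minimal sketch of the voting step is below. This is not the paper's implementation; `sample_chain` is a hypothetical stand-in for an LLM call that returns one chain's final answer.

```python
# Minimal sketch of self-consistency (CoT-SC) voting: sample m chains
# of thought, extract each chain's final answer, and take the mode.
from collections import Counter

def self_consistency_vote(sample_chain, question, m=5):
    """Return the most frequent final answer across m sampled chains.

    sample_chain: callable(question) -> answer string; assumed to wrap
    a temperature-sampled LLM call (hypothetical here).
    """
    answers = [sample_chain(question) for _ in range(m)]
    return Counter(answers).most_common(1)[0][0]

# Toy deterministic sampler standing in for stochastic LLM sampling.
_canned = iter(["42", "41", "42", "42", "40"])
answer = self_consistency_vote(lambda q: next(_canned), "toy question", m=5)
print(answer)  # "42" wins 3 of 5 votes
```

The mode is taken over final answers only, not whole chains, which is what makes the vote robust to divergent intermediate reasoning.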