Causal Prompting: Debiasing Large Language Model Prompting Based on Front-Door Adjustment

Authors: Congzhi Zhang, Linhai Zhang, Jialong Wu, Yulan He, Deyu Zhou

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (Hotpot QA (Yang et al. 2018), Mu Si Que (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). 4.3 Main Results Table 1 shows the comparison results between causal prompting and the aforementioned baselines. Expectedly, the performance of Standard ICL, Co T, and Co T-SC improves progressively, as each subsequent method is an enhanced version of its predecessor. It not only confirms the effectiveness of integrating Co T into ICL, consistent with (Brown et al. 2020; Wei et al. 2022; Zhou et al. 2022), but also validates the efficacy of employing multiple sampling and voting strategies (Wang et al. 2022). Causal Prompting consistently delivers the best results across all metrics and datasets. Our experimental results demonstrate that Causal Prompting significantly improves performance across seven NLP tasks on both open-source and closed-source LLMs.
Researcher Affiliation Academia 1School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China 2Department of Informatics, King's College London, UK 3The Alan Turing Institute, UK
Pseudocode No The paper describes methods using equations and prose, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format.
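For context on the equations the paper relies on: the front-door adjustment named in the title is the standard causal-inference identity for a mediator $Z$ lying on the path from treatment $X$ to outcome $Y$ (symbols here are generic, not the paper's notation):

```latex
P(Y \mid \mathrm{do}(X=x)) \;=\; \sum_{z} P(Z=z \mid X=x) \sum_{x'} P(Y \mid X=x', Z=z)\, P(X=x')
```

It identifies the causal effect of $X$ on $Y$ even when an unobserved confounder links them, provided $Z$ fully mediates the effect and is itself unconfounded given $X$.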
Open Source Code No The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)).
Dataset Splits Yes We evaluate the effectiveness of our approach on three tasks: Math Reasoning (GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021)), Multi-hop Question Answering (HotpotQA (Yang et al. 2018), MuSiQue (Trivedi et al. 2022)), and Natural Language Understanding (Aspect-based Sentiment Analysis (ABSA) (Pontiki et al. 2016), Natural Language Inference (NLI) (Williams, Nangia, and Bowman 2017), and Fact Verification (FV) (Thorne et al. 2018)). For the NLU tasks, we use the original datasets (in-distribution, ID) and the corresponding adversarial datasets (out-of-distribution, OOD) (Wang et al. 2021) to verify the robustness of our method.
Hardware Specification No The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU types, or cloud resources with specifications).
Software Dependencies No The paper does not explicitly mention any specific software dependencies or their version numbers required to reproduce the experiments.
Experiment Setup No The paper discusses aspects of the method such as using 'n in-context demonstrations' and 'm distinct CoTs', the K-means clustering algorithm, and 'increasing the temperature parameter of LLMs', but it does not provide specific hyperparameters for training the encoder (e.g., learning rate, batch size, epochs, optimizer details) or other system-level training settings. Although it mentions a 'temperature in the contrastive learning' for the InfoNCE loss, no specific value is provided.
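To make concrete which hyperparameter is missing: the InfoNCE loss mentioned above divides similarity scores by a temperature before the softmax, so the loss is undefined without that value. A minimal single-anchor sketch (the `temperature=0.07` default is purely illustrative; the paper does not report its value):

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE loss for one anchor:
    -log( exp(s+/t) / (exp(s+/t) + sum_k exp(s-_k/t)) ),
    where s+ is the positive-pair similarity and s-_k are negatives.
    """
    num = math.exp(sim_pos / temperature)
    den = num + sum(math.exp(s / temperature) for s in sim_negs)
    return -math.log(num / den)

# A stronger positive similarity yields a smaller loss
loss = info_nce_loss(0.9, [0.1, 0.2, 0.3])
```

Lower temperatures sharpen the distribution and penalize hard negatives more heavily, which is why the omitted value matters for reproduction.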