Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Faithful Explanations: Boosting Rationalization with Shortcuts Discovery
Authors: Linan Yue, Qi Liu, Yichao Du, Li Wang, Weibo Gao, Yanqing An
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on real-world datasets clearly validate the effectiveness of our proposed method. Code is released at https://github.com/yuelinan/codes-of-SSR. |
| Researcher Affiliation | Academia | Linan Yue1, Qi Liu1,2, Yichao Du1, Li Wang3, Weibo Gao1, Yanqing An1 1: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China 3: ByteDance EMAIL; EMAIL |
| Pseudocode | Yes | Algorithm 1 SSR_unif: Injecting Shortcuts into Prediction. ... Algorithm 2 SSR_virt: Virtual Shortcuts Representations. ... Algorithm 3 Semantic Data Augmentation |
| Open Source Code | Yes | Code is released at https://github.com/yuelinan/codes-of-SSR. |
| Open Datasets | Yes | We evaluate SSR on text classification tasks from the ERASER benchmark (DeYoung et al., 2020), including Movies (Pang & Lee, 2004) for sentiment analysis, MultiRC (Khashabi et al., 2018) for multiple-choice QA, BoolQ (Clark et al., 2019) for reading comprehension, Evidence Inference (Lehman et al., 2019) for medical interventions, and FEVER (Thorne et al., 2018) for fact verification. Each dataset contains human-annotated rationales and classification labels. |
| Dataset Splits | Yes | In the semi-supervised setting, we implement our SSR and other semi-supervised rationalization methods with 25% labeled rationales. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running its experiments, only mentioning the use of BERT as encoder. |
| Software Dependencies | No | The paper mentions using BERT and the AdamW optimizer, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For training, we adopt the AdamW optimizer (Loshchilov & Hutter, 2019) with an initial learning rate of 2e-05; we set the batch size as 4, maximum sequence length as 512, and training epochs as 30. Besides, we set the predefined sparsity level α as {0.1, 0.2, 0.2, 0.08} for Movies, MultiRC, BoolQ, and Evidence Inference, respectively, which is slightly higher than the percentage of rationales in the input text. In the semi-supervised setting, we implement our SSR and other semi-supervised rationalization methods with 25% labeled rationales. In SSR, we set L_unif, L_virt, and λ_diff as 0.1, respectively. |
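As a reading aid, the hyperparameters quoted above can be collected into a single configuration sketch. This is illustrative only: the dictionary names (`TRAIN_CONFIG`, `SPARSITY`, `LOSS_WEIGHTS`) are not taken from the released code at https://github.com/yuelinan/codes-of-SSR; only the values come from the paper.

```python
# Training hyperparameters reported in the paper (AdamW optimizer,
# Loshchilov & Hutter, 2019). Key names here are hypothetical.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 4,
    "max_seq_length": 512,
    "epochs": 30,
}

# Predefined sparsity level alpha per dataset, set slightly above the
# fraction of human-annotated rationale tokens in the input text.
SPARSITY = {
    "Movies": 0.10,
    "MultiRC": 0.20,
    "BoolQ": 0.20,
    "EvidenceInference": 0.08,
}

# Loss weights: L_unif, L_virt, and lambda_diff are each set to 0.1.
LOSS_WEIGHTS = {"L_unif": 0.1, "L_virt": 0.1, "lambda_diff": 0.1}
```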