Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Faithful Explanations: Boosting Rationalization with Shortcuts Discovery
Authors: Linan Yue, Qi Liu, Yichao Du, Li Wang, Weibo Gao, Yanqing An
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on real-world datasets clearly validate the effectiveness of our proposed method. Code is released at https://github.com/yuelinan/codes-of-SSR. |
| Researcher Affiliation | Academia | Linan Yue1, Qi Liu1,2, Yichao Du1, Li Wang3, Weibo Gao1, Yanqing An1 1: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2: Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China 3: ByteDance EMAIL; EMAIL |
| Pseudocode | Yes | Algorithm 1 SSR_unif: Injecting Shortcuts into Prediction. ... Algorithm 2 SSR_virt: Virtual Shortcuts Representations. ... Algorithm 3 Semantic Data Augmentation |
| Open Source Code | Yes | Code is released at https://github.com/yuelinan/codes-of-SSR. |
| Open Datasets | Yes | We evaluate SSR on text classification tasks from the ERASER benchmark (DeYoung et al., 2020), including Movies (Pang & Lee, 2004) for sentiment analysis, MultiRC (Khashabi et al., 2018) for multiple-choice QA, BoolQ (Clark et al., 2019) for reading comprehension, Evidence Inference (Lehman et al., 2019) for medical interventions, and FEVER (Thorne et al., 2018) for fact verification. Each dataset contains human-annotated rationales and classification labels. |
| Dataset Splits | Yes | In the semi-supervised setting, we implement our SSR and other semi-supervised rationalization methods with 25% labeled rationales. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running its experiments, only mentioning the use of BERT as encoder. |
| Software Dependencies | No | The paper mentions using BERT and the AdamW optimizer, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For training, we adopt the AdamW optimizer (Loshchilov & Hutter, 2019) with an initial learning rate of 2e-05; we set the batch size as 4, maximum sequence length as 512, and training epochs as 30. Besides, we set the predefined sparsity level α as {0.1, 0.2, 0.2, 0.08} for Movies, MultiRC, BoolQ, and Evidence Inference, respectively, which is slightly higher than the percentage of rationales in the input text. In the semi-supervised setting, we implement our SSR and other semi-supervised rationalization methods with 25% labeled rationales. In SSR, we set L_unif, L_virt, and λ_diff as 0.1, respectively. |
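As a reading aid, the hyperparameters quoted above can be collected into a single configuration sketch. This is illustrative only: the dictionary names (`TRAIN_CONFIG`, `SPARSITY`, `LOSS_WEIGHTS`) are not taken from the released code at https://github.com/yuelinan/codes-of-SSR; only the values come from the paper.

```python
# Training hyperparameters reported in the paper (AdamW optimizer,
# Loshchilov & Hutter, 2019). Key names here are hypothetical.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 4,
    "max_seq_length": 512,
    "epochs": 30,
}

# Predefined sparsity level alpha per dataset, set slightly above the
# fraction of human-annotated rationale tokens in the input text.
SPARSITY = {
    "Movies": 0.10,
    "MultiRC": 0.20,
    "BoolQ": 0.20,
    "EvidenceInference": 0.08,
}

# Loss weights: L_unif, L_virt, and lambda_diff are each set to 0.1.
LOSS_WEIGHTS = {"L_unif": 0.1, "L_virt": 0.1, "lambda_diff": 0.1}
```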