Iterative Counterfactual Data Augmentation

Authors: Mitchell Plyler, Min Chi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments include six human-produced datasets and two large-language-model-generated datasets. We show training on the augmented datasets produces rationales on documents that better align with human annotation. Table 1 shows the rationale precision results on the human-generated and annotated test datasets. Table 2 shows the results on the LLM-generated datasets. Our method, ICDA, showed an improvement over all baselines."
Researcher Affiliation | Academia | "Mitchell Plyler, Min Chi. Department of Computer Science, North Carolina State University. EMAIL, EMAIL"
Pseudocode | Yes | "Algorithm 1: Iterative CDA Procedure
Require: D is a dataset with documents X and labels Y.
D' ← D
while not converged do
    S  ← train_selector(D')
    Dc ← infer_counterfactuals(D, S)
    Da ← concatenate(D, Dc)
    D' ← Da
end while"
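The loop in Algorithm 1 can be sketched as plain Python. This is only an illustration of the control flow: `train_selector` and `infer_counterfactuals` are stand-in stubs (the paper's real versions train a rationale selector and generate counterfactual documents), and the convergence test is replaced by a fixed iteration count.

```python
def train_selector(dataset):
    # Stand-in for training a rationale selector; a real implementation
    # would fit a model that extracts a one-sentence rationale per document.
    return {"trained_on": len(dataset)}

def infer_counterfactuals(dataset, selector):
    # Stand-in for counterfactual generation: the paper swaps in a rationale
    # of the opposite class; here we simply flip each example's label.
    return [{"doc": ex["doc"], "label": 1 - ex["label"]} for ex in dataset]

def icda(original, n_iters=3):
    """Iterative CDA loop following Algorithm 1 (sketch)."""
    augmented = list(original)                            # D' <- D
    for _ in range(n_iters):                              # while not converged do
        selector = train_selector(augmented)              # S  <- train_selector(D')
        counters = infer_counterfactuals(original, selector)  # Dc
        augmented = list(original) + counters             # Da <- concatenate(D, Dc)
    return augmented                                      # D' <- Da
```

Note that rebuilding `augmented` from `original` each iteration keeps the augmented set at twice the original size, matching the paper's cap of an augmented dataset equal in size to the original.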
Open Source Code | Yes | "Code: https://github.com/mlplyler/ICDA"
Open Datasets | Yes | "In this work, we adopt two common benchmark datasets for the rationale problem: RateBeer (McAuley, Leskovec, and Jurafsky 2012) and TripAdvisor (Wang, Lu, and Zhai 2010). The dataset was originally curated by Wang, Lu, and Zhai (2010), and human-labeled rationales were collected by Bao et al. (2018). We also evaluated ICDA on two LLM-generated datasets. Exact details for generating the datasets, and statistics relating to all datasets, are in Appendix Data."
Dataset Splits | No | "The annotations are strictly in the test split of the data and are not used for hyper-parameter tuning. Again, these rationales are in the test set only. Table 1 shows the rationale precision results on the human-generated and annotated test datasets."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or specific computer specifications) were found in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were found in the paper.
Experiment Setup | Yes | "In our implementation, we train three rationale selectors by varying the random seed across the three runs. Our experiments use one sentence as the rationale. We limit our augmented dataset size to be of equal size to the original dataset. Motivated by the analysis of helpful β errors, we know that during counterfactual generation it is helpful to insert a rationale of the correct aspect even if the rationale on the original document was incorrect. We therefore limit our rationale set for counterfactual generation, A, to be sourced only from documents where we made a correct prediction, and we take the 10% of rationales, per class, where the model was most confident in its correct prediction. We always trained a model to convergence on each augmented dataset."
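The filtering rule quoted above (keep only correctly predicted documents, then take the most confident 10% per class) can be sketched as follows. The field names `pred`, `label`, and `confidence` are assumptions for this illustration, not the paper's actual data schema.

```python
from collections import defaultdict

def select_rationale_pool(examples, frac=0.10):
    """Sketch of the rationale-pool filter: per class, keep the top `frac`
    of correctly predicted examples ranked by model confidence."""
    # Keep only examples where the model's prediction matched the label.
    correct = [ex for ex in examples if ex["pred"] == ex["label"]]

    # Group the correct predictions by class.
    by_class = defaultdict(list)
    for ex in correct:
        by_class[ex["label"]].append(ex)

    # Within each class, take the most confident `frac` fraction.
    pool = []
    for items in by_class.values():
        items.sort(key=lambda ex: ex["confidence"], reverse=True)
        k = max(1, int(len(items) * frac))  # keep at least one per class
        pool.extend(items[:k])
    return pool
```

Filtering per class (rather than globally) prevents an easier class with systematically higher confidence scores from crowding the harder class out of the pool.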