Iterative Counterfactual Data Augmentation

Authors: Mitchell Plyler, Min Chi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments include six human-produced datasets and two large-language-model-generated datasets. We show training on the augmented datasets produces rationales on documents that better align with human annotation. Table 1 shows the rationale precision results on the human-generated and annotated test datasets. Table 2 shows the results on the LLM-generated datasets. Our method, ICDA, showed an improvement over all baselines."
Researcher Affiliation | Academia | "Mitchell Plyler, Min Chi. Department of Computer Science, North Carolina State University. EMAIL, EMAIL"
Pseudocode | Yes | "Algorithm 1: Iterative CDA Procedure
Require: D is a dataset with documents X and labels Y.
D' ← D
while not converged do
    S  ← train_selector(D')
    Dc ← infer_counterfactuals(D, S)
    Da ← concatenate(D, Dc)
    D' ← Da
end while"
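The loop in Algorithm 1 can be sketched as plain Python. This is only an illustration of the control flow: `train_selector` and `infer_counterfactuals` are stand-in stubs (the paper's real versions train a rationale selector and generate counterfactual documents), and the convergence test is replaced by a fixed iteration count.

```python
def train_selector(dataset):
    # Stand-in for training a rationale selector; a real implementation
    # would fit a model that extracts a one-sentence rationale per document.
    return {"trained_on": len(dataset)}

def infer_counterfactuals(dataset, selector):
    # Stand-in for counterfactual generation: the paper swaps in a rationale
    # of the opposite class; here we simply flip each example's label.
    return [{"doc": ex["doc"], "label": 1 - ex["label"]} for ex in dataset]

def icda(original, n_iters=3):
    """Iterative CDA loop following Algorithm 1 (sketch)."""
    augmented = list(original)                            # D' <- D
    for _ in range(n_iters):                              # while not converged do
        selector = train_selector(augmented)              # S  <- train_selector(D')
        counters = infer_counterfactuals(original, selector)  # Dc
        augmented = list(original) + counters             # Da <- concatenate(D, Dc)
    return augmented                                      # D' <- Da
```

Note that rebuilding `augmented` from `original` each iteration keeps the augmented set at twice the original size, matching the paper's cap of an augmented dataset equal in size to the original.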
Open Source Code | Yes | "Code: https://github.com/mlplyler/ICDA"
Open Datasets | Yes | "In this work, we adopt two common benchmark datasets for the rationale problem: RateBeer (McAuley, Leskovec, and Jurafsky 2012) and TripAdvisor (Wang, Lu, and Zhai 2010). The dataset was originally curated by Wang, Lu, and Zhai (2010), and human-labeled rationales were collected by Bao et al. (2018). We also evaluated ICDA on two LLM-generated datasets. Exact details for generating the datasets, and statistics relating to all datasets, are in Appendix Data."
Dataset Splits | No | "The annotations are strictly in the test split of the data and are not used for hyper-parameter tuning. Again, these rationales are in the test set only. Table 1 shows the rationale precision results on the human-generated and annotated test datasets."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or specific computer specifications) were found in the paper.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) were found in the paper.
Experiment Setup | Yes | "In our implementation, we train three rationale selectors by varying the random seed across the three runs. Our experiments use one sentence as the rationale. We limit our augmented dataset size to be of equal size to the original dataset. Motivated by the analysis of helpful β errors, we know that during counterfactual generation it is helpful to insert a rationale of the correct aspect even if the rationale on the original document was incorrect. We therefore limit our rationale set for counterfactual generation, A, to be sourced only from documents where we made a correct prediction, and we take the 10% of rationales, per class, where the model was most confident in its correct prediction. We always trained a model to convergence on each augmented dataset."
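The filtering rule quoted above (keep only correctly predicted documents, then take the most confident 10% per class) can be sketched as follows. The field names `pred`, `label`, and `confidence` are assumptions for this illustration, not the paper's actual data schema.

```python
from collections import defaultdict

def select_rationale_pool(examples, frac=0.10):
    """Sketch of the rationale-pool filter: per class, keep the top `frac`
    of correctly predicted examples ranked by model confidence."""
    # Keep only examples where the model's prediction matched the label.
    correct = [ex for ex in examples if ex["pred"] == ex["label"]]

    # Group the correct predictions by class.
    by_class = defaultdict(list)
    for ex in correct:
        by_class[ex["label"]].append(ex)

    # Within each class, take the most confident `frac` fraction.
    pool = []
    for items in by_class.values():
        items.sort(key=lambda ex: ex["confidence"], reverse=True)
        k = max(1, int(len(items) * frac))  # keep at least one per class
        pool.extend(items[:k])
    return pool
```

Filtering per class (rather than globally) prevents an easier class with systematically higher confidence scores from crowding the harder class out of the pool.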