Enhancing NLU in Large Language Models Using Adversarial Noisy Instruction Tuning

Authors: Shengyuan Bai, Qibin Li, Zhe Wang, Nai Zhou, Nianmin Yao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach across diverse noisy instructions and semantic distortion quantification methods on multiple NLU tasks. Comprehensive empirical results demonstrate that our method consistently outperforms existing approaches across various experimental settings.
Researcher Affiliation | Academia | School of Computer Science and Technology, Dalian University of Technology; International Digital Economy Academy (IDEA); Hong Kong University of Science and Technology; Quan Cheng Laboratory
Pseudocode | Yes | Algorithm 1: Noise Response Method for adversarial training (AT)
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. It mentions using third-party models such as the OpenAI API's text-embedding-3-large but does not offer its own implementation code.
Open Datasets | Yes | We conduct experiments on four representative NLU tasks: Named Entity Recognition (NER), Relation Extraction (RE), Text Classification (TC), and Aspect-based Sentiment Analysis (ABSA). For each task, we employ two datasets: OntoNotes (Hovy et al. 2006) and CoNLL2003 (Tjong Kim Sang and De Meulder 2003) for NER; SciERC (Luan et al. 2018) and NYT (Riedel, Yao, and McCallum 2010) for RE; SST2 (Socher et al. 2013) and AGNews (Zhang, Zhao, and LeCun 2015) for TC; 14Lap and 14Rest (Xu et al. 2020) for ABSA. We collect instructions for each dataset from Alpaca (Taori et al. 2023).
Dataset Splits | Yes | We conducted a cross-evaluation wherein an LLM trained on one task was subsequently trained and tested with a limited number of samples on another task. Specifically, a model trained on the CoNLL2003 dataset was retrained and tested with few samples from the OntoNotes dataset. The results, detailed in Table 4, show the F1 scores on these tasks using LLaMA2-7B. Across various few-shot training data scenarios, the model consistently outperforms the baseline. Notably, when using only 5% of the training data, the LLM demonstrates an average performance improvement of 2.47%.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using specific LLMs (Gemma-2B, LLaMA2-7B, LLaMA3-8B) and a fine-tuning method (LoRA), but does not provide version numbers for any software dependencies such as programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | In our experiments, we employ greedy decoding for these models. To evaluate the effectiveness of ANIT on all models, we fine-tune them using LoRA (Hu et al. 2022), a parameter-efficient fine-tuning method. The hyperparameter α, which acts as the adversarial coefficient, regulates the extent of the perturbation applied during training, and λ is a hyperparameter that weights the loss. While we typically use ω values between 0.1 and 0.3, we assessed ANIT with ω ranging from 0.1 to 0.5 in 0.1 increments.
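The task/dataset pairing quoted in the Open Datasets row above can be organized as a simple lookup table. This is a convenience sketch, not an artifact from the paper; dataset names follow its citations:

```python
# Task -> dataset pairing as reported in the paper's experiments.
NLU_BENCHMARKS = {
    "NER":  ["OntoNotes", "CoNLL2003"],   # Named Entity Recognition
    "RE":   ["SciERC", "NYT"],            # Relation Extraction
    "TC":   ["SST2", "AGNews"],           # Text Classification
    "ABSA": ["14Lap", "14Rest"],          # Aspect-based Sentiment Analysis
}

# Each of the four tasks is evaluated on exactly two datasets.
assert all(len(v) == 2 for v in NLU_BENCHMARKS.values())
```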
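The cross-task protocol in the Dataset Splits row (retraining on a small fraction, e.g. 5%, of another dataset's training split) can be sketched as a reproducible subsampling step. The function name and placeholder example strings are illustrative, not from the paper:

```python
import random

def few_shot_subset(train_data, fraction, seed=0):
    """Sample a fixed fraction of training examples, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reruns select the same subset
    k = max(1, int(len(train_data) * fraction))
    return rng.sample(train_data, k)

# Hypothetical stand-ins for the source/target training splits.
target_train = [f"ontonotes_example_{i}" for i in range(1000)]

# 5% few-shot retraining data drawn from the target task.
subset = few_shot_subset(target_train, 0.05)
```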
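The Experiment Setup row describes greedy decoding, LoRA fine-tuning, and a sweep of ω from 0.1 to 0.5 in 0.1 steps. A minimal sketch of that run grid follows; the config keys and the default values for α and λ are hypothetical placeholders, since the paper does not publish its training script:

```python
def omega_grid(start=0.1, stop=0.5, step=0.1):
    """Enumerate the omega values assessed in the ablation."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

def make_run_config(omega, alpha=1.0, lam=0.5):
    # alpha/lam defaults are illustrative, not values reported in the paper.
    return {
        "decoding": "greedy",        # greedy decoding for all models
        "finetuning": "LoRA",        # parameter-efficient fine-tuning
        "adversarial_alpha": alpha,  # perturbation strength (α)
        "loss_lambda": lam,          # loss-weighting term (λ)
        "noise_omega": omega,        # noise level under evaluation (ω)
    }

configs = [make_run_config(w) for w in omega_grid()]
```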