Enhancing NLU in Large Language Models Using Adversarial Noisy Instruction Tuning

Authors: Shengyuan Bai, Qibin Li, Zhe Wang, Nai Zhou, Nianmin Yao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach across diverse noisy instructions and semantic distortion quantification methods on multiple NLU tasks. Comprehensive empirical results demonstrate that our method consistently outperforms existing approaches across various experimental settings.
Researcher Affiliation | Academia | School of Computer Science and Technology, Dalian University of Technology; International Digital Economy Academy (IDEA); Hong Kong University of Science and Technology; Quan Cheng Laboratory
Pseudocode | Yes | Algorithm 1: Noise Response Method for adversarial training (AT)
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. It mentions using third-party models such as the OpenAI API's text-embedding-3-large but does not offer its own implementation code.
Open Datasets | Yes | We conduct experiments on four representative NLU tasks: Named Entity Recognition (NER), Relation Extraction (RE), Text Classification (TC), and Aspect-based Sentiment Analysis (ABSA). For each task, we employ two datasets: OntoNotes (Hovy et al. 2006) and CoNLL2003 (Tjong Kim Sang and De Meulder 2003) for NER; SciERC (Luan et al. 2018) and NYT (Riedel, Yao, and McCallum 2010) for RE; SST2 (Socher et al. 2013) and AGNews (Zhang, Zhao, and LeCun 2015) for TC; 14Lap and 14Rest (Xu et al. 2020) for ABSA. We collect instructions for each dataset from Alpaca (Taori et al. 2023).
Dataset Splits | Yes | We conducted a cross-evaluation wherein an LLM trained on one task was subsequently trained and tested with a limited number of samples on another task. Specifically, a model trained on the CoNLL2003 dataset was retrained and tested with few samples from the OntoNotes dataset. The results, detailed in Table 4, show the F1 scores on these tasks using LLaMA2-7B. Across various few-shot training data scenarios, the model consistently outperforms the baseline. Notably, when using only 5% of the training data, the LLM demonstrates an average performance improvement of 2.47%.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using specific LLMs (Gemma-2B, LLaMA2-7B, LLaMA3-8B) and a fine-tuning method (LoRA), but does not provide version numbers for any software dependencies such as programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | In our experiments, we employ greedy decoding for these models. To evaluate the effectiveness of ANIT on all models, we fine-tune them using LoRA (Hu et al. 2022), a parameter-efficient fine-tuning method. The hyperparameter α, which acts as the adversarial coefficient, regulates the extent of the perturbation applied during training, and λ is a hyperparameter that weights the loss. While we typically use ω values between 0.1 and 0.3, we assessed ANIT with ω ranging from 0.1 to 0.5 in 0.1 increments.
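The task/dataset pairing quoted in the Open Datasets row above can be organized as a simple lookup table. This is a convenience sketch, not an artifact from the paper; dataset names follow its citations:

```python
# Task -> dataset pairing as reported in the paper's experiments.
NLU_BENCHMARKS = {
    "NER":  ["OntoNotes", "CoNLL2003"],   # Named Entity Recognition
    "RE":   ["SciERC", "NYT"],            # Relation Extraction
    "TC":   ["SST2", "AGNews"],           # Text Classification
    "ABSA": ["14Lap", "14Rest"],          # Aspect-based Sentiment Analysis
}

# Each of the four tasks is evaluated on exactly two datasets.
assert all(len(v) == 2 for v in NLU_BENCHMARKS.values())
```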
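The cross-task protocol in the Dataset Splits row (retraining on a small fraction, e.g. 5%, of another dataset's training split) can be sketched as a reproducible subsampling step. The function name and placeholder example strings are illustrative, not from the paper:

```python
import random

def few_shot_subset(train_data, fraction, seed=0):
    """Sample a fixed fraction of training examples, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reruns select the same subset
    k = max(1, int(len(train_data) * fraction))
    return rng.sample(train_data, k)

# Hypothetical stand-ins for the source/target training splits.
target_train = [f"ontonotes_example_{i}" for i in range(1000)]

# 5% few-shot retraining data drawn from the target task.
subset = few_shot_subset(target_train, 0.05)
```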
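The Experiment Setup row describes greedy decoding, LoRA fine-tuning, and a sweep of ω from 0.1 to 0.5 in 0.1 steps. A minimal sketch of that run grid follows; the config keys and the default values for α and λ are hypothetical placeholders, since the paper does not publish its training script:

```python
def omega_grid(start=0.1, stop=0.5, step=0.1):
    """Enumerate the omega values assessed in the ablation."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

def make_run_config(omega, alpha=1.0, lam=0.5):
    # alpha/lam defaults are illustrative, not values reported in the paper.
    return {
        "decoding": "greedy",        # greedy decoding for all models
        "finetuning": "LoRA",        # parameter-efficient fine-tuning
        "adversarial_alpha": alpha,  # perturbation strength (α)
        "loss_lambda": lam,          # loss-weighting term (λ)
        "noise_omega": omega,        # noise level under evaluation (ω)
    }

configs = [make_run_config(w) for w in omega_grid()]
```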