Are Large Language Models Really Robust to Word-Level Perturbations?
Authors: Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, Dacheng Tao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical experiments demonstrate that TREvaL provides an identification for the lack of robustness of nowadays LLMs. Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, calling for more attention on the robustness during the alignment process. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. |
| Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 1 shows a workflow), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "We utilize Beavertail's (Ji et al., 2023) open-source Reward Model, Cost Model and ArmoRM-LLaMA3-8B-v0.1 reward model as referees in this exploration." This refers to using third-party open-source models, not the authors' own implementation code for the TREvaL pipeline. There is no explicit statement about releasing their own code or a link to a code repository for the methodology described in the paper. |
| Open Datasets | Yes | We select 1k open questions from Natural Questions datasets (Kwiatkowski et al., 2019)... To this end, we select a subset of 1,000 prompts from the Natural Questions Dataset (Kwiatkowski et al., 2019) and Alpagasus Dataset (Chen et al., 2023b) as open questions... PKU-RLHF (Dai et al., 2023) is a dataset composed of several toxic questions and answers. |
| Dataset Splits | Yes | We filter the questions that have ground truth labels and then select 1k prompts from a 5.6k set to best leverage the generative capabilities of LLMs... Perturbation Level: We employ three levels of perturbation, with a higher level conducting more substantial perturbations to the sentence. Specifically, level 1, level 2, and level 3 perturb 10%, 20%, and 33% of the sentence, respectively... We sample a 40-prompt validation set and use GPT-4 (Cha, 2023) to judge if any perturbation type or level conducted in this study leads to serious semantic deviation... Additionally, we supplement two extra perturbation levels compared to the levels in the main text. Specifically, compared to the three levels (10%, 20%, 33%) of perturbation used in the main text, we also introduced 15% and 25% perturbations to comprehensively analyze the robustness of the model. We name the 15% perturbation of the input prompt Level 1.5, because its range of disturbance lies between Level 1 and Level 2. The 25% perturbation is named Level 2.5. |
| Hardware Specification | No | The paper discusses various LLMs and their parameter sizes (e.g., LLaMA-7B, LLaMA2-chat-70B) and mentions training stages (Pre-trained, SFT, RLHF), but does not specify the hardware (e.g., specific GPU models, CPUs) used to run the experiments and evaluations described in the paper. |
| Software Dependencies | No | The paper mentions several models and tools by name, such as "BERT", "GPT", "GPT-4", "Beaver-7B Reward Model", "Beaver-7B Cost Model", and "ArmoRM-LLaMA3-8B-v0.1 reward model". However, it does not provide specific version numbers for any of these software components or any other ancillary libraries or programming languages used in the implementation of the proposed TREvaL framework. |
| Experiment Setup | Yes | Table 2 ("Metrics of the experiments, including the detailed information and settings of the experiments") lists the settings and parameters. LLMs: LLaMA/2/2-chat, Alpaca, Beaver (7B) / LLaMA2-chat (13B, 70B); Prompt Format: "BEGINNING OF CONVERSATION: USER: PROMPTS ASSISTANT:"; Dataset: Selected Natural Questions Dataset / Alpagasus Dataset; Perturbation Level: Level 1/2/3; Perturbation Type: Misspelling, Swapping, Synonym. |
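Since the review notes that the authors' own TREvaL code is not released, the perturbation scheme above (a level-determined fraction of words modified, here via the misspelling type) can only be sketched. The following is a minimal, hypothetical illustration, not the paper's implementation; the function names and the adjacent-character-swap typo model are assumptions.

```python
import random

# Perturbation "levels" from the paper, mapped to the fraction of words modified:
# Level 1 = 10%, Level 1.5 = 15%, Level 2 = 20%, Level 2.5 = 25%, Level 3 = 33%.
LEVELS = {1: 0.10, 1.5: 0.15, 2: 0.20, 2.5: 0.25, 3: 0.33}


def misspell(word: str, rng: random.Random) -> str:
    """Introduce a simple typo by swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb_prompt(prompt: str, level: float, rng: random.Random) -> str:
    """Misspell a level-determined fraction of the words in `prompt`."""
    words = prompt.split()
    n = max(1, round(LEVELS[level] * len(words)))
    for i in rng.sample(range(len(words)), n):  # pick n distinct word positions
        words[i] = misspell(words[i], rng)
    return " ".join(words)


rng = random.Random(0)
print(perturb_prompt("where does the energy in a hurricane come from", 3, rng))
```

The other two perturbation types named in Table 2 (swapping, synonym replacement) would follow the same pattern, replacing `misspell` with a word-swap or a synonym lookup.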