Can a Large Language Model be a Gaslighter?

Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three open-source LLMs into gaslighters. In contrast, we advanced three safety alignment strategies to strengthen (by 12.05%) the safety guardrail of LLMs. ... 4 EXPERIMENTS We utilized a prompt-based attack to evaluate the gaslighting harmfulness of LLMs (base, gaslighting-fine-tuned, and anti-gaslighting safety-aligned LLMs).
Researcher Affiliation | Collaboration | 1School of Computer Science, National University of Singapore, Singapore; 2AI Singapore, Singapore; 3College of Computing and Data Science, Nanyang Technological University, Singapore
Pseudocode | No | The paper describes methodologies like DeepCoG, Deep Gaslighting, and Chain-of-Gaslighting using prompt templates (e.g., 'Deep Gaslighting Prompt Template', 'Chain-of-Gaslighting Prompt Template'), but it does not present structured pseudocode or algorithm blocks with numbered steps in a dedicated section or figure.
Open Source Code | Yes | Codes and datasets are available at https://github.com/Maxwe11y/gaslightingLLM.
Open Datasets | Yes | Codes and datasets are available at https://github.com/Maxwe11y/gaslightingLLM. Researchers should use the datasets with caution and avoid unwarranted dissemination. An alternative version of the dataset is available on Hugging Face: https://huggingface.co/datasets/Maxwe11y/gaslighting.
Dataset Splits | Yes | We employ spectral clustering (Bianchi et al., 2020) to partition the 2k dataset into training, validation, and test sets. The partition is designed to ensure that the three sets have minimal overlap with each other. ... The dataset statistics are in Table 1. Table 1: The statistics of the gaslighting dataset — Training: 1752 | Validation: 124 | Test: 124 | All: 2000.
Hardware Specification | Yes | For the fine-tuning-based attack and safety alignment, we used NVIDIA RTX A40 with 48G VRAM for computation.
Software Dependencies | Yes | The experiment utilized the gpt-3.5-turbo-0125 version of ChatGPT and the gpt-4-turbo-preview version of GPT-4.
Experiment Setup | Yes | In particular, we set LoRA rank, LoRA alpha, and LoRA dropout to 8, 16, and 0.05 respectively for all LLMs. The learning rate is set to 2e-4 for SFT and 5e-7 for DPO. β is set to 0.05 for DPO. ... We set batch size and gradient accumulation step to {1, 2, 2} and {1, 2, 2} respectively for the first two strategies and the SFT stage of S3. For the DPO stage in S3, we set batch size and gradient accumulation step to 4 and 4 respectively.
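The Dataset Splits entry reports a cluster-aware partition: spectral clustering groups similar conversations, and whole clusters are assigned to train/validation/test so the sets have minimal overlap. The paper's exact procedure is not reproduced here; the sketch below is a simplified stdlib-only stand-in (the function name `split_by_cluster` and the greedy assignment are our assumptions, not the authors' code) that takes precomputed cluster labels and assigns each cluster wholesale to the split with the most remaining capacity, using ratios matching the reported 1752/124/124 split of 2000 samples.

```python
from collections import Counter

def split_by_cluster(labels, ratios=(0.876, 0.062, 0.062)):
    """Greedily assign whole clusters to train/val/test (0/1/2) so that
    similar conversations never straddle two splits. A simplified
    stand-in for the paper's spectral-clustering-based partition."""
    sizes = Counter(labels)
    n = len(labels)
    targets = [r * n for r in ratios]   # desired split sizes
    filled = [0, 0, 0]
    assignment = {}
    # Place the largest clusters first to keep splits near their targets.
    for cluster, size in sizes.most_common():
        # Pick the split with the most remaining capacity.
        split = max(range(3), key=lambda i: targets[i] - filled[i])
        assignment[cluster] = split
        filled[split] += size
    return [assignment[c] for c in labels]

cluster_labels = [i % 40 for i in range(2000)]  # toy: 40 clusters of 50
splits = split_by_cluster(cluster_labels)
print(Counter(splits))  # train split dominates, as in Table 1
```

Because assignment happens at the cluster level, the split sizes only approximate the targets; the actual paper tunes the partition so the three sets match Table 1 exactly.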
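The Experiment Setup entry scatters the reported hyperparameters across several sentences. As a reading aid, the sketch below collects them into plain Python dicts (the key names and the grouping into "sft" vs. "s3_dpo" are our assumptions; the per-model lists follow the paper's {1, 2, 2} notation, which we read as one value per open-source LLM). These dicts could feed, e.g., a PEFT `LoraConfig` and a TRL `DPOTrainer`, but no such wiring is shown here.

```python
# Reported LoRA settings, shared by all three LLMs.
LORA = {"r": 8, "lora_alpha": 16, "lora_dropout": 0.05}

STAGES = {
    # Strategies S1, S2, and the SFT stage of S3.
    # Lists hold one value per open-source LLM (our reading of {1, 2, 2}).
    "sft": {
        "learning_rate": 2e-4,
        "batch_size": [1, 2, 2],
        "grad_accum_steps": [1, 2, 2],
    },
    # DPO stage of strategy S3.
    "s3_dpo": {
        "learning_rate": 5e-7,
        "beta": 0.05,          # the β reported for DPO
        "batch_size": 4,
        "grad_accum_steps": 4,
    },
}

print(LORA, STAGES["s3_dpo"]["beta"])
```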