Can a Large Language Model be a Gaslighter?

Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three open-source LLMs into gaslighters. In contrast, we advanced three safety alignment strategies to strengthen (by 12.05%) the safety guardrail of LLMs. ... 4 EXPERIMENTS We utilized a prompt-based attack to evaluate the gaslighting harmfulness of LLMs (base, gaslighting-fine-tuned, and anti-gaslighting safety-aligned LLMs).
Researcher Affiliation | Collaboration | 1School of Computer Science, National University of Singapore, Singapore; 2AI Singapore, Singapore; 3College of Computing and Data Science, Nanyang Technological University, Singapore
Pseudocode | No | The paper describes methodologies like DeepCoG, Deep Gaslighting, and Chain-of-Gaslighting using prompt templates (e.g., 'Deep Gaslighting Prompt Template', 'Chain-of-Gaslighting Prompt Template'), but it does not present structured pseudocode or algorithm blocks with numbered steps in a dedicated section or figure.
Open Source Code | Yes | Codes and datasets are available at https://github.com/Maxwe11y/gaslightingLLM.
Open Datasets | Yes | Codes and datasets are available at https://github.com/Maxwe11y/gaslightingLLM. Researchers should use the datasets with caution and avoid unwarranted dissemination. An alternative version of the dataset is available on Hugging Face: https://huggingface.co/datasets/Maxwe11y/gaslighting.
Dataset Splits | Yes | We employ spectral clustering (Bianchi et al., 2020) to partition the 2k dataset into training, validation, and test sets. The partition is designed to ensure that the three sets have minimal overlap with each other. ... The dataset statistics are in Table 1. Table 1: The statistics of the gaslighting dataset — Training: 1752 | Validation: 124 | Test: 124 | All: 2000.
Hardware Specification | Yes | For the fine-tuning-based attack and safety alignment, we used NVIDIA RTX A40 with 48G VRAM for computation.
Software Dependencies | Yes | The experiment utilized the gpt-3.5-turbo-0125 version of ChatGPT and the gpt-4-turbo-preview version of GPT-4.
Experiment Setup | Yes | In particular, we set LoRA rank, LoRA alpha, and LoRA dropout to 8, 16, and 0.05 respectively for all LLMs. The learning rate is set to 2e-4 for SFT and 5e-7 for DPO. β is set to 0.05 for DPO. ... We set batch size and gradient accumulation step to {1, 2, 2} and {1, 2, 2} respectively for the first two strategies and the SFT stage of S3. For the DPO stage in S3, we set batch size and gradient accumulation step to 4 and 4 respectively.
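The Dataset Splits entry reports a cluster-aware partition: spectral clustering groups similar conversations, and whole clusters are assigned to train/validation/test so the sets have minimal overlap. The paper's exact procedure is not reproduced here; the sketch below is a simplified stdlib-only stand-in (the function name `split_by_cluster` and the greedy assignment are our assumptions, not the authors' code) that takes precomputed cluster labels and assigns each cluster wholesale to the split with the most remaining capacity, using ratios matching the reported 1752/124/124 split of 2000 samples.

```python
from collections import Counter

def split_by_cluster(labels, ratios=(0.876, 0.062, 0.062)):
    """Greedily assign whole clusters to train/val/test (0/1/2) so that
    similar conversations never straddle two splits. A simplified
    stand-in for the paper's spectral-clustering-based partition."""
    sizes = Counter(labels)
    n = len(labels)
    targets = [r * n for r in ratios]   # desired split sizes
    filled = [0, 0, 0]
    assignment = {}
    # Place the largest clusters first to keep splits near their targets.
    for cluster, size in sizes.most_common():
        # Pick the split with the most remaining capacity.
        split = max(range(3), key=lambda i: targets[i] - filled[i])
        assignment[cluster] = split
        filled[split] += size
    return [assignment[c] for c in labels]

cluster_labels = [i % 40 for i in range(2000)]  # toy: 40 clusters of 50
splits = split_by_cluster(cluster_labels)
print(Counter(splits))  # train split dominates, as in Table 1
```

Because assignment happens at the cluster level, the split sizes only approximate the targets; the actual paper tunes the partition so the three sets match Table 1 exactly.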
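The Experiment Setup entry scatters the reported hyperparameters across several sentences. As a reading aid, the sketch below collects them into plain Python dicts (the key names and the grouping into "sft" vs. "s3_dpo" are our assumptions; the per-model lists follow the paper's {1, 2, 2} notation, which we read as one value per open-source LLM). These dicts could feed, e.g., a PEFT `LoraConfig` and a TRL `DPOTrainer`, but no such wiring is shown here.

```python
# Reported LoRA settings, shared by all three LLMs.
LORA = {"r": 8, "lora_alpha": 16, "lora_dropout": 0.05}

STAGES = {
    # Strategies S1, S2, and the SFT stage of S3.
    # Lists hold one value per open-source LLM (our reading of {1, 2, 2}).
    "sft": {
        "learning_rate": 2e-4,
        "batch_size": [1, 2, 2],
        "grad_accum_steps": [1, 2, 2],
    },
    # DPO stage of strategy S3.
    "s3_dpo": {
        "learning_rate": 5e-7,
        "beta": 0.05,          # the β reported for DPO
        "batch_size": 4,
        "grad_accum_steps": 4,
    },
}

print(LORA, STAGES["s3_dpo"]["beta"])
```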