Differentially Private Steering for Large Language Model Alignment

Authors: Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Lla Ma, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing.
Researcher Affiliation Academia Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI) Technical University of Darmstadt www.ukp.tu-darmstadt.de Max Planck Institute for Intelligent Systems, Tübingen, Germany EMAIL Department of Computer Science, University of Copenhagen, Denmark EMAIL
Pseudocode Yes Algorithm 1 Generating private steering vectors Input: A set of selected layers S, private demonstrations Dpriv = {(pi, c+ i , c i )}n i=1, and privacy parameters ε, δ. For l S, last-token activation extraction function hl and constant threshold Cl. Algorithm 2 Privately steered generation Input: A set of selected layers S, private steering vectors vpriv l for selected layers S, and activations of the user query ht,l for each token t [T] and for all layers l [L]. Algorithm 3 Membership Inference Attack with Canaries Require: Set of canary tokens S, MIA threshold τ, the language model under attack M
Open Source Code Yes Our code is available at https://github.com/UKPLab/iclr2025-psa/
Open Datasets Yes We use the evaluation benchmark datasets proposed in Anthropic s Advanced AI Risk humanwritten evaluation (Perez et al., 2023) and curated by Rimsky et al. (2024). These datasets cover several LLM alignment relevant behaviors with multiple choice questions with two answer options one that demonstrates the behavior of interest (c+) and the opposite (c ).
Dataset Splits No The paper uses
Hardware Specification Yes All experiments were conducted on a single NVIDIA A100 80GB GPU.
Software Dependencies No The paper mentions specific LLM models (e.g., Llama-2 (7B), Mistral-v0.3 (7B), Gemma-2 (2B), Qwen-2.5 (7B)) and GPT-4 as an evaluator, but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We use the chat template specific to each model for all our experiments. The noisy vectors are generated by adding Gaussian noise with 0.02 standard deviation. We fix δ = 1 / 5n. We evaluate all models on positive behavioral steering (λ = 1). We choose τ = 40 for Llama-2 and τ = 70 for Qwen-2.5