reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Differentially Private Steering for Large Language Model Alignment

Authors: Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Lla Ma, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing.
Researcher Affiliation	Academia	Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI) Technical University of Darmstadt www.ukp.tu-darmstadt.de Max Planck Institute for Intelligent Systems, Tübingen, Germany EMAIL Department of Computer Science, University of Copenhagen, Denmark EMAIL
Pseudocode	Yes	Algorithm 1 Generating private steering vectors Input: A set of selected layers S, private demonstrations Dpriv = {(pi, c+ i , c i )}n i=1, and privacy parameters ε, δ. For l S, last-token activation extraction function hl and constant threshold Cl. Algorithm 2 Privately steered generation Input: A set of selected layers S, private steering vectors vpriv l for selected layers S, and activations of the user query ht,l for each token t [T] and for all layers l [L]. Algorithm 3 Membership Inference Attack with Canaries Require: Set of canary tokens S, MIA threshold τ, the language model under attack M
Open Source Code	Yes	Our code is available at https://github.com/UKPLab/iclr2025-psa/
Open Datasets	Yes	We use the evaluation benchmark datasets proposed in Anthropic s Advanced AI Risk humanwritten evaluation (Perez et al., 2023) and curated by Rimsky et al. (2024). These datasets cover several LLM alignment relevant behaviors with multiple choice questions with two answer options one that demonstrates the behavior of interest (c+) and the opposite (c ).
Dataset Splits	No	The paper uses
Hardware Specification	Yes	All experiments were conducted on a single NVIDIA A100 80GB GPU.
Software Dependencies	No	The paper mentions specific LLM models (e.g., Llama-2 (7B), Mistral-v0.3 (7B), Gemma-2 (2B), Qwen-2.5 (7B)) and GPT-4 as an evaluator, but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We use the chat template specific to each model for all our experiments. The noisy vectors are generated by adding Gaussian noise with 0.02 standard deviation. We fix δ = 1 / 5n. We evaluate all models on positive behavioral steering (λ = 1). We choose τ = 40 for Llama-2 and τ = 70 for Qwen-2.5