Gumbel Counterfactual Generation From Language Models

Authors: Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
Researcher Affiliation | Academia | 1. New York University, 2. ETH Zurich, 3. University of Copenhagen
Pseudocode | Yes | Algorithm 1: An algorithm that samples counterfactual strings given a factual string.
Open Source Code | Yes | Our code is available at https://github.com/shauli-ravfogel/lm-counterfactuals.
Open Datasets | Yes | We generate 500 sentences by using the first five words of randomly selected English Wikipedia sentences as prompts for the original model. We create the counterfactual model based on the Bios dataset (De-Arteaga et al., 2019), which consists of short, web-scraped biographies of individuals working in various professions.
Dataset Splits | Yes | For each original and counterfactual model pair, we generate 500 sentences by using the first five words of randomly selected English Wikipedia sentences as prompts for the original model. We use 15,000 pairs of male and female biographies from the training set to fit the MiMiC optimal linear transformation.
Hardware Specification | Yes | All models are run on 8 RTX-4096 GPUs and use 32-bit floating-point precision.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions models such as GPT2-XL and LLaMA3-8b but gives no other software details.
Experiment Setup | Yes | We apply MEMIT on the GPT2-XL model... we focus the intervention on layer 13 of the model... a KL factor of 0.0625, a weight decay of 0.5, and calculating the loss on layer 47. We fit the intervention on layer 16 of the residual stream of the model, chosen based on preliminary experiments, which showed promising results in changing the pronouns in text continuations from male to female.
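The paper's Algorithm 1 samples counterfactual strings given a factual string. A minimal sketch of the underlying Gumbel-max idea (the function and variable names here are illustrative, not from the paper): reusing the same Gumbel noise for the original and the intervened model couples the two samples, so they differ only where the intervention actually changed the next-token distribution. The paper's full algorithm additionally infers posterior Gumbel noise consistent with an observed factual string, which this sketch omits.

```python
import numpy as np

def sample_gumbel(shape, rng):
    # Standard Gumbel(0, 1) noise via inverse transform sampling.
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))

def paired_gumbel_sample(original_logits, counterfactual_logits, rng):
    """Sample one token from each model using the *same* Gumbel noise.

    Gumbel-max trick: argmax(logits + g) with g ~ Gumbel(0, 1) is an
    exact sample from softmax(logits). Sharing g across the two models
    yields a coupled (factual, counterfactual) token pair.
    """
    g = sample_gumbel(original_logits.shape, rng)
    factual = int(np.argmax(original_logits + g))
    counterfactual = int(np.argmax(counterfactual_logits + g))
    return factual, counterfactual
```

Note that when the intervention leaves the logits unchanged, the shared noise guarantees the factual and counterfactual tokens coincide, which is exactly the "minimal side effect" property the paper evaluates.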
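The MiMiC transformation mentioned under Dataset Splits is fit on paired male/female biography representations. As a hedged illustration only (toy arrays, hypothetical function names): the simplest such intervention is a translation that matches the source-class mean to the target-class mean in the residual stream; the full MiMiC method fits an optimal linear map that also matches covariances.

```python
import numpy as np

def fit_mean_shift(source_reprs, target_reprs):
    # Mean-matching variant: a constant vector that moves the source
    # class mean onto the target class mean. Inputs are (n, d) arrays
    # of hidden representations for each class.
    return target_reprs.mean(axis=0) - source_reprs.mean(axis=0)

def intervene(reprs, shift):
    # Apply the fitted shift to residual-stream representations.
    return reprs + shift
```

After this intervention the shifted source representations have exactly the target mean, while within-class variation is untouched, a design choice meant to alter the targeted concept (e.g. gendered pronouns) with minimal collateral change.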