Gumbel Counterfactual Generation From Language Models
Authors: Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects. |
| Researcher Affiliation | Academia | 1New York University 2ETH Zurich 3University of Copenhagen |
| Pseudocode | Yes | Algorithm 1 An algorithm that samples counterfactual strings given a factual string. |
| Open Source Code | Yes | Our code is available at https://github.com/shauli-ravfogel/lm-counterfactuals. |
| Open Datasets | Yes | We generate 500 sentences by using the first five words of randomly selected English Wikipedia sentences as prompts for the original model. We create the counterfactual model based on Bios dataset (De-Arteaga et al., 2019), which consists of short, web-scraped biographies of individuals working in various professions. |
| Dataset Splits | Yes | For each original and counterfactual model pair, we generate 500 sentences by using the first five words of randomly selected English Wikipedia sentences as prompts for the original model. We use 15,000 pairs of male and female biographies from the training set to fit the MiMiC optimal linear transformation. |
| Hardware Specification | Yes | All models are run on 8 RTX-4096 GPUs and use 32-bit floating-point precision. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers. It mentions models such as GPT2-XL and LLaMA3-8b but provides no other software details. |
| Experiment Setup | Yes | We apply MEMIT on GPT2-XL model... we focus the intervention on layer 13 of the model... a KL factor of 0.0625, a weight decay of 0.5, and calculating the loss on layer 47. We fit the intervention on layer 16 of the residual stream of the model, chosen based on preliminary experiments, which showed promising results in changing the pronouns in text continuations from male to female. |
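The counterfactual sampling idea behind Algorithm 1 rests on the Gumbel-max trick: record the Gumbel noise used when sampling from the original model, then replay that same noise through the intervened model. The sketch below illustrates this coupling for a single sampling step; the toy logit vectors and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_gumbel(size, rng):
    # Standard Gumbel noise via inverse-CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=size)
    return -np.log(-np.log(u))

def gumbel_max_step(logits, noise):
    # Gumbel-max trick: argmax(logits + g) is an exact sample from softmax(logits).
    return int(np.argmax(logits + noise))

rng = np.random.default_rng(0)
vocab_size = 5

# Toy next-token logits for the original model and a hypothetical intervened model.
factual_logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
counterfactual_logits = np.array([0.1, 2.5, 0.5, -1.0, 0.0])

noise = sample_gumbel(vocab_size, rng)
factual_token = gumbel_max_step(factual_logits, noise)
# Reusing the SAME noise couples the two samples: the counterfactual token is
# what the intervened model "would have" emitted under identical randomness.
counterfactual_token = gumbel_max_step(counterfactual_logits, noise)
```

In the full algorithm this step is repeated token by token, inferring (or recording) the noise consistent with the observed factual string before regenerating under the intervened model.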