Language Models are Advanced Anonymizers

Authors: Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive experimental evaluation of adversarial anonymization across 13 LLMs on real-world and synthetic online texts, comparing it against multiple baselines and industry-grade anonymizers. Our evaluation shows that adversarial anonymization outperforms current commercial anonymizers both in terms of the resulting utility and privacy. We support our findings with a human study (n=50) highlighting a strong and consistent human preference for LLM-anonymized texts.
Researcher Affiliation | Academia | Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev; Department of Computer Science, ETH Zurich. EMAIL
Pseudocode | No | The paper describes the adversarial anonymization process in Section 4 and illustrates it with diagrams in Figures 1 and 2, but it does not present a formal pseudocode block or algorithm.
Open Source Code | Yes | We provide the source code of our method and all experiment scripts in the following code repository: https://github.com/eth-sri/llm-anonymization.
Open Datasets | Yes | To compensate for this, the authors of Staab et al. (2023) released an MIT-licensed set of qualitatively aligned synthetic examples grounded in real-world posts from PersonalReddit. This set contains 525 samples... For a full overview of the SynthPAI dataset, including attribute and hardness distributions, we refer directly to Yukhymenko et al. (2024). SynthPAI is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
Dataset Splits | No | The paper mentions preprocessing PersonalReddit to 426 profiles and using a "1000 sample test set" for the MedQA dataset. However, it does not explicitly provide the training/test/validation splits (e.g., percentages or exact counts for all datasets) required to reproduce the data partitioning for the main experiments.
Hardware Specification | Yes | All runs of Yi-34B were conducted on a single H100 GPU with 80GB of VRAM. Our compute node had 26 cores and 200GB of RAM.
Software Dependencies | Yes | GPT-3.5: We use GPT-3.5 in version gpt-3.5-turbo-16k-0613, supplied by OpenAI, and set the temperature to 0.1 across all runs. GPT-4: We use GPT-4 in version gpt-4-1106-preview (also known as GPT-4-Turbo), provided by OpenAI, with the temperature likewise set to 0.1 across all runs... In particular, we finetuned a 4-bit quantized version of Llama-3.1-8B (r=256 LoRA adapters, three epochs, using unsloth (Han & Han, 2024)) as a new inference model.
Experiment Setup | Yes | Additionally, we set the temperature to 0.1 across all runs. For inference, we use the adversarial prompts introduced in Staab et al. (2023), letting GPT-4 infer personal attributes from a complete user profile in a zero-shot CoT fashion. We set the corresponding hyperparameters to lex_div=60 and ord_div=20, a setting used in prior work such as Jovanović et al. (2024). In particular, we finetuned a 4-bit quantized version of Llama-3.1-8B (r=256 LoRA adapters, three epochs, using unsloth (Han & Han, 2024)) as a new inference model.
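Since the paper describes adversarial anonymization only in prose and diagrams, a minimal sketch of the feedback loop as described may help: an adversarial model infers personal attributes from the text, and an anonymizer model rewrites the text to remove what was inferred, iterating until the adversary infers nothing or a round budget is exhausted. The LLM calls are stubbed with simple string matching here; the function names (`infer_attributes`, `anonymize`) and the round budget are illustrative assumptions, not the authors' API.

```python
def infer_attributes(text: str) -> list[str]:
    """Stub for the adversarial inference step (zero-shot CoT LLM in the paper).

    A real implementation would prompt an LLM with the adversarial prompts
    from Staab et al. (2023); here we just flag two hardcoded example tokens.
    """
    return [tok for tok in ("Zurich", "34") if tok in text]


def anonymize(text: str, attributes: list[str]) -> str:
    """Stub for the anonymizer LLM step: rewrite text to hide leaked attributes."""
    for attr in attributes:
        text = text.replace(attr, "[REDACTED]")
    return text


def adversarial_anonymization(text: str, rounds: int = 3) -> str:
    """Iterate inference and anonymization until nothing leaks or budget runs out."""
    for _ in range(rounds):
        leaked = infer_attributes(text)
        if not leaked:  # adversary infers nothing: stop early
            break
        text = anonymize(text, leaked)
    return text


print(adversarial_anonymization("I am 34 and live near Zurich."))
# -> I am [REDACTED] and live near [REDACTED].
```

The early-exit condition mirrors the paper's utility/privacy trade-off: anonymization stops as soon as the adversary can no longer infer attributes, rather than always rewriting for a fixed number of rounds.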