Fragments to Facts: Partial-Information Fragment Inference from LLMs
Authors: Lucas Rosenblatt, Bin Han, Robert Wolfe, Bill Howe
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments in a medical summarization context, first fine-tuning an LLM for summarization on medical notes. We then reduce each note to a set of fragments (in this case, medical terms), and we simulate the efficacy of our attacks in the hands of an adversary who possesses only a small amount of information about an individual. Our experiments show that fine-tuned LLMs are vulnerable to extraction attacks under even these limited-information conditions; we observe a 9.5% TPR on Qwen-2-7B at 2% FPR using LR-Attack, and an 11.5% TPR on Llama-3-8B at 5% FPR using PRISM, for example. |
| Researcher Affiliation | Academia | 1New York University, New York, USA 2University of Washington, Seattle, USA. |
| Pseudocode | Yes | Algorithm 1: A Class of PIFI Attack Models. Input: private fragment y, public fragment set A(s) = S for an individual, target language model fθ,D, shadow and world models fθ,D′ and fθ,world, decision threshold τ. Output: {0, 1}. 1: pD ← fθ,D(y \| Prompt(S)); 2: pD′ ← fθ,D′(y \| Prompt(S)); 3: pworld ← fθ,world(y \| Prompt(S)); 4: ℓ ← INFER([pD, pD′, pworld]), where ℓ scores the likelihood that y ∈ s given s ∈ D; 5: Return 1[ℓ > τ]. |
| Open Source Code | Yes | Code for this project is available at github.com/BeanHam/fragments-to-facts/. |
| Open Datasets | Yes | We use the MTS-Dialog dataset (Abacha et al., 2023; Yim et al., 2023; Han et al., 2023), which includes 1,700 doctor-patient dialogues with corresponding summaries. ... We use legal data from the Free Law project (https://huggingface.co/spaces/free-law/New_York_CAP), filtering for sentencing / criminal possession data using the built-in Nomic topic modeling tool. |
| Dataset Splits | Yes | We filter out dialogues without any extracted entities, leaving us 948 train, 69 validation, and 312 test samples. ... Ultimately, we have 748 train, 188 validation, and 235 test samples. |
| Hardware Specification | No | The paper mentions 'high-VRAM GPUs' and the ability to fine-tune a '70B-parameter Llama model' given their 'compute constraints', but it does not specify the exact GPU models, CPU models, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as a 'Light Gradient Boosting Machine (LightGBM) model' and the 'dp-transformers library', but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We consider both models that have seen the data only once (i.e., undergone 1 epoch of fine-tuning) and models that saw the data repeatedly, until loss convergence (e.g., were fine-tuned for 10 or more epochs). ... Ten epochs, Opacus set to achieve ε under (ε, 10⁻⁵)-DP, uses the dp-transformers library (Wutschitz et al., 2022). |
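The decision structure of Algorithm 1 can be sketched as below. The INFER combiner here is a hypothetical stand-in, a simple log-likelihood ratio of the target model against the mean of the reference models; the paper's INFER may instead be a learned scorer (e.g., the LightGBM model it mentions), so this shows only the shape of the attack, not the authors' implementation.

```python
def pifi_attack(log_p_target, log_p_shadow, log_p_world, tau):
    """Minimal sketch of a PIFI attack decision (Algorithm 1).

    Inputs are log-probabilities of the private fragment y given the
    public fragment set S under the target model f_{theta,D}, a shadow
    model f_{theta,D'}, and a world model f_{theta,world}.

    INFER is approximated by comparing the target model's likelihood
    against the average of the reference models; a higher score means
    y is more plausibly tied to a record in D.
    """
    reference = (log_p_shadow + log_p_world) / 2.0
    score = log_p_target - reference
    return 1 if score > tau else 0  # 1[score > tau]
```

For example, `pifi_attack(-2.0, -5.0, -6.0, tau=1.0)` flags membership because the target model assigns the fragment far higher likelihood than the reference models do.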
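The headline results (e.g., 9.5% TPR at 2% FPR) are TPR-at-fixed-FPR numbers obtained by sweeping the decision threshold τ. A minimal, library-free sketch of that metric, following the standard membership-inference evaluation recipe (the paper's exact thresholding procedure may differ):

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """True-positive rate at a fixed false-positive rate.

    The threshold tau is chosen from the non-member score distribution
    so that at most target_fpr of non-members score above it; TPR is
    then the fraction of members scoring above tau.
    """
    neg = sorted(nonmember_scores, reverse=True)
    # Largest k such that admitting the top-k non-members keeps FPR <= target.
    k = int(target_fpr * len(neg))
    tau = neg[k] if k < len(neg) else neg[-1]
    tp = sum(1 for s in member_scores if s > tau)
    return tp / len(member_scores)
```

With attack scores in hand for held-in and held-out individuals, `tpr_at_fpr(members, nonmembers, 0.02)` reproduces the style of the 2%-FPR numbers quoted above.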