Hidden No More: Attacking and Defending Private Third-Party LLM Inference

Authors: Rahul Krishna Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct our experiments on two state-of-the-art open-source LLMs, Gemma-2-2B-IT (Team et al., 2024) and Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We test on samples from the FineWeb-Edu dataset (Penedo et al., 2024). We evaluate on 1000 held-out prompts, and our results are shown in Table 1.
Researcher Affiliation | Collaboration | ¹Ritual AI, ²Stanford University, ³Columbia University. Correspondence to: Arka Pal (Project Lead) <EMAIL>.
Pseudocode | Yes | Algorithm 1: Vocabulary-Matching Attack; Algorithm 2: Cascade Single Layer Forward Pass; Algorithm 3: Generalized Vocabulary-Matching Attack; Algorithm 4: Attack on Sequence Dimension Permuted LLM Hidden States; Algorithm 5: Attack on Hidden Dimension Permuted LLM Hidden States; Algorithm 6: Attack on Factorized-2D Permuted LLM Hidden States; Algorithm 7: Comp Node_i Single Layer Pre-Pass; Algorithm 8: Attn Node_{j,k} Single Layer Attention-Pass; Algorithm 9: Comp Node_i Single Layer Post-Pass
Open Source Code | Yes | Our implementation is available at https://github.com/ritual-net/vma-external.
Open Datasets | Yes | We conduct our experiments on two state-of-the-art open-source LLMs, Gemma-2-2B-IT (Team et al., 2024) and Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We test on samples from the FineWeb-Edu dataset (Penedo et al., 2024).
Dataset Splits | Yes | For each layer of interest, we tune ϵ by performing a ternary search on a small training set of 50 prompts from FineWeb, to determine the optimal L1 threshold under which predicted tokens are accepted as matches. We evaluate on 1000 held-out prompts, and our results are shown in Table 1.
Hardware Specification | Yes | We run our experiments on Paperspace machines with 16 vCPUs and 64 GB RAM; the CPU model is Intel Xeon Gold 6226R @ 2.90 GHz. All machines are colocated in the same region, with an average bandwidth of 2 Gbps and latency of 0.38 ms.
Software Dependencies | No | We benchmark against two recent SMPC schemes for LLM inference, MPCFormer (Li et al., 2023a) and Puma (Dong et al., 2023b). For MPCFormer, we modify the CrypTen implementation to use public rather than private weights, to match our open-weights setting. Puma data is taken from Dong et al. (2023b), as it is built on SPU with its own set of optimizations. The paper also mentions the bitsandbytes library (BitsAndBytes, 2025) and Ray (Moritz et al., 2018); however, no specific version numbers are provided for these software dependencies.
Experiment Setup | Yes | For each layer of interest, we tune ϵ by performing a ternary search on a small training set of 50 prompts from FineWeb, to determine the optimal L1 threshold under which predicted tokens are accepted as matches. We evaluate on 1000 held-out prompts, and our results are shown in Table 1. Due to computational constraints, each evaluation prompt was truncated to a maximum of 50 tokens; however, small-scale experiments with prompts over 200 tokens demonstrated that our results generalize to longer prompt settings: vocab-matching still perfectly decodes hidden states into their input tokens.
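The ϵ-tuning procedure described in the Dataset Splits and Experiment Setup rows (a ternary search for the L1-distance threshold under which a predicted token is accepted as a match) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the search bounds, and the assumption that match accuracy is unimodal in ϵ are all ours.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences between two hidden-state vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def accept_match(observed, recomputed, eps):
    """Accept a candidate token when the re-computed hidden state lies
    within L1 distance eps of the observed hidden state."""
    return l1_distance(observed, recomputed) < eps


def tune_epsilon(score_fn, lo, hi, iters=60):
    """Ternary search for the eps maximizing score_fn on [lo, hi].

    score_fn would be token-match accuracy over the small tuning set
    (e.g. the 50 FineWeb prompts); ternary search assumes it is
    unimodal in eps.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if score_fn(m1) < score_fn(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)
```

For example, `tune_epsilon(lambda e: -(e - 0.3) ** 2, 0.0, 1.0)` converges to roughly 0.3 on that toy unimodal score; in the attack setting, the returned threshold would then be passed to `accept_match` on the held-out prompts.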