Auditing Prompt Caching in Language Model APIs

Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto

ICML 2025

Reproducibility (variable, result, and supporting LLM response):
Research Type: Experimental. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also leak information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
Researcher Affiliation: Academia. Stanford University. Correspondence to: Chenchen Gu <EMAIL>, Tatsunori Hashimoto <EMAIL>.
Pseudocode: No. The paper describes the audit procedures step-by-step in paragraph form (e.g., Section 3.2, "Audit Implementation Details") rather than using formal pseudocode blocks or algorithms.
Open Source Code: Yes. We release code and data at https://github.com/chenchenygu/auditing-prompt-caching.
Open Datasets: Yes. We release code and data at https://github.com/chenchenygu/auditing-prompt-caching. Our distribution P of prompts is a uniform distribution over all prompts consisting of PROMPTLENGTH English letters (lowercase and uppercase), each separated by space characters, e.g., "m x N j R".
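As an illustration only, the prompt distribution described above can be sampled with a few lines of Python. The function name `sample_prompt` is our own; the distribution (uniform over single English letters, space-separated) is as stated in the paper.

```python
import random
import string

def sample_prompt(prompt_length: int) -> str:
    """Sample a prompt from the distribution P described in the paper:
    `prompt_length` English letters (lowercase or uppercase, chosen
    uniformly at random), separated by single spaces, e.g., "m x N j R"."""
    letters = string.ascii_letters  # a-z plus A-Z
    return " ".join(random.choice(letters) for _ in range(prompt_length))

# Example: a short 5-letter prompt (the audits use PROMPTLENGTH = 5000).
print(sample_prompt(5))
```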
Dataset Splits: No. The paper describes how samples for statistical testing are collected (NUMSAMPLES = 250 for both the cache-hit and cache-miss procedures) and randomized, but it does not use traditional training/validation/test splits as would be needed for training a machine learning model. The experiment is an audit, not a model-training task.
Hardware Specification: No. The paper mentions "GPU processing time" and "different GPU models" in the context of general LLM operation, but it does not provide specific hardware details (such as GPU or CPU models) for its own experiments. It states only that the audits were run from "clients located in California."
Software Dependencies: No. We use the SciPy implementation (Virtanen et al., 2020) of the two-sample Kolmogorov-Smirnov (KS) test. While SciPy is mentioned, a specific version number for the SciPy library itself is not provided.
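The paper uses SciPy's `scipy.stats.ks_2samp` for the two-sample KS test. As an illustration of what that test measures, here is a minimal pure-Python computation of the two-sample KS statistic (the maximum absolute difference between the two empirical CDFs); SciPy additionally computes a p-value, which the audits compare against α.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the supremum of the
    absolute difference between the empirical CDFs of the two samples.
    (The paper uses SciPy's scipy.stats.ks_2samp, which also returns a
    p-value; this sketch computes only the statistic.)"""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk the merged sorted samples; at each distinct value, advance
    # past all ties in both samples, then compare the empirical CDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Identical samples give 0; fully separated samples give 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```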
Experiment Setup: Yes. The procedure uses the following configuration parameters: PROMPTLENGTH, PREFIXFRACTION, NUMVICTIMREQUESTS, and NUMSAMPLES. For our audits, we use PROMPTLENGTH = 5000 and NUMSAMPLES = 250. We use a significance level of α = 10⁻⁸. In the remaining three levels, we test for prompt caching when x and x′ have the same prefix but different suffixes by setting PREFIXFRACTION = 0.95. To determine how many victim requests are needed to detect caching, we run tests using NUMVICTIMREQUESTS ∈ {1, 5, 25} in increasing order, stopping after the first significant p-value.
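Under the configuration above, the audit's sample-collection step can be sketched as follows. This is a minimal sketch, not the authors' code: `send_and_time` is a hypothetical stand-in for an API call that returns a response time, and the pairing of cache-hit and cache-miss measurements follows the paper's description (victim requests warm the cache with a shared prefix; response-time samples for same-prefix vs. fresh prompts are then compared with a two-sample KS test at α = 10⁻⁸).

```python
import random
import string

def random_prompt(length: int) -> str:
    """Uniform random prompt: `length` letters separated by spaces."""
    return " ".join(random.choice(string.ascii_letters) for _ in range(length))

def audit(send_and_time, prompt_length=5000, prefix_fraction=0.95,
          num_victim_requests=1, num_samples=250):
    """Sketch of one audit level: collect response-time samples for
    prompts sharing a cached prefix (potential cache hits) vs. fresh
    prompts (cache misses). `send_and_time(prompt)` is a hypothetical
    stand-in that sends one API request and returns its response time
    in seconds. The two returned samples would then be compared with a
    two-sample KS test."""
    prefix_len = int(prompt_length * prefix_fraction)
    suffix_len = prompt_length - prefix_len
    hit_times, miss_times = [], []
    for _ in range(num_samples):
        # Cache-hit procedure: victim requests warm the cache with a
        # shared prefix, then the attacker sends a same-prefix prompt.
        prefix = random_prompt(prefix_len)
        for _ in range(num_victim_requests):
            send_and_time(prefix + " " + random_prompt(suffix_len))
        hit_times.append(send_and_time(prefix + " " + random_prompt(suffix_len)))
        # Cache-miss procedure: a fresh prompt with no shared prefix.
        miss_times.append(send_and_time(random_prompt(prompt_length)))
    return hit_times, miss_times
```

A dummy `send_and_time` (e.g., one that records prompts and returns a constant) makes the sketch testable without touching a real API.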