Auditing Prompt Caching in Language Model APIs

Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto

ICML 2025

Reproducibility (variable, result, and supporting LLM response):
Research Type: Experimental. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also leak information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
Researcher Affiliation: Academia. Stanford University. Correspondence to: Chenchen Gu <EMAIL>, Tatsunori Hashimoto <EMAIL>.
Pseudocode: No. The paper describes the audit procedures step-by-step in paragraph form (e.g., Section 3.2, "Audit Implementation Details") rather than using formal pseudocode blocks or algorithms.
Open Source Code: Yes. We release code and data at https://github.com/chenchenygu/auditing-prompt-caching.
Open Datasets: Yes. We release code and data at https://github.com/chenchenygu/auditing-prompt-caching. Our distribution P of prompts is a uniform distribution over all prompts consisting of PROMPTLENGTH English letters (lowercase and uppercase), each separated by space characters, e.g., "m x N j R".
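As an illustration only, the prompt distribution described above can be sampled with a few lines of Python. The function name `sample_prompt` is our own; the distribution (uniform over single English letters, space-separated) is as stated in the paper.

```python
import random
import string

def sample_prompt(prompt_length: int) -> str:
    """Sample a prompt from the distribution P described in the paper:
    `prompt_length` English letters (lowercase or uppercase, chosen
    uniformly at random), separated by single spaces, e.g., "m x N j R"."""
    letters = string.ascii_letters  # a-z plus A-Z
    return " ".join(random.choice(letters) for _ in range(prompt_length))

# Example: a short 5-letter prompt (the audits use PROMPTLENGTH = 5000).
print(sample_prompt(5))
```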
Dataset Splits: No. The paper describes how samples for statistical testing are collected (NUMSAMPLES = 250 for both the cache-hit and cache-miss procedures) and randomized, but it does not use traditional training/validation/test splits as would be needed for training a machine learning model. The experiment is an audit, not a model-training task.
Hardware Specification: No. The paper mentions "GPU processing time" and "different GPU models" in the context of general LLM operation, but it does not provide specific hardware details (such as GPU or CPU models) for its own experiments. It states only that the audits were run from "clients located in California."
Software Dependencies: No. We use the SciPy implementation (Virtanen et al., 2020) of the two-sample Kolmogorov-Smirnov (KS) test. While SciPy is mentioned, a specific version number for the SciPy library itself is not provided.
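The paper uses SciPy's `scipy.stats.ks_2samp` for the two-sample KS test. As an illustration of what that test measures, here is a minimal pure-Python computation of the two-sample KS statistic (the maximum absolute difference between the two empirical CDFs); SciPy additionally computes a p-value, which the audits compare against α.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the supremum of the
    absolute difference between the empirical CDFs of the two samples.
    (The paper uses SciPy's scipy.stats.ks_2samp, which also returns a
    p-value; this sketch computes only the statistic.)"""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk the merged sorted samples; at each distinct value, advance
    # past all ties in both samples, then compare the empirical CDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Identical samples give 0; fully separated samples give 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```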
Experiment Setup: Yes. The procedure uses the following configuration parameters: PROMPTLENGTH, PREFIXFRACTION, NUMVICTIMREQUESTS, and NUMSAMPLES. For our audits, we use PROMPTLENGTH = 5000 and NUMSAMPLES = 250. We use a significance level of α = 10⁻⁸. In the remaining three levels, we test for prompt caching when x and x′ have the same prefix but different suffixes by setting PREFIXFRACTION = 0.95. To determine how many victim requests are needed to detect caching, we run tests using NUMVICTIMREQUESTS ∈ {1, 5, 25} in increasing order, stopping after the first significant p-value.
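Under the configuration above, the audit's sample-collection step can be sketched as follows. This is a minimal sketch, not the authors' code: `send_and_time` is a hypothetical stand-in for an API call that returns a response time, and the pairing of cache-hit and cache-miss measurements follows the paper's description (victim requests warm the cache with a shared prefix; response-time samples for same-prefix vs. fresh prompts are then compared with a two-sample KS test at α = 10⁻⁸).

```python
import random
import string

def random_prompt(length: int) -> str:
    """Uniform random prompt: `length` letters separated by spaces."""
    return " ".join(random.choice(string.ascii_letters) for _ in range(length))

def audit(send_and_time, prompt_length=5000, prefix_fraction=0.95,
          num_victim_requests=1, num_samples=250):
    """Sketch of one audit level: collect response-time samples for
    prompts sharing a cached prefix (potential cache hits) vs. fresh
    prompts (cache misses). `send_and_time(prompt)` is a hypothetical
    stand-in that sends one API request and returns its response time
    in seconds. The two returned samples would then be compared with a
    two-sample KS test."""
    prefix_len = int(prompt_length * prefix_fraction)
    suffix_len = prompt_length - prefix_len
    hit_times, miss_times = [], []
    for _ in range(num_samples):
        # Cache-hit procedure: victim requests warm the cache with a
        # shared prefix, then the attacker sends a same-prefix prompt.
        prefix = random_prompt(prefix_len)
        for _ in range(num_victim_requests):
            send_and_time(prefix + " " + random_prompt(suffix_len))
        hit_times.append(send_and_time(prefix + " " + random_prompt(suffix_len)))
        # Cache-miss procedure: a fresh prompt with no shared prefix.
        miss_times.append(send_and_time(random_prompt(prompt_length)))
    return hit_times, miss_times
```

A dummy `send_and_time` (e.g., one that records prompts and returns a constant) makes the sketch testable without touching a real API.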