Model Equality Testing: Which Model is this API Serving?
Authors: Irena Gao, Percy Liang, Carlos Guestrin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta. |
| Researcher Affiliation | Academia | Irena Gao, Percy Liang, Carlos Guestrin EMAIL, EMAIL, EMAIL Stanford University |
| Pseudocode | No | The paper only describes the statistical testing procedure and kernel choices using mathematical equations and prose, but does not provide a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | To enable users to audit APIs for custom applications, we open-source a Python package. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing. |
| Open Datasets | Yes | We also encourage future research in Model Equality Testing by releasing a dataset of 1 million LLM completions from five models. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing. |
| Dataset Splits | Yes | We finetune Llama-3 8B Instruct on two datasets: a disjoint, i.i.d. split of the testing Wikipedia task, and an out-of-distribution code dataset (Chaudhary, 2023). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions open-sourcing a Python package and using models from Hugging Face, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | All experiments in this section are run on a longform language modeling task. The prompt distribution π is a uniform distribution over m = 25 random 100-character strings sampled from English, German, Spanish, French, and Russian Wikipedia (Box 1). The maximum completion length is L = 50, and we sample using temperature 1. Power is computed from 100 Monte Carlo simulations. We estimate p-values by simulating the empirical distribution of the test statistic under the null 1000 times; in Appendix C, we validate that the permutation procedure results in the same trends. All tests are conducted at a significance level of α = 0.05. ... We use a small learning rate of 1 × 10^-6 with AdamW (Loshchilov, 2017). |
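The test described above (an MMD two-sample test on completion strings, with p-values from a permutation procedure) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Hamming-style per-position kernel below is an assumed stand-in for the paper's "simple string kernel", and the function names (`hamming_kernel`, `mmd2`, `permutation_test`) are hypothetical.

```python
import random


def hamming_kernel(s, t, L=50):
    # Illustrative kernel (assumption): fraction of positions where two
    # completions agree, after padding/truncating both to length L.
    # The paper's actual string kernel may differ.
    s, t = s[:L].ljust(L), t[:L].ljust(L)
    return sum(a == b for a, b in zip(s, t)) / L


def mmd2(X, Y, k):
    # Biased (V-statistic) estimate of squared Maximum Mean Discrepancy
    # between samples X (e.g., API completions) and Y (reference completions).
    kxx = sum(k(a, b) for a in X for b in X) / len(X) ** 2
    kyy = sum(k(a, b) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(k(a, b) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy


def permutation_test(X, Y, k, n_perm=1000, seed=0):
    # Permutation p-value: under the null (both samples drawn from the same
    # distribution), relabeling the pooled completions leaves the statistic's
    # distribution unchanged, so the observed MMD should not be extreme.
    rng = random.Random(seed)
    observed = mmd2(X, Y, k)
    pooled = list(X) + list(Y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[: len(X)], pooled[len(X):], k) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

At significance level α = 0.05 (as in the paper's setup), one would reject the hypothesis that the API serves the reference distribution whenever `permutation_test(...) < 0.05`.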