Model Equality Testing: Which Model is this API Serving?

Authors: Irena Gao, Percy Liang, Carlos Guestrin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
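The test quoted above compares two sample sets of completions via the Maximum Mean Discrepancy (MMD) under a string kernel. As a rough illustration of the idea (not the paper's exact kernel or estimator), here is a minimal sketch using a toy character-match kernel on length-capped completions:

```python
import numpy as np

def string_kernel(a: str, b: str, L: int = 50) -> float:
    """Toy string kernel: fraction of matching characters after
    truncating/padding both strings to length L. Illustrative only --
    not necessarily the kernel used in the paper."""
    a, b = a[:L].ljust(L), b[:L].ljust(L)
    return sum(x == y for x, y in zip(a, b)) / L

def mmd_squared(X, Y, kernel=string_kernel) -> float:
    """Plug-in estimate of MMD^2 between two samples of completions.
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    kxx = np.mean([kernel(a, b) for a in X for b in X])
    kyy = np.mean([kernel(a, b) for a in Y for b in Y])
    kxy = np.mean([kernel(a, b) for a in X for b in Y])
    return float(kxx + kyy - 2 * kxy)
```

When the API and the reference weights produce the same completion distribution, the statistic concentrates near zero; a distorted endpoint (e.g., quantized weights or a watermark) shifts it upward.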
Researcher Affiliation Academia Irena Gao, Percy Liang, Carlos Guestrin EMAIL, EMAIL, EMAIL Stanford University
Pseudocode No The paper only describes the statistical testing procedure and kernel choices using mathematical equations and prose, but does not provide a clearly labeled pseudocode block or algorithm.
Open Source Code Yes To enable users to audit APIs for custom applications, we open-source a Python package. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing.
Open Datasets Yes We also encourage future research in Model Equality Testing by releasing a dataset of 1 million LLM completions from five models. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing.
Dataset Splits Yes We finetune Llama-3 8B Instruct on two datasets: a disjoint, i.i.d. split of the testing Wikipedia task, and an out-of-distribution code dataset (Chaudhary, 2023).
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments or simulations.
Software Dependencies No The paper mentions open-sourcing a Python package and using models from Hugging Face, but does not provide specific version numbers for any software dependencies.
Experiment Setup Yes All experiments in this section are run on a longform language modeling task. The prompt distribution π is a uniform distribution over m = 25 random 100-character strings sampled from English, German, Spanish, French, and Russian Wikipedia (Box 1). The maximum completion length is L = 50, and we sample using temperature 1. Power is computed from 100 Monte Carlo simulations. We estimate p-values by simulating the empirical distribution of the test statistic under the null 1000 times; in Appendix C, we validate that the permutation procedure results in the same trends. All tests are conducted at a significance level of α = 0.05. ... We use a small learning rate of 1 × 10⁻⁶ with AdamW (Loshchilov, 2017).
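The setup above estimates p-values by simulating the null distribution of the test statistic, and validates against a permutation procedure. A generic permutation test for any two-sample statistic can be sketched as follows (a simplified stand-in, not the paper's implementation; the `statistic` callable and add-one smoothing are illustrative choices):

```python
import random

def permutation_pvalue(X, Y, statistic, n_perm=1000, seed=0) -> float:
    """Estimate the p-value of a two-sample statistic by repeatedly
    shuffling the pooled samples under the null hypothesis that X and Y
    come from the same distribution."""
    rng = random.Random(seed)
    observed = statistic(X, Y)
    pooled, n = list(X) + list(Y), len(X)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count permutations whose statistic is at least as extreme.
        if statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (1 + exceed) / (1 + n_perm)
```

Rejecting when this p-value falls below α = 0.05 matches the significance level reported in the setup; power would then be the rejection rate across repeated Monte Carlo draws from a known-distorted alternative.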