Model Equality Testing: Which Model is this API Serving?
Authors: Irena Gao, Percy Liang, Carlos Guestrin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta. |
| Researcher Affiliation | Academia | Irena Gao, Percy Liang, Carlos Guestrin EMAIL, EMAIL, EMAIL Stanford University |
| Pseudocode | No | The paper only describes the statistical testing procedure and kernel choices using mathematical equations and prose, but does not provide a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | To enable users to audit APIs for custom applications, we open-source a Python package. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing. |
| Open Datasets | Yes | We also encourage future research in Model Equality Testing by releasing a dataset of 1 million LLM completions from five models. Package, experiment code, and dataset: https://github.com/i-gao/model-equality-testing. |
| Dataset Splits | Yes | We finetune Llama-3 8B Instruct on two datasets: a disjoint, i.i.d. split of the testing Wikipedia task, and an out-of-distribution code dataset (Chaudhary, 2023). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions open-sourcing a Python package and using models from Hugging Face, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | All experiments in this section are run on a longform language modeling task. The prompt distribution π is a uniform distribution over m = 25 random 100-character strings sampled from English, German, Spanish, French, and Russian Wikipedia (Box 1). The maximum completion length is L = 50, and we sample using temperature 1. Power is computed from 100 Monte Carlo simulations. We estimate p-values by simulating the empirical distribution of the test statistic under the null 1000 times; in Appendix C, we validate that the permutation procedure results in the same trends. All tests are conducted at a significance level of α = 0.05. ... We use a small learning rate of 1 × 10^-6 with AdamW (Loshchilov, 2017). |
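The test described above (an MMD two-sample test on completion strings, with p-values from a permutation procedure) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Hamming-style per-position kernel below is an assumed stand-in for the paper's "simple string kernel", and the function names (`hamming_kernel`, `mmd2`, `permutation_test`) are hypothetical.

```python
import random


def hamming_kernel(s, t, L=50):
    # Illustrative kernel (assumption): fraction of positions where two
    # completions agree, after padding/truncating both to length L.
    # The paper's actual string kernel may differ.
    s, t = s[:L].ljust(L), t[:L].ljust(L)
    return sum(a == b for a, b in zip(s, t)) / L


def mmd2(X, Y, k):
    # Biased (V-statistic) estimate of squared Maximum Mean Discrepancy
    # between samples X (e.g., API completions) and Y (reference completions).
    kxx = sum(k(a, b) for a in X for b in X) / len(X) ** 2
    kyy = sum(k(a, b) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(k(a, b) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy


def permutation_test(X, Y, k, n_perm=1000, seed=0):
    # Permutation p-value: under the null (both samples drawn from the same
    # distribution), relabeling the pooled completions leaves the statistic's
    # distribution unchanged, so the observed MMD should not be extreme.
    rng = random.Random(seed)
    observed = mmd2(X, Y, k)
    pooled = list(X) + list(Y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[: len(X)], pooled[len(X):], k) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

At significance level α = 0.05 (as in the paper's setup), one would reject the hypothesis that the API serves the reference distribution whenever `permutation_test(...) < 0.05`.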