Independence Tests for Language Models
Authors: Sally Zhu, Ahmed M Ahmed, Rohith Kuditipudi, Percy Liang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test which matches hidden activations between two models, which is robust to these transformations and to changes in model architecture and can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find it reliably distinguishes non-independent models like a p-value. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University, Stanford, US. Correspondence to: Sally Zhu <EMAIL>, Ahmed Ahmed <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Test for computing p-values (PERMTEST). Input: model weights θ1, θ2. Parameters: test statistic ϕ; discrete transformation class Π; permutation count T. Output: p-value ˆp ∈ (0, 1]. Algorithm 2: Cosine similarity matching (MATCH). Input: matrices W1, W2 with h rows. Output: permutation π: [h] → [h]. Algorithm 3: Deriving p-values from Spearman correlation (SPEARMAN). Input: permutations π1, π2: [h] → [h]. Output: p-value ˆp ∈ (0, 1]. Algorithm 4: Aggregating p-values (FISHER). Input: p-values ˆp(1), …, ˆp(L). Output: p-value ˆp ∈ (0, 1]. Algorithm 5: Generalized robust test. Input: model parameters θ1, θ2 ∈ Θ. Parameters: distribution P over ℝ^d. Output: ˆp ∈ [0, 1]. |
| Open Source Code | Yes | We share code at https://github.com/ahmeda14960/model-tracing. |
| Open Datasets | Yes | We computed ϕJSD using input sequences sampled from WikiText-103 (Merity et al., 2017; Xu et al., 2024) (consistent with prior work). We trained a second model with independently chosen initialization and data ordering. Specifically, we ensure that our test does not incorrectly detect two similar (trained using the same learning algorithm) but independent (randomly initialized) models as non-independent. To verify this, we randomly initialized a model with the OLMo (7B) architecture (Groeneveld et al., 2024) and trained it on the Dolma v1.7 dataset (Soldaini et al., 2024). |
| Dataset Splits | No | The paper primarily analyzes pre-trained models. For the OLMo models, it states: "We keep checkpoints for both seeds after 100M, 1B, 10B, and 18B train tokens", which refers to training checkpoints rather than explicit train/test/validation splits for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We reinitialize the first GLU MLP module of a model θ1 with an MLP with double the width, and using Algorithm 5 (generalized robust test), we train ˆθ1 with random Gaussians as the training distribution P. We retrain each of the 32 MLPs (keeping other layers fixed) of vicuna-7b-v1.5 (a finetune of Llama-2-7b-hf) for 10k gradient steps (until the loss curve plateaus). (Additional hyperparameters and a learning curve are in Appendix F.2.) We train for 10,000 gradient steps using MSE loss and an Adam optimizer with a learning rate of 0.001 and a batch size of 5000. |
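The PERMTEST procedure summarized in the pseudocode row can be illustrated with a minimal sketch: the p-value is the rank of the observed test statistic among statistics recomputed after applying random transformations from the class Π to the second model. All names here (`perm_test`, `phi`, `transform`) are ours, and `transform` stands in for drawing one random element of Π; the paper's actual statistics and transformation class differ.

```python
import numpy as np

def perm_test(theta1, theta2, phi, transform, T=99, seed=0):
    """Sketch of a permutation test in the spirit of PERMTEST
    (Algorithm 1): rank the observed test statistic among statistics
    recomputed after T random transformations of the second model."""
    rng = np.random.default_rng(seed)
    observed = phi(theta1, theta2)
    null = [phi(theta1, transform(theta2, rng)) for _ in range(T)]
    # the +1 in numerator and denominator keeps p-hat strictly in (0, 1]
    return (1 + sum(s <= observed for s in null)) / (T + 1)
```

With a statistic that is small for dependent models (e.g. the negative correlation of flattened weights) and `transform` drawing a random permutation, a non-independent pair yields a p-value near the floor 1/(T+1).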
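Algorithms 2-4 chain together: match hidden units of two weight matrices by cosine similarity, score the agreement of the resulting permutations with a Spearman rank test, and aggregate per-layer p-values with Fisher's method. The sketch below is our own rendering under stated assumptions: matching is realized as an optimal assignment over the cosine-similarity matrix, which may differ from the paper's exact matching rule, and the tiny dimensions are for illustration only.

```python
import numpy as np
from scipy import stats
from scipy.optimize import linear_sum_assignment

def match(W1, W2):
    """Sketch of MATCH (Algorithm 2): pair up rows (hidden units) of
    W1 and W2 by maximizing total cosine similarity."""
    A = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    B = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    _, perm = linear_sum_assignment(-A @ B.T)  # negate to maximize
    return perm

def spearman_pvalue(pi1, pi2):
    """Sketch of SPEARMAN (Algorithm 3): p-value from the rank
    correlation between two unit-matching permutations."""
    return float(stats.spearmanr(pi1, pi2).pvalue)

def fisher(pvals):
    """Sketch of FISHER (Algorithm 4): combine L per-layer p-values
    via Fisher's method (chi-squared with 2L degrees of freedom)."""
    stat = -2.0 * float(np.sum(np.log(pvals)))
    return float(stats.chi2.sf(stat, df=2 * len(pvals)))
```

For two non-independent models, the permutations recovered by `match` at corresponding layers tend to agree, driving the per-layer Spearman p-values, and hence the Fisher-combined p-value, toward zero.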
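The ϕJSD statistic mentioned in the datasets row compares the two models' next-token distributions on the same input sequences. A rough sketch, assuming `logits1` and `logits2` are (positions, vocab) logit arrays from each model on shared WikiText-103 inputs; the paper's exact normalization and aggregation may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def phi_jsd(logits1, logits2):
    """Rough sketch of a phi_JSD-style statistic: mean Jensen-Shannon
    divergence between two models' next-token distributions at the
    same positions. Lower values mean more similar predictions."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(logits1), softmax(logits2)
    # scipy returns the JS *distance* (sqrt of the divergence); square it
    return float(np.mean(jensenshannon(p, q, axis=1) ** 2))
```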
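The experiment-setup row retrains MLP blocks to imitate a frozen target module on random Gaussian inputs (the training distribution P of Algorithm 5), with MSE loss and Adam. As a purely illustrative stand-in, the toy below fits a single linear map with a hand-rolled Adam loop; the paper retrains full GLU MLPs (lr 0.001, batch 5000, 10k steps) in a deep-learning framework, and the smaller numbers here just keep the toy fast.

```python
import numpy as np

def retrain_linear(W_target, steps=2000, lr=0.01, batch=256, seed=0):
    """Toy stand-in for the retraining step: fit a freshly initialized
    weight matrix to reproduce a frozen target module's outputs on
    Gaussian inputs, using MSE loss and Adam."""
    rng = np.random.default_rng(seed)
    d_in, d_out = W_target.shape
    W = 0.01 * rng.standard_normal((d_in, d_out))   # fresh initialization
    m, v = np.zeros_like(W), np.zeros_like(W)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        X = rng.standard_normal((batch, d_in))      # x ~ P = N(0, I)
        err = X @ W - X @ W_target                  # prediction error
        g = X.T @ err / batch                       # MSE gradient (up to a constant)
        m = b1 * m + (1 - b1) * g                   # Adam first moment
        v = b2 * v + (1 - b2) * g * g               # Adam second moment
        W -= lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return W
```

After training, the learned weights closely reproduce the target map on the Gaussian input distribution, mirroring how the retrained MLPs are trained until the loss curve plateaus.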