Independence Tests for Language Models

Authors: Sally Zhu, Ahmed M Ahmed, Rohith Kuditipudi, Percy Liang

ICML 2025

Reproducibility checklist (variable, result, LLM response):
Research Type: Experimental. "We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test that matches hidden activations between two models; it is robust to these transformations and to changes in model architecture, and can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find it reliably distinguishes non-independent models like a p-value."
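The exact p-values above come from a permutation test: compare the observed test statistic against its distribution under random transformations of one model's weights. A minimal numpy sketch of this idea, with a hypothetical statistic (negative absolute correlation of flattened weights) and a hypothetical transformation class (row permutations) standing in for the paper's φ and Π:

```python
import numpy as np

def perm_test_pvalue(phi, w1, w2, permute, num_perms=99, seed=0):
    """Permutation test: p-value in (0, 1] for the null hypothesis that
    w1 and w2 are independent, given a statistic phi (smaller = more
    dependent-looking) and a random transformation permute."""
    rng = np.random.default_rng(seed)
    observed = phi(w1, w2)
    # Count transformed statistics at least as extreme as the observed one.
    count = sum(phi(w1, permute(w2, rng)) <= observed for _ in range(num_perms))
    return (1 + count) / (1 + num_perms)

# Toy choices (not the paper's): correlation statistic, row-permutation class.
phi = lambda a, b: -abs(np.corrcoef(a.ravel(), b.ravel())[0, 1])
permute = lambda w, rng: w[rng.permutation(w.shape[0])]

rng = np.random.default_rng(1)
base = rng.normal(size=(64, 16))
related = base + 0.01 * rng.normal(size=base.shape)  # a "fine-tuned" copy
independent = rng.normal(size=(64, 16))              # independently initialized

p_related = perm_test_pvalue(phi, base, related, permute)
p_independent = perm_test_pvalue(phi, base, independent, permute)
print(p_related, p_independent)
```

With these toy choices the related pair yields a small p-value while the independent pair does not; the paper's tests instantiate φ and Π so that the resulting p-value is exact under the null.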
Researcher Affiliation: Academia. "Department of Computer Science, Stanford University, Stanford, US. Correspondence to: Sally Zhu <EMAIL>, Ahmed Ahmed <EMAIL>."
Pseudocode: Yes. The paper presents five algorithms:
- Algorithm 1 (PERMTEST): test for computing p-values. Input: model weights θ1, θ2. Parameters: test statistic φ; discrete transformation class Π; permutation count T. Output: p-value p̂ ∈ (0, 1].
- Algorithm 2 (MATCH): cosine similarity matching. Input: matrices W1, W2 with h rows. Output: permutation π : [h] → [h].
- Algorithm 3 (SPEARMAN): deriving p-values from Spearman correlation. Input: permutations π1, π2 : [h] → [h]. Output: p-value p̂ ∈ (0, 1].
- Algorithm 4 (FISHER): aggregating p-values. Input: p-values {p̂(i)}, i = 1, ..., L. Output: p-value p̂ ∈ (0, 1].
- Algorithm 5: generalized robust test. Input: model parameters θ1, θ2 ∈ Θ. Parameters: distribution P over R^d. Output: p̂ ∈ [0, 1].
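The MATCH → SPEARMAN → FISHER pipeline can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: matching is done greedily rather than as an optimal assignment, the Spearman p-value uses the normal approximation ρ√(h−1) ≈ N(0, 1) rather than the paper's derivation, and Fisher's method uses the closed-form χ² survival function for even degrees of freedom:

```python
import math
import numpy as np

def match(W1, W2):
    """Greedy cosine-similarity matching: pair rows of W1 with rows of W2
    in decreasing order of cosine similarity. Returns pi with pi[r] = c."""
    A = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    B = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    sim = A @ B.T
    h = sim.shape[0]
    pi = np.full(h, -1)
    used_row, used_col = np.zeros(h, bool), np.zeros(h, bool)
    for idx in np.argsort(-sim, axis=None):
        r, c = divmod(int(idx), h)
        if not used_row[r] and not used_col[c]:
            pi[r] = c
            used_row[r] = used_col[c] = True
    return pi

def spearman_pvalue(pi1, pi2):
    """One-sided p-value for positive Spearman correlation between two
    permutations, via the normal approximation rho*sqrt(h-1) ~ N(0,1)."""
    h = len(pi1)
    d = np.asarray(pi1, float) - np.asarray(pi2, float)
    rho = 1.0 - 6.0 * np.sum(d ** 2) / (h * (h ** 2 - 1))
    z = rho * math.sqrt(h - 1)
    return max(0.5 * math.erfc(z / math.sqrt(2)), 1e-300)  # keep in (0, 1]

def fisher(pvals):
    """Fisher's method: -2*sum(log p) ~ chi^2 with 2L dof under the null.
    The chi^2 survival function has a closed form for even dof."""
    L = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k) for k in range(L))
```

In use: run MATCH on corresponding weight matrices of each layer, score the consistency of the resulting permutations with SPEARMAN, and aggregate the per-layer p-values with FISHER.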
Open Source Code: Yes. "We share code at https://github.com/ahmeda14960/model-tracing."
Open Datasets: Yes. "We computed φ_JSD using input sequences sampled from WikiText-103 (Merity et al., 2017; Xu et al., 2024), consistent with prior work." "We trained a second model with independently chosen initialization and data ordering. Specifically, we ensure that our test does not incorrectly detect two similar (trained using the same learning algorithm) but independent (randomly initialized) models as non-independent. To verify this, we randomly initialized a model with the OLMo (7B) architecture (Groeneveld et al., 2024) and trained it on the Dolma v1.7 dataset (Soldaini et al., 2024)."
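The φ_JSD statistic compares two models' next-token distributions on shared input sequences via the Jensen-Shannon divergence. A minimal numpy sketch, using random logits over a toy vocabulary in place of real model outputs (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between probability vectors;
    symmetric and bounded above by log 2."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def phi_jsd(logits1, logits2):
    """Average JSD between two models' next-token distributions over a
    batch of positions -- a stand-in for the paper's phi_JSD statistic."""
    P, Q = softmax(logits1), softmax(logits2)
    return float(np.mean([jsd(p, q) for p, q in zip(P, Q)]))

rng = np.random.default_rng(0)
logits_a = rng.normal(size=(5, 10))  # 5 positions, vocab of 10 (toy)
logits_b = rng.normal(size=(5, 10))
print(phi_jsd(logits_a, logits_a), phi_jsd(logits_a, logits_b))
```

Identical models give φ_JSD ≈ 0, while unrelated output distributions give a strictly positive value below log 2.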
Dataset Splits: No. The paper primarily analyzes pre-trained models. For the OLMo models it states: "We keep checkpoints for both seeds after 100M, 1B, 10B, and 18B train tokens," which refers to training checkpoints rather than explicit train/validation/test splits for evaluation.
Hardware Specification: No. The paper does not provide specific hardware details, such as GPU or CPU models, used for running its experiments.
Software Dependencies: No. The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup: Yes. "We reinitialize the first GLU MLP module of a model θ1 with an MLP of double the width and, using Algorithm 5 (generalized robust test), train θ̂1 with random Gaussians as the training distribution P. We retrain each of the 32 MLPs (keeping other layers fixed) of vicuna-7b-v1.5 (a finetune of Llama-2-7b-hf) for 10,000 gradient steps (until the loss curve plateaus), using MSE loss and the Adam optimizer with a learning rate of 0.001 and batch size of 5000. (Additional hyperparameters and a learning curve are in Appendix F.2.)"
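The core of this setup is distillation of a frozen MLP into a reinitialized, wider MLP on Gaussian inputs. A self-contained numpy sketch under simplifying assumptions: a plain ReLU MLP stands in for the GLU module, dimensions are tiny, training runs for far fewer steps than the paper's 10,000, and Adam is implemented by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h_teacher, h_student = 8, 16, 32  # toy sizes; student has double the width

# Frozen "teacher" MLP layer, standing in for the original GLU MLP module.
Wt = rng.normal(size=(d, h_teacher))
Vt = rng.normal(size=(h_teacher, d)) / np.sqrt(h_teacher)
teacher = lambda x: np.maximum(x @ Wt, 0.0) @ Vt

# Reinitialized student of double the width, trained to imitate the teacher.
Ws = 0.1 * rng.normal(size=(d, h_student))
Vs = 0.1 * rng.normal(size=(h_student, d))

lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8  # Adam hyperparameters (lr as in paper)
mW, vW = np.zeros_like(Ws), np.zeros_like(Ws)
mV, vV = np.zeros_like(Vs), np.zeros_like(Vs)

losses = []
for t in range(1, 2001):              # far fewer steps than the paper's 10,000
    x = rng.normal(size=(256, d))     # Gaussian training distribution P
    y = teacher(x)
    pre = x @ Ws
    act = np.maximum(pre, 0.0)
    err = act @ Vs - y
    losses.append(float(np.mean(err ** 2)))   # MSE loss
    # Backpropagate through the two-layer student.
    g_out = 2.0 * err / err.size
    gV = act.T @ g_out
    gW = x.T @ ((g_out @ Vs.T) * (pre > 0))
    for p, g, m, v in ((Ws, gW, mW, vW), (Vs, gV, mV, vV)):
        m[:] = b1 * m + (1 - b1) * g
        v[:] = b2 * v + (1 - b2) * g ** 2
        p -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

print(f"MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same loop, scaled up with the paper's batch size of 5000 and 10,000 steps, is what "retraining an MLP keeping other layers fixed" amounts to for each of the 32 layers.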