Independence Tests for Language Models

Authors: Sally Zhu, Ahmed M Ahmed, Rohith Kuditipudi, Percy Liang

ICML 2025

Reproducibility checklist (variable, result, LLM response):
Research Type: Experimental. "We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test that matches hidden activations between two models; it is robust to these transformations and to changes in model architecture, and can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find it reliably distinguishes non-independent models like a p-value."
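The exact p-values above come from a permutation test: compare the observed test statistic against its distribution under random transformations of one model's weights. A minimal numpy sketch of this idea, with a hypothetical statistic (negative absolute correlation of flattened weights) and a hypothetical transformation class (row permutations) standing in for the paper's φ and Π:

```python
import numpy as np

def perm_test_pvalue(phi, w1, w2, permute, num_perms=99, seed=0):
    """Permutation test: p-value in (0, 1] for the null hypothesis that
    w1 and w2 are independent, given a statistic phi (smaller = more
    dependent-looking) and a random transformation permute."""
    rng = np.random.default_rng(seed)
    observed = phi(w1, w2)
    # Count transformed statistics at least as extreme as the observed one.
    count = sum(phi(w1, permute(w2, rng)) <= observed for _ in range(num_perms))
    return (1 + count) / (1 + num_perms)

# Toy choices (not the paper's): correlation statistic, row-permutation class.
phi = lambda a, b: -abs(np.corrcoef(a.ravel(), b.ravel())[0, 1])
permute = lambda w, rng: w[rng.permutation(w.shape[0])]

rng = np.random.default_rng(1)
base = rng.normal(size=(64, 16))
related = base + 0.01 * rng.normal(size=base.shape)  # a "fine-tuned" copy
independent = rng.normal(size=(64, 16))              # independently initialized

p_related = perm_test_pvalue(phi, base, related, permute)
p_independent = perm_test_pvalue(phi, base, independent, permute)
print(p_related, p_independent)
```

With these toy choices the related pair yields a small p-value while the independent pair does not; the paper's tests instantiate φ and Π so that the resulting p-value is exact under the null.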
Researcher Affiliation: Academia. "Department of Computer Science, Stanford University, Stanford, US. Correspondence to: Sally Zhu <EMAIL>, Ahmed Ahmed <EMAIL>."
Pseudocode: Yes. The paper presents five algorithms:
- Algorithm 1 (PERMTEST): test for computing p-values. Input: model weights θ1, θ2. Parameters: test statistic φ; discrete transformation class Π; permutation count T. Output: p-value p̂ ∈ (0, 1].
- Algorithm 2 (MATCH): cosine similarity matching. Input: matrices W1, W2 with h rows. Output: permutation π : [h] → [h].
- Algorithm 3 (SPEARMAN): deriving p-values from Spearman correlation. Input: permutations π1, π2 : [h] → [h]. Output: p-value p̂ ∈ (0, 1].
- Algorithm 4 (FISHER): aggregating p-values. Input: p-values {p̂(i)}, i = 1, ..., L. Output: p-value p̂ ∈ (0, 1].
- Algorithm 5: generalized robust test. Input: model parameters θ1, θ2 ∈ Θ. Parameters: distribution P over R^d. Output: p̂ ∈ [0, 1].
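The MATCH → SPEARMAN → FISHER pipeline can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: matching is done greedily rather than as an optimal assignment, the Spearman p-value uses the normal approximation ρ√(h−1) ≈ N(0, 1) rather than the paper's derivation, and Fisher's method uses the closed-form χ² survival function for even degrees of freedom:

```python
import math
import numpy as np

def match(W1, W2):
    """Greedy cosine-similarity matching: pair rows of W1 with rows of W2
    in decreasing order of cosine similarity. Returns pi with pi[r] = c."""
    A = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    B = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    sim = A @ B.T
    h = sim.shape[0]
    pi = np.full(h, -1)
    used_row, used_col = np.zeros(h, bool), np.zeros(h, bool)
    for idx in np.argsort(-sim, axis=None):
        r, c = divmod(int(idx), h)
        if not used_row[r] and not used_col[c]:
            pi[r] = c
            used_row[r] = used_col[c] = True
    return pi

def spearman_pvalue(pi1, pi2):
    """One-sided p-value for positive Spearman correlation between two
    permutations, via the normal approximation rho*sqrt(h-1) ~ N(0,1)."""
    h = len(pi1)
    d = np.asarray(pi1, float) - np.asarray(pi2, float)
    rho = 1.0 - 6.0 * np.sum(d ** 2) / (h * (h ** 2 - 1))
    z = rho * math.sqrt(h - 1)
    return max(0.5 * math.erfc(z / math.sqrt(2)), 1e-300)  # keep in (0, 1]

def fisher(pvals):
    """Fisher's method: -2*sum(log p) ~ chi^2 with 2L dof under the null.
    The chi^2 survival function has a closed form for even dof."""
    L = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k) for k in range(L))
```

In use: run MATCH on corresponding weight matrices of each layer, score the consistency of the resulting permutations with SPEARMAN, and aggregate the per-layer p-values with FISHER.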
Open Source Code: Yes. "We share code at https://github.com/ahmeda14960/model-tracing."
Open Datasets: Yes. "We computed φ_JSD using input sequences sampled from WikiText-103 (Merity et al., 2017; Xu et al., 2024), consistent with prior work." "We trained a second model with independently chosen initialization and data ordering. Specifically, we ensure that our test does not incorrectly detect two similar (trained using the same learning algorithm) but independent (randomly initialized) models as non-independent. To verify this, we randomly initialized a model with the OLMo (7B) architecture (Groeneveld et al., 2024) and trained it on the Dolma v1.7 dataset (Soldaini et al., 2024)."
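The φ_JSD statistic compares two models' next-token distributions on shared input sequences via the Jensen-Shannon divergence. A minimal numpy sketch, using random logits over a toy vocabulary in place of real model outputs (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between probability vectors;
    symmetric and bounded above by log 2."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def phi_jsd(logits1, logits2):
    """Average JSD between two models' next-token distributions over a
    batch of positions -- a stand-in for the paper's phi_JSD statistic."""
    P, Q = softmax(logits1), softmax(logits2)
    return float(np.mean([jsd(p, q) for p, q in zip(P, Q)]))

rng = np.random.default_rng(0)
logits_a = rng.normal(size=(5, 10))  # 5 positions, vocab of 10 (toy)
logits_b = rng.normal(size=(5, 10))
print(phi_jsd(logits_a, logits_a), phi_jsd(logits_a, logits_b))
```

Identical models give φ_JSD ≈ 0, while unrelated output distributions give a strictly positive value below log 2.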
Dataset Splits: No. The paper primarily analyzes pre-trained models. For the OLMo models it states: "We keep checkpoints for both seeds after 100M, 1B, 10B, and 18B train tokens," which refers to training checkpoints rather than explicit train/validation/test splits for evaluation.
Hardware Specification: No. The paper does not provide specific hardware details, such as GPU or CPU models, used for running its experiments.
Software Dependencies: No. The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup: Yes. "We reinitialize the first GLU MLP module of a model θ1 with an MLP of double the width and, using Algorithm 5 (generalized robust test), train θ̂1 with random Gaussians as the training distribution P. We retrain each of the 32 MLPs (keeping other layers fixed) of vicuna-7b-v1.5 (a finetune of Llama-2-7b-hf) for 10,000 gradient steps (until the loss curve plateaus), using MSE loss and the Adam optimizer with a learning rate of 0.001 and batch size of 5000. (Additional hyperparameters and a learning curve are in Appendix F.2.)"
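The core of this setup is distillation of a frozen MLP into a reinitialized, wider MLP on Gaussian inputs. A self-contained numpy sketch under simplifying assumptions: a plain ReLU MLP stands in for the GLU module, dimensions are tiny, training runs for far fewer steps than the paper's 10,000, and Adam is implemented by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h_teacher, h_student = 8, 16, 32  # toy sizes; student has double the width

# Frozen "teacher" MLP layer, standing in for the original GLU MLP module.
Wt = rng.normal(size=(d, h_teacher))
Vt = rng.normal(size=(h_teacher, d)) / np.sqrt(h_teacher)
teacher = lambda x: np.maximum(x @ Wt, 0.0) @ Vt

# Reinitialized student of double the width, trained to imitate the teacher.
Ws = 0.1 * rng.normal(size=(d, h_student))
Vs = 0.1 * rng.normal(size=(h_student, d))

lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8  # Adam hyperparameters (lr as in paper)
mW, vW = np.zeros_like(Ws), np.zeros_like(Ws)
mV, vV = np.zeros_like(Vs), np.zeros_like(Vs)

losses = []
for t in range(1, 2001):              # far fewer steps than the paper's 10,000
    x = rng.normal(size=(256, d))     # Gaussian training distribution P
    y = teacher(x)
    pre = x @ Ws
    act = np.maximum(pre, 0.0)
    err = act @ Vs - y
    losses.append(float(np.mean(err ** 2)))   # MSE loss
    # Backpropagate through the two-layer student.
    g_out = 2.0 * err / err.size
    gV = act.T @ g_out
    gW = x.T @ ((g_out @ Vs.T) * (pre > 0))
    for p, g, m, v in ((Ws, gW, mW, vW), (Vs, gV, mV, vV)):
        m[:] = b1 * m + (1 - b1) * g
        v[:] = b2 * v + (1 - b2) * g ** 2
        p -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

print(f"MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The same loop, scaled up with the paper's batch size of 5000 and 10,000 steps, is what "retraining an MLP keeping other layers fixed" amounts to for each of the 32 layers.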