Layer by Layer: Uncovering Hidden Representations in Language Models

Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Through extensive experiments on 32 text-embedding tasks across various architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features..." "In this section, we empirically test our theoretical framework through extensive experiments across architectures, scales, and training regimes."
Researcher Affiliation: Collaboration. 1University of Kentucky, 2Mila, 3University of Montreal, 4New York University, 5University of California, Los Angeles, 6Independent, 7Meta FAIR, 8Wand.AI.
Pseudocode: No. The paper includes a pseudocode snippet in Appendix D, but it is an example of how to use an existing library (NLPAug) for data augmentation, not structured pseudocode for the paper's main methodology or algorithms.
Open Source Code: Yes. "We make our code available at https://github.com/OFSkean/information_flow"
Open Datasets: Yes. "Through systematic evaluation on 32 embedding tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022)..." "We use WikiText-103 (Merity et al., 2017) for analyzing our representation metrics on standard textual data..." "We evaluate every model layer on ImageNet-1k with attention probing and our suite of metrics. (Tian et al., 2020)"
Dataset Splits: No. The paper refers to MTEB tasks and well-known datasets such as WikiText-103 and ImageNet-1k, which have standard splits, but it does not explicitly provide split percentages, sample counts, or other details needed for reproduction in the main text or the provided appendix sections. For instance, for WikiText it mentions filtering but not train/validation/test splits; for ImageNet-1k, 'Validation Top-1 Accuracy' implies a validation set, but the splitting methodology is not described.
Hardware Specification: No. The paper does not describe the hardware used for its experiments, such as specific GPU or CPU models or other computational resources.
Software Dependencies: No. "For the augmentation-invariance metrics such as InfoNCE, LiDAR, and DiME, we use the NLPAug library (Ma, 2019) to augment our prompts." While a specific software library (NLPAug) is mentioned, its version number is not provided, nor are other key software components with their versions.
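For context on the metrics named in this row: InfoNCE is a standard contrastive objective, and a generic NumPy sketch of it is easy to write down. This is a minimal illustration under stated assumptions (function name, cosine-similarity logits, and the temperature value are our choices), not the paper's exact estimator.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE loss over a batch of paired embeddings.

    Row i of z1 and row i of z2 are treated as a positive pair;
    all other rows in the batch serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature                    # cosine-similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z)                              # matched pairs: loss near 0
mismatched = info_nce(z, rng.normal(size=(8, 16)))    # unrelated pairs: loss near log(8)
print(aligned < mismatched)
```

In the paper's setting, z1 and z2 would hold embeddings of original and NLPAug-augmented prompts; lower loss at a layer indicates more augmentation-invariant representations there.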
Experiment Setup: No. "We evaluate three distinct architectural families: Pythia and Llama3 (decoder-only transformers) (Biderman et al., 2023; Dubey et al., 2024), Mamba (state-space model) (Gu & Dao, 2024), BERT (encoder-only transformer) (Devlin et al., 2019), and LLM2Vec models (bidirectional attention) (BehnamGhader et al., 2024). Tasks: We test each layer's embeddings on 32 tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), spanning classification, clustering, and reranking for a comprehensive evaluation across various tasks. We refer to the Appendix for details." The paper describes the models and tasks used but does not provide specific hyperparameter values, optimizer settings, or other detailed system-level training configurations necessary for reproducing the experiments.
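The per-layer evaluation described in this row reduces to producing one pooled embedding per layer and probing each on downstream tasks. A minimal NumPy mock of that step, with illustrative shapes and random arrays standing in for real model hidden states (none of these names or dimensions come from the paper):

```python
import numpy as np

# Illustrative dimensions: a 12-layer model, 16-token input, 64-dim hidden states.
num_layers, seq_len, d_model = 12, 16, 64
rng = np.random.default_rng(0)

# Stand-in for the per-layer hidden states a model would return
# (e.g., one array per layer, plus the input embedding layer).
hidden_states = [rng.normal(size=(seq_len, d_model)) for _ in range(num_layers + 1)]

def layer_embeddings(hidden_states):
    """Mean-pool each layer's token representations into one vector per layer."""
    return np.stack([h.mean(axis=0) for h in hidden_states])

embs = layer_embeddings(hidden_states)
print(embs.shape)  # (13, 64): one pooled embedding per layer
```

Each row of `embs` would then be fed to a task-specific probe (classifier, clustering, reranker), which is what makes layer-by-layer comparisons on MTEB-style benchmarks possible.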