Layer by Layer: Uncovering Hidden Representations in Language Models

Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Through extensive experiments on 32 text-embedding tasks across various architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features..." "In this section, we empirically test our theoretical framework through extensive experiments across architectures, scales, and training regimes."
Researcher Affiliation: Collaboration. 1University of Kentucky, 2Mila, 3University of Montreal, 4New York University, 5University of California, Los Angeles, 6Independent, 7Meta FAIR, 8Wand.AI.
Pseudocode: No. The paper includes a pseudocode snippet in Appendix D, but it is an example of how to use an existing library (NLPAug) for data augmentation, not structured pseudocode for the paper's main methodology or algorithms.
Open Source Code: Yes. "We make our code available at https://github.com/OFSkean/information_flow"
Open Datasets: Yes. "Through systematic evaluation on 32 embedding tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022)..." "We use WikiText-103 (Merity et al., 2017) for analyzing our representation metrics on standard textual data..." "We evaluate every model layer on ImageNet-1k with attention probing and our suite of metrics. (Tian et al., 2020)"
Dataset Splits: No. The paper refers to MTEB tasks and well-known datasets such as WikiText-103 and ImageNet-1k, which have standard splits, but it does not explicitly provide split percentages, sample counts, or other details needed for reproduction in the main text or the provided appendix sections. For instance, for WikiText it mentions filtering but not train/validation/test splits; for ImageNet-1k, 'Validation Top-1 Accuracy' implies a validation set, but the splitting methodology is not described.
Hardware Specification: No. The paper does not describe the hardware used for its experiments, such as specific GPU or CPU models or other computational resources.
Software Dependencies: No. "For the augmentation-invariance metrics such as InfoNCE, LiDAR, and DiME, we use the NLPAug library (Ma, 2019) to augment our prompts." While a specific software library (NLPAug) is mentioned, its version number is not provided, nor are other key software components with their versions.
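For context on the metrics named in this row: InfoNCE is a standard contrastive objective, and a generic NumPy sketch of it is easy to write down. This is a minimal illustration under stated assumptions (function name, cosine-similarity logits, and the temperature value are our choices), not the paper's exact estimator.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE loss over a batch of paired embeddings.

    Row i of z1 and row i of z2 are treated as a positive pair;
    all other rows in the batch serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature                    # cosine-similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z)                              # matched pairs: loss near 0
mismatched = info_nce(z, rng.normal(size=(8, 16)))    # unrelated pairs: loss near log(8)
print(aligned < mismatched)
```

In the paper's setting, z1 and z2 would hold embeddings of original and NLPAug-augmented prompts; lower loss at a layer indicates more augmentation-invariant representations there.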
Experiment Setup: No. "We evaluate three distinct architectural families: Pythia and Llama3 (decoder-only transformers) (Biderman et al., 2023; Dubey et al., 2024), Mamba (state-space model) (Gu & Dao, 2024), BERT (encoder-only transformer) (Devlin et al., 2019), and LLM2Vec models (bidirectional attention) (BehnamGhader et al., 2024). Tasks: We test each layer's embeddings on 32 tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), spanning classification, clustering, and reranking for a comprehensive evaluation across various tasks. We refer to the Appendix for details." The paper describes the models and tasks used but does not provide specific hyperparameter values, optimizer settings, or other detailed system-level training configurations necessary for reproducing the experiments.
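The per-layer evaluation described in this row reduces to producing one pooled embedding per layer and probing each on downstream tasks. A minimal NumPy mock of that step, with illustrative shapes and random arrays standing in for real model hidden states (none of these names or dimensions come from the paper):

```python
import numpy as np

# Illustrative dimensions: a 12-layer model, 16-token input, 64-dim hidden states.
num_layers, seq_len, d_model = 12, 16, 64
rng = np.random.default_rng(0)

# Stand-in for the per-layer hidden states a model would return
# (e.g., one array per layer, plus the input embedding layer).
hidden_states = [rng.normal(size=(seq_len, d_model)) for _ in range(num_layers + 1)]

def layer_embeddings(hidden_states):
    """Mean-pool each layer's token representations into one vector per layer."""
    return np.stack([h.mean(axis=0) for h in hidden_states])

embs = layer_embeddings(hidden_states)
print(embs.shape)  # (13, 64): one pooled embedding per layer
```

Each row of `embs` would then be fed to a task-specific probe (classifier, clustering, reranker), which is what makes layer-by-layer comparisons on MTEB-style benchmarks possible.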