What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Authors: Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, Thomas L. Griffiths
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora. The entire Section 4 is dedicated to "Empirical analysis" and presents numerous experimental results in tables and figures. |
| Researcher Affiliation | Collaboration | Liyi Zhang (EMAIL), Department of Computer Science, Princeton University; Michael Y. Li (EMAIL), Department of Computer Science, Stanford University; R. Thomas McCoy (EMAIL), Department of Linguistics and Wu Tsai Institute, Yale University; Theodore R. Sumers (EMAIL), Anthropic; Jian-Qiao Zhu (EMAIL), Department of Computer Science, Princeton University; Thomas L. Griffiths (EMAIL), Departments of Psychology and Computer Science, Princeton University |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/zhang-liyi/llm-embeddings. |
| Open Datasets | Yes | Latent Dirichlet Allocation (LDA; Blei et al., 2001) is an exchangeable generative model that is widely used for modelling the topic structure of documents. We use 20Newsgroups (20NG) and WikiText-103 (Merity et al., 2016). |
| Dataset Splits | Yes | Each dataset is split into three sets: set 1, set 2, and set 3. Set 1 is used for training the transformer. Set 2 is used for validating the transformer and for extracting the transformer embeddings used to train the probe. Set 3 is used for validating the probe. Except for the discrete hypothesis space datasets and the natural corpora, the sizes of the three sets are 10000, 3000, and 1000, and each sequence is 500 tokens long. In the discrete hypothesis space datasets, we experimented with different sequence lengths (detailed in our results), and the sizes of the three sets are 20000, 19000, and 1000. In HMM-LDA, sequences are 400 tokens long, and the sizes of the three sets are 10000, 1000, and 1000. On 20NG, probe training and validation are run on 11,314 and 7,532 documents, respectively. On WikiText-103, probe training and validation are run on 28,475 and 60 documents, respectively. |
| Hardware Specification | Yes | All computations for synthetic datasets are run on single Tesla T4 GPUs, and those for natural corpora are run on single A100 GPUs. |
| Software Dependencies | No | The paper mentions software components and algorithms like the "Adam optimizer" and "linear mixed-effects model" but does not provide specific version numbers for these or other key software dependencies to ensure reproducibility. |
| Experiment Setup | Yes | Dropout = 0.1 is applied, with learning rate = 0.001 and batch size = 64; in other settings, the learning rate is tuned in [0.001, 0.01] with batch size = 64. Autoregressive transformer (AT) and BERT training hyperparameters are given in Table 9. Probe hyperparameters are given in Table 10 for training on top of the synthetic-dataset language models, in Table 11 for GPT-2, GPT-2-medium, GPT-2-large, BERT, and BERT-large, and in Table 12 for Llama 2 and Llama 2-chat. |
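The probing protocol the table describes — freeze a language model, take its embeddings, and train a small probe to recover a latent quantity such as an LDA topic mixture — can be sketched as follows. This is a minimal illustration, not the paper's code: the synthetic "embeddings" stand in for real transformer activations, and `fit_linear_probe` and all dimensions are invented for the example.

```python
import numpy as np

def fit_linear_probe(X, Y, l2=1e-2):
    """Ridge-regression probe: map frozen embeddings X to latent
    targets Y (e.g. LDA topic mixtures) with an L2 penalty."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d, k = 1000, 64, 5                        # sequences, embedding dim, topics
theta = rng.dirichlet(np.ones(k), size=n)    # latent topic mixtures
A = rng.normal(size=(k, d))
# Stand-in for model embeddings: a noisy linear image of the latents.
X = theta @ A + 0.1 * rng.normal(size=(n, d))

# Mirrors the paper's protocol: one set trains the probe, another validates it.
X_tr, Y_tr = X[:800], theta[:800]
X_va, Y_va = X[800:], theta[800:]

W = fit_linear_probe(X_tr, Y_tr)
pred = X_va @ W
r2 = 1 - ((Y_va - pred) ** 2).sum() / ((Y_va - Y_va.mean(0)) ** 2).sum()
print(f"probe validation R^2: {r2:.3f}")
```

A high validation R² here only means the latents are linearly decodable from these toy embeddings; the paper's claim is the analogous result for real transformer activations.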