What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Authors: Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, Thomas L. Griffiths
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora. The entire Section 4 is dedicated to "Empirical analysis" and presents numerous experimental results in tables and figures. |
| Researcher Affiliation | Collaboration | Liyi Zhang (EMAIL), Department of Computer Science, Princeton University; Michael Y. Li (EMAIL), Department of Computer Science, Stanford University; R. Thomas McCoy (EMAIL), Department of Linguistics and Wu Tsai Institute, Yale University; Theodore R. Sumers (EMAIL), Anthropic; Jian-Qiao Zhu (EMAIL), Department of Computer Science, Princeton University; Thomas L. Griffiths (EMAIL), Departments of Psychology and Computer Science, Princeton University |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/zhang-liyi/llm-embeddings. |
| Open Datasets | Yes | Latent Dirichlet Allocation (LDA; Blei et al., 2001) is an exchangeable generative model that is widely used for modelling the topic structure of documents. We use 20Newsgroups (20NG) and WikiText-103 (Merity et al., 2016). |
| Dataset Splits | Yes | Each dataset is split into three sets: set 1, set 2, and set 3. Set 1 is used for training the transformer. Set 2 is used for validating the transformer and for extracting the transformer embeddings used to train the probe. Set 3 is used for validating the probe. Except for the discrete hypothesis space datasets and the natural corpora, the sizes of the three sets are 10000, 3000, and 1000, and each sequence is 500 tokens long. In the discrete hypothesis space datasets, we experimented with different sequence lengths (detailed in our results), and the sizes of the three sets are 20000, 19000, and 1000. In HMM-LDA, sequences are 400 tokens long, and the sizes of the three sets are 10000, 1000, and 1000. On 20NG, probe training and validation are run on 11,314 and 7,532 documents, respectively. On WikiText-103, probe training and validation are run on 28,475 and 60 documents, respectively. |
| Hardware Specification | Yes | All computations for synthetic datasets are run on single Tesla T4 GPUs, and those for natural corpora are run on single A100 GPUs. |
| Software Dependencies | No | The paper mentions software components and algorithms like the "Adam optimizer" and "linear mixed-effects model" but does not provide specific version numbers for these or other key software dependencies to ensure reproducibility. |
| Experiment Setup | Yes | Dropout = 0.1 is applied, with learning rate = 0.001 and batch size = 64; in other settings, the learning rate is tuned in [0.001, 0.01] with batch size = 64. Autoregressive transformer (AT) and BERT training hyperparameters are given in Table 9. Probe hyperparameters are given in Table 10 for training on top of the synthetic-dataset language models, in Table 11 for GPT-2, GPT-2-medium, GPT-2-large, BERT, and BERT-large, and in Table 12 for Llama 2 and Llama 2-chat. |
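The probing protocol the table describes — freeze a language model, take its embeddings, and train a small probe to recover a latent quantity such as an LDA topic mixture — can be sketched as follows. This is a minimal illustration, not the paper's code: the synthetic "embeddings" stand in for real transformer activations, and `fit_linear_probe` and all dimensions are invented for the example.

```python
import numpy as np

def fit_linear_probe(X, Y, l2=1e-2):
    """Ridge-regression probe: map frozen embeddings X to latent
    targets Y (e.g. LDA topic mixtures) with an L2 penalty."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d, k = 1000, 64, 5                        # sequences, embedding dim, topics
theta = rng.dirichlet(np.ones(k), size=n)    # latent topic mixtures
A = rng.normal(size=(k, d))
# Stand-in for model embeddings: a noisy linear image of the latents.
X = theta @ A + 0.1 * rng.normal(size=(n, d))

# Mirrors the paper's protocol: one set trains the probe, another validates it.
X_tr, Y_tr = X[:800], theta[:800]
X_va, Y_va = X[800:], theta[800:]

W = fit_linear_probe(X_tr, Y_tr)
pred = X_va @ W
r2 = 1 - ((Y_va - pred) ** 2).sum() / ((Y_va - Y_va.mean(0)) ** 2).sum()
print(f"probe validation R^2: {r2:.3f}")
```

A high validation R² here only means the latents are linearly decodable from these toy embeddings; the paper's claim is the analogous result for real transformer activations.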