Revisiting Topic-Guided Language Models

Authors: Carolina Zheng, Keyon Vafa, David Blei

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type | Experimental | In this section, we detail the reproducibility study and results. We also investigate the quality of learned topics and probe the LSTM-LM's hidden representations to find the amount of retained topic information.
Researcher Affiliation | Academia | Carolina Zheng, Department of Computer Science, Columbia University; Keyon Vafa, Department of Computer Science, Columbia University; David M. Blei, Department of Statistics and Department of Computer Science, Columbia University
Pseudocode | No | The paper describes models and their components using mathematical equations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We make public all code used for this study (https://github.com/carolinazheng/revisiting-tglms).
Open Datasets | Yes | We use four publicly available natural language datasets: APNEWS, IMDB (Maas et al., 2011), BNC (Consortium, 2007), and WikiText-2 (Merity et al., 2017). We follow the training, validation, and test splits from Lau et al. (2017) and Merity et al. (2017).
Dataset Splits | Yes | We follow the training, validation, and test splits from Lau et al. (2017) and Merity et al. (2017). Table 4 shows the dataset statistics. The data is preprocessed as follows: for WikiText-2, we use the standard vocabulary, tokenization, and splits from Merity et al. (2017).
Hardware Specification | Yes | The models in our codebase train to convergence within three days on a single Tesla V100 GPU. rGBN-RNN, trained using its public codebase, trains to convergence within one week on the same GPU. The experiments can be replicated on an AWS Tesla V100 GPU with 16 GB of GPU memory.
Software Dependencies | Yes | LSTM-LM, TopicRNN, VRTM, and TDLM are implemented in our codebase in PyTorch 1.12. We use the original implementation of rGBN-RNN, which uses TensorFlow 1.9.
Experiment Setup | Yes | For all LSTM-LM baselines, we use a hidden size of 600, word embeddings of size 300 initialized with Google News word2vec embeddings (Mikolov et al., 2013), and dropout of 0.4 between the LSTM input and output layers (and between the hidden layers for the 3-layer models). We train the RNN components using truncated backpropagation through time with a sequence length of 30. Following Lau et al. (2017), Rezaee & Ferraro (2020), and Guo et al. (2020), we use the Adam optimizer with a learning rate of 0.001 on APNEWS, IMDB, and BNC. For WikiText-2, we follow Merity et al. (2017) and use stochastic gradient descent; the initial learning rate is 20 and is divided by 4 when validation perplexity is worse than the previous iteration. The models are trained until validation perplexity does not improve for 5 epochs, and we use the best validation checkpoint. We train all models on single GPUs with a language model batch size of 64. We train LDA via Gibbs sampling using Mallet (McCallum, 2002). The hyperparameters are: α (topic density) = 50, β (word density) = 0.01, number of iterations = 1000.
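The WikiText-2 schedule described in the setup row (SGD starting at a learning rate of 20, divided by 4 whenever validation perplexity is worse than the previous epoch, with training stopped after 5 epochs without a new best) can be sketched as follows. This is a minimal illustration of the stated rule, not code from the paper's repository; the function name and simulation harness are assumptions.

```python
import math

def run_schedule(val_ppls, lr=20.0, decay=4.0, max_patience=5):
    """Simulate the learning-rate schedule over a list of per-epoch
    validation perplexities. Returns (final_lr, epochs_run)."""
    best = math.inf       # best validation perplexity seen so far
    prev = math.inf       # previous epoch's validation perplexity
    patience = 0          # epochs since the last improvement
    for epoch, ppl in enumerate(val_ppls, start=1):
        if ppl >= prev:   # worse than the previous epoch: decay the lr
            lr /= decay
        if ppl < best:    # new best checkpoint: reset patience
            best, patience = ppl, 0
        else:
            patience += 1
            if patience >= max_patience:
                return lr, epoch   # stop: no improvement for 5 epochs
        prev = ppl
    return lr, len(val_ppls)
```

For example, a run whose validation perplexity improves twice and then worsens for five straight epochs decays the learning rate five times (20 / 4^5) before stopping, while a run that keeps improving never decays it.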