Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

Authors: Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, Minjoon Seo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that this decline is closely associated with a reduction in the model's ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (a smaller number of active memory sources) impairs the model's knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model's capacity for knowledge acquisition and retention.
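As a rough illustration of the definition quoted above, knowledge entropy can be read as the Shannon entropy of how uniformly the model's memory sources (e.g., FFN memory slots) are activated over a reference corpus. The sketch below is a hypothetical minimal formulation, not the paper's exact computation; the coefficient matrix and the aggregation by mean absolute activity are assumptions.

```python
import numpy as np

def knowledge_entropy(coeffs: np.ndarray) -> float:
    """Entropy of aggregated memory-source usage.

    coeffs: (num_tokens, num_memories) array of memory coefficients
    (hypothetical input shape; the paper aggregates over a corpus).
    High entropy = many memory sources active; low = a few dominate.
    """
    usage = np.abs(coeffs).mean(axis=0)   # average activity per memory source
    p = usage / usage.sum()               # normalize into a distribution
    p = np.clip(p, 1e-12, None)           # avoid log(0)
    return float(-(p * np.log(p)).sum())
```

Under this reading, perfectly uniform usage over N sources gives log N, while concentration on a handful of sources drives the value down, which is the "entropy decay" the paper tracks during pretraining.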
Researcher Affiliation | Collaboration | Jiyeon Kim (KAIST AI), Hyunji Lee (KAIST AI), Hyowon Cho (KAIST AI), Joel Jang (University of Washington), Hyeonbin Hwang (KAIST AI), Seungpil Won (LG AI Research), Youbin Ahn (LG AI Research), Dohaeng Lee (LG AI Research), Minjoon Seo (KAIST AI)
Pseudocode | Yes | Algorithm 1: Resuscitating Low Memory Coefficients
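Algorithm 1 itself is given in the paper; as a hedged sketch of the general idea it names (boosting the activity of the least-active memory sources), one might write something like the following. The `fraction` and `scale` parameters and the selection rule are illustrative assumptions, not the paper's values.

```python
import numpy as np

def resuscitate_low_coefficients(coeffs: np.ndarray,
                                 fraction: float = 0.5,
                                 scale: float = 2.0) -> np.ndarray:
    """Scale up the coefficients of the least-active memory sources.

    coeffs: (num_tokens, num_memories) memory coefficients.
    fraction: share of memories to treat as 'inactive' (assumption).
    scale: amplification factor for those memories (assumption).
    """
    usage = np.abs(coeffs).mean(axis=0)        # per-memory average activity
    k = int(fraction * coeffs.shape[1])        # number of memories to boost
    low_idx = np.argsort(usage)[:k]            # k least-active memories
    boosted = coeffs.copy()
    boosted[:, low_idx] *= scale               # amplify only inactive ones
    return boosted
```

This mirrors the paper's finding only in spirit: increasing the activity of inactive sources raises knowledge entropy, which the authors show improves acquisition and retention.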
Open Source Code | Yes | Code at https://github.com/kaistAI/Knowledge-Entropy.git
Open Datasets | Yes | To conduct the experiment, we use the OLMo (Groeneveld et al., 2024) models (1B and 7B), open-source large language models with intermediate pretraining checkpoints released, trained on the Dolma dataset (Soldaini et al., 2024). To measure knowledge entropy, we use a subset of Dolma: 2k instances that appear in the first batch of the official pretraining data order, ensuring that all models we use have seen the corpus during pretraining. Please note that the trend persists across other corpora as well (Figure 7 in Appendix A.2); however, since we are analyzing the model's behavior throughout training, we define knowledge entropy based on calculations using the training dataset. Dataset: We experiment on subsets of two datasets: PubMed, a corpus of bio-medical and life-science abstracts, and C4 (Raffel et al., 2020), a large-scale corpus of diverse text gathered from web pages. We use PubMed as the primary dataset as it contains more new knowledge, making it a better fit for our continual knowledge learning setup (Appendix B.1). In addition to these datasets, we inject synthetic knowledge during training to assess the model's ability to acquire new information. Specifically, we use the FICTIONAL KNOWLEDGE dataset (Chang et al., 2024), which is designed to assess how well language models acquire factual knowledge during pretraining.
Dataset Splits | Yes | We randomly sample 205k instances for each dataset. In Chang et al. (2024), the probes are divided into three levels of difficulty, with five sentences created for each level, resulting in 15 probes per corpus. The difficulty levels are as follows: 1) Memorization probes directly ask about sentences explicitly present in the fictional corpus. 2) Semantic generalization probes are paraphrased versions of the memorization probes, testing the model's understanding of meaning beyond surface forms. 3) Compositional generalization probes assess whether the model can integrate multiple pieces of knowledge from the fictional corpus. The injected knowledge is incorporated into the training corpus during continual learning, with updates occurring every 160 steps. Following Chang et al. (2024), we divide the 130 corpora into two settings: paraphrase and once. In the paraphrase setting, each of 70 instances is paraphrased 10 times; every 160 steps, one paraphrased version of each instance is added to the training corpus, repeating this process 10 times. In the once setting, each instance is presented only once throughout the entire continual learning process: the 60 instances are divided into 10 groups, with 6 instances added every 160 steps.
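The injection schedule described above can be sketched as follows. The instance identifiers and the dictionary layout are hypothetical, but the counts follow the quoted text: 10 rounds of 160 steps, 70 paraphrased instances (one paraphrase per round), and 60 once-only instances in groups of 6.

```python
def injection_schedule(num_rounds=10, step_interval=160,
                       paraphrase_instances=70, once_instances=60):
    """Which fictional-knowledge items enter training at each round.

    A sketch of the schedule described above; identifiers are
    hypothetical indices, not the dataset's actual keys.
    """
    schedule = []
    group = once_instances // num_rounds   # 6 'once' instances per round
    for r in range(num_rounds):
        schedule.append({
            "step": r * step_interval,
            # paraphrase setting: the r-th paraphrase of every instance
            "paraphrase": [(i, r) for i in range(paraphrase_instances)],
            # once setting: a fresh group, shown only in this round
            "once": list(range(r * group, (r + 1) * group)),
        })
    return schedule
```

Note how the two settings differ: every paraphrase-setting instance recurs (in altered surface form) in all 10 rounds, while each once-setting instance appears in exactly one round.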
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) for its experiments; it only mentions using OLMo models.
Software Dependencies | No | The paper mentions using the AdamW optimizer and a cosine LR scheduler with warmup but does not specify any software libraries with version numbers (e.g., PyTorch 1.x, Python 3.x, CUDA 11.x).
Experiment Setup | Yes | Hyperparameters are chosen following previous research on continual knowledge learning (Jang et al., 2022; Kim et al., 2023), and we test various combinations to assess generalizability. For batch size, we test 128 and 2048; for learning rate, we experiment with 1e-4, 4e-4, and 1e-3. We also investigate the effect of training duration by comparing a single epoch to three epochs. Among these configurations, our base experiments focus on a batch size of 128, a learning rate of 4e-4, and single-epoch training on the PubMed corpus, as this setup most closely aligns with continual knowledge learning studies. We use the AdamW optimizer (β = (0.9, 0.95), weight decay = 0.1), a cosine LR scheduler with a warmup ratio of 0.05, and a maximum sequence length of 1024. We randomly selected 204,800 instances from the PubMed and C4 datasets and matched the sequence length to 1,024 tokens by concatenating instances, yielding a training dataset of approximately 210 million tokens.
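For concreteness, the learning-rate schedule implied by these hyperparameters (peak 4e-4, cosine decay, 5% linear warmup) could be sketched as below. This is an assumed textbook cosine-with-warmup form, not the authors' code, and the step count is only roughly implied by the setup (~210M tokens / (128 × 1024 tokens per step) ≈ 1,600 steps).

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 4e-4, warmup_ratio: float = 0.05) -> float:
    """Cosine schedule with linear warmup, using the hyperparameters
    reported above; a sketch, not the authors' exact implementation."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # linear ramp from 0 to peak over the first 5% of training
        return peak_lr * step / max(1, warmup_steps)
    # cosine decay from peak back to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With ~1,600 total steps, warmup covers the first ~80 steps, the rate peaks at 4e-4, and cosine decay brings it to 0 at the final step.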