Enhancing Elusive Clues in Knowledge Learning by Contrasting Attention of Language Models
Authors: Jian Gao, Xiao Zhang, Miao Li, Ji Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented on both synthetic and real-world corpora and show that the proposed method outperforms other forms of data augmentation, and that boosting elusive clues universally helps both large and small models. [...] Results in Table 2 show that the proposed token-dropout augmentation based on attention difference significantly outperforms other data augmentation methods. |
| Researcher Affiliation | Academia | Jian Gao (2), Xiao Zhang (1), Miao Li (1, corresponding author), Ji Wu (1,3,4). Affiliations: 1. Department of Electronic Engineering, Tsinghua University; 2. Department of Energy and Power Engineering, Tsinghua University; 3. College of AI, Tsinghua University; 4. Beijing National Research Center for Information Science and Technology. |
| Pseudocode | No | We use the following function to calculate dropout probability for each token: p(r) = α(1 − e^(−βr)) (1) [...] Figure 4 illustrates the process of the proposed augmentation method. |
| Open Source Code | Yes | Code https://github.com/tsinghua-msiip/contrasting_attention [...] We release the code and data used in this paper for reproducibility and further research. |
| Open Datasets | Yes | Zhu and Li (2023) introduced a synthetic biography dataset for evaluating the efficiency of knowledge learning in language models. [...] Aside from the biography dataset, we also evaluate the proposed method on Wikipedia text to verify if the method helps knowledge learning on general text. Specifically, we evaluate on the Paragraph-Level Wikipedia Question-Answering dataset (Du and Cardie 2018). |
| Dataset Splits | No | The task is to finetune (continual pretraining) a language model on the biographies to let it memorize the factual information about the individuals. After training, the model is evaluated on a question-answering task [...] For each experiment, we trained the model from 10 to 30 epochs with learning rates in [5e-5, 1e-3] and selected the model with the best performance. |
| Hardware Specification | Yes | We finetune models with Hugging Face's Transformers library (Wolf et al. 2020) on NVIDIA 4090 GPUs. |
| Software Dependencies | No | We use low-rank adaptation (LoRA) (Hu et al. 2022) to facilitate finetuning of models up to 70 billion parameters. [...] We finetune models with Hugging Face's Transformers library (Wolf et al. 2020) on NVIDIA 4090 GPUs. |
| Experiment Setup | Yes | We use low-rank adaptation (LoRA) (Hu et al. 2022) to facilitate finetuning of models up to 70 billion parameters. As the corpus size is limited, we use a rank of 16 for the LoRA adapters. Adapters are added to all of the model's weights except for the embedding and the output layer. [...] For each experiment, we trained the model from 10 to 30 epochs with learning rates in [5e-5, 1e-3] and selected the model with the best performance. [...] For each of the augmentation methods, we generate 10 augmented versions of each training example and combine them with the original examples. [...] We also searched for the best hyperparameters α and β individually for each method. |
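The dropout-probability formula quoted in the Pseudocode row, p(r) = α(1 − e^(−βr)), can be sketched as a short Python function. This is an illustrative reconstruction, not the authors' released code: the interpretation of r as a per-token attention-difference score, and the default α and β values, are assumptions.

```python
import math
import random

def dropout_prob(r, alpha=0.5, beta=1.0):
    """Eq. 1 from the paper: p(r) = alpha * (1 - exp(-beta * r)).

    p(0) = 0 and p(r) rises monotonically toward alpha as r grows,
    so tokens with larger attention-difference scores are dropped
    more often, but never with probability above alpha.
    """
    return alpha * (1.0 - math.exp(-beta * r))

def token_dropout(tokens, scores, alpha=0.5, beta=1.0, seed=0):
    """Build one augmented example by dropping each token
    independently with probability dropout_prob(score)."""
    rng = random.Random(seed)
    return [
        tok for tok, score in zip(tokens, scores)
        if rng.random() >= dropout_prob(score, alpha, beta)
    ]
```

The saturating form means α caps the maximum dropout rate while β controls how quickly the cap is approached, which matches the paper's report of tuning both hyperparameters per method.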
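The Experiment Setup row describes rank-16 LoRA adapters on all weights except the embedding and output layers. A minimal numerical sketch of the low-rank update itself is below; the dimensions, initialization scale, and function names are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def lora_update(W, A, B, scaling=1.0):
    """LoRA replaces a frozen weight W with W + scaling * (B @ A),
    where A (r x d_in) and B (d_out x r) have rank r << min(d_out, d_in)."""
    return W + scaling * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 16          # rank 16, as in the paper's setup

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))             # B starts at zero, so the adapter
                                     # initially leaves the model unchanged
W_adapted = lora_update(W, A, B)
```

Only A and B (2 * r * d parameters per weight matrix) are trained, which is what makes finetuning a 70B-parameter model tractable on 4090-class GPUs.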