On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Authors: Nghiem Tuong Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention. Our implementation is publicly available on GitHub. (Section 5, Experiments) To highlight the statistical advantages of zero-initialized attention and explore the potential of non-linear prompts, we conduct a series of question-answering experiments on LLM tasks. Section 5.1 provides an overview of our experimental setup, while the main results are presented in Section 5.2. |
| Researcher Affiliation | Collaboration | 1German Research Center for Artificial Intelligence (DFKI) 2University of Science, VNU-HCM, Ho Chi Minh City, Vietnam 3Viet Nam National University, Ho Chi Minh City, Vietnam 4The University of Texas at Austin 5Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. 6Work was completed while an employee at Qualcomm 7Max Planck Research School for Intelligent Systems (IMPRS-IS) 8University of Stuttgart 9Oldenburg University. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our implementation is publicly available on GitHub. |
| Open Datasets | Yes | We use the Open LLM benchmarks as in Beeching et al. (2024). These benchmarks evaluate the generative abilities of LLMs in four different tasks, including (i) AI2 Reasoning Challenge (ARC) with Easy (eas) and Challenge (cha) types (Clark et al., 2018), (ii) HellaSwag (Zellers et al., 2019), (iii) MMLU (Hendrycks et al., 2020), and (iv) TruthfulQA (Lin et al., 2021). All these tasks evaluate the model through multiple-choice questions. |
| Dataset Splits | Yes | We follow the experimental setup of LLaMA-Adapter (Zhang et al., 2024) by fine-tuning LLaMA on the Alpaca dataset (Taori et al., 2023). The model performance is evaluated on the test set by conducting a zero-shot evaluation for ARC, MMLU, and TruthfulQA while using a 10-shot setting for HellaSwag. Here, n-shot refers to incorporating n instruction-following samples into the prompt question. Specifically, we randomly subsample the Alpaca dataset at different fractions {1%, 10%, 30%, 50%, 100%} to simulate low-data scenarios. We then fine-tune the Non-Linear, Linear, and Random-Init prompts on these subsets for both LLaMA-7B and LLaMA-13B and evaluate their performance on the ARC dataset. This experiment allows us to assess how well each initialization strategy adapts to limited data and whether zero-initialized attention provides a consistent advantage in sample efficiency. Table 4. Statistics of 4 LLM benchmarks about the testing subset. |
| Hardware Specification | Yes | The models are trained for 5 epochs on 4 A100-80GB GPUs. |
| Software Dependencies | No | The paper mentions software like "LLaMA models", "LLaMA-Adapter", "ChatGPT", "GPT-4", "LoRA", "IA3", and "VeRA", which are models or methods used. However, it does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The models are trained with 4 A100 GPUs for 5 epochs. The training configuration includes a warmup period of 2 epochs, a total batch size of 64, a learning rate of 0.009, and a weight decay of 0.02. With LLaMA-7B, we use a prompt with length L = 10 and integrate adaptation prompts into the last K = 30 layers. On LLaMA-13B, we use L = 10 and insert prompts at the last K = 38 layers. |
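For readers unfamiliar with the mechanism the paper studies, the zero-initialized attention it builds on (introduced by LLaMA-Adapter, Zhang et al., 2024) can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: attention over the learnable prompt tokens is scaled by `tanh(gate)`, and the gate starts at zero so fine-tuning begins from the frozen model's behavior.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def zero_init_attention(q, k, v, pk, pv, gate):
    """Sketch of zero-initialized attention (LLaMA-Adapter style).

    q: (T, d) queries; k, v: (T, d) keys/values for original tokens;
    pk, pv: (L, d) keys/values for the L adaptation-prompt tokens.
    With gate = 0, the prompt contribution vanishes and the output
    equals standard attention over the original tokens.
    """
    d = q.shape[-1]
    a_txt = softmax(q @ k.T / np.sqrt(d))                      # attention over original tokens
    a_pmt = np.tanh(gate) * softmax(q @ pk.T / np.sqrt(d))     # gated attention over prompts
    return a_txt @ v + a_pmt @ pv
```

With `gate = 0` the prompt term is exactly zero, which is the property the paper's gating-factor analysis relies on.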
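The low-data experiment in the Dataset Splits row (random subsampling of Alpaca at fractions {1%, 10%, 30%, 50%, 100%}) could be reproduced with a helper like the following. The function name and seeding are illustrative assumptions; the authors' actual subsampling code is in their GitHub repository.

```python
import random

# Fractions of the Alpaca dataset used to simulate low-data scenarios (from the paper).
FRACTIONS = [0.01, 0.10, 0.30, 0.50, 1.00]

def subsample_dataset(dataset, fraction, seed=0):
    """Return a random subset containing `fraction` of `dataset`.

    Hypothetical helper: a fixed seed makes the subsets reproducible
    across the Non-Linear, Linear, and Random-Init prompt runs.
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)
```

Usage would be, e.g., `subsets = {f: subsample_dataset(alpaca, f) for f in FRACTIONS}`, fine-tuning each prompt variant on every subset.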
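The hyperparameters reported in the Experiment Setup row can be gathered into a single configuration for reference. Key names below are illustrative, not taken from the paper's code; the layer counts (32 for LLaMA-7B, 40 for LLaMA-13B) are standard LLaMA architecture facts used to locate the last K adapted layers.

```python
# Reported fine-tuning hyperparameters (illustrative key names).
TRAIN_CONFIG = {
    "epochs": 5,
    "warmup_epochs": 2,
    "total_batch_size": 64,
    "learning_rate": 0.009,
    "weight_decay": 0.02,
    "num_gpus": 4,  # A100-80GB
}

# Prompt length L and number of adapted layers K per model.
MODEL_CONFIG = {
    "LLaMA-7B":  {"prompt_length": 10, "adapted_layers": 30, "total_layers": 32},
    "LLaMA-13B": {"prompt_length": 10, "adapted_layers": 38, "total_layers": 40},
}

def adapted_layer_indices(model: str) -> range:
    """Indices of the last K transformer layers that receive adaptation prompts."""
    cfg = MODEL_CONFIG[model]
    return range(cfg["total_layers"] - cfg["adapted_layers"], cfg["total_layers"])
```

For LLaMA-7B this yields layers 2 through 31, i.e., prompts are inserted everywhere except the first two layers.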