On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Authors: Nghiem Tuong Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention. Our implementation is publicly available on GitHub. (Section 5, Experiments) To highlight the statistical advantages of zero-initialized attention and explore the potential of non-linear prompts, we conduct a series of question-answering experiments on LLM tasks. Section 5.1 provides an overview of our experimental setup, while the main results are presented in Section 5.2. |
| Researcher Affiliation | Collaboration | 1German Research Center for Artificial Intelligence (DFKI) 2University of Science, VNU-HCM, Ho Chi Minh City, Vietnam 3Viet Nam National University, Ho Chi Minh City, Vietnam 4The University of Texas at Austin 5Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc. 6Work was completed while an employee at Qualcomm 7Max Planck Research School for Intelligent Systems (IMPRS-IS) 8University of Stuttgart 9Oldenburg University. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our implementation is publicly available on GitHub. |
| Open Datasets | Yes | We use the Open LLM benchmarks as in Beeching et al. (2024). These benchmarks evaluate the generative abilities of LLMs in four different tasks, including (i) AI2 Reasoning Challenge (ARC) with Easy (eas) and Challenge (cha) types (Clark et al., 2018), (ii) HellaSwag (Zellers et al., 2019), (iii) MMLU (Hendrycks et al., 2020), and (iv) TruthfulQA (Lin et al., 2021). All these tasks evaluate the model through multiple-choice questions. |
| Dataset Splits | Yes | We follow the experimental setup of LLaMA-Adapter (Zhang et al., 2024) by fine-tuning LLaMA on the Alpaca dataset (Taori et al., 2023). The model performance is evaluated on the test set by conducting a zero-shot evaluation for ARC, MMLU, and TruthfulQA while using a 10-shot setting for HellaSwag. Here, n-shot refers to incorporating n instruction-following samples into the prompt question. Specifically, we randomly subsample the Alpaca dataset at different fractions {1%, 10%, 30%, 50%, 100%} to simulate low-data scenarios. We then fine-tune the Non-Linear, Linear, and Random-Init prompts on these subsets for both LLaMA-7B and LLaMA-13B and evaluate their performance on the ARC dataset. This experiment allows us to assess how well each initialization strategy adapts to limited data and whether zero-initialized attention provides a consistent advantage in sample efficiency. Table 4. Statistics of 4 LLM benchmarks about the testing subset. |
| Hardware Specification | Yes | The models are trained for 5 epochs on 4 A100-80GB GPUs. |
| Software Dependencies | No | The paper mentions software like "LLaMA models", "LLaMA-Adapter", "ChatGPT", "GPT-4", "LoRA", "IA3", and "VeRA", which are models or methods used. However, it does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The models are trained with 4 A100 GPUs for 5 epochs. The training configuration includes a warmup period of 2 epochs, a total batch size of 64, a learning rate of 0.009, and a weight decay of 0.02. With LLaMA-7B, we use a prompt with length L = 10 and integrate adaptation prompts into the last K = 30 layers. On LLaMA-13B, we use L = 10 and insert prompts at the last K = 38 layers. |
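For readers unfamiliar with the mechanism the paper studies, the zero-initialized attention it builds on (introduced by LLaMA-Adapter, Zhang et al., 2024) can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: attention over the learnable prompt tokens is scaled by `tanh(gate)`, and the gate starts at zero so fine-tuning begins from the frozen model's behavior.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def zero_init_attention(q, k, v, pk, pv, gate):
    """Sketch of zero-initialized attention (LLaMA-Adapter style).

    q: (T, d) queries; k, v: (T, d) keys/values for original tokens;
    pk, pv: (L, d) keys/values for the L adaptation-prompt tokens.
    With gate = 0, the prompt contribution vanishes and the output
    equals standard attention over the original tokens.
    """
    d = q.shape[-1]
    a_txt = softmax(q @ k.T / np.sqrt(d))                      # attention over original tokens
    a_pmt = np.tanh(gate) * softmax(q @ pk.T / np.sqrt(d))     # gated attention over prompts
    return a_txt @ v + a_pmt @ pv
```

With `gate = 0` the prompt term is exactly zero, which is the property the paper's gating-factor analysis relies on.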
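The low-data experiment in the Dataset Splits row (random subsampling of Alpaca at fractions {1%, 10%, 30%, 50%, 100%}) could be reproduced with a helper like the following. The function name and seeding are illustrative assumptions; the authors' actual subsampling code is in their GitHub repository.

```python
import random

# Fractions of the Alpaca dataset used to simulate low-data scenarios (from the paper).
FRACTIONS = [0.01, 0.10, 0.30, 0.50, 1.00]

def subsample_dataset(dataset, fraction, seed=0):
    """Return a random subset containing `fraction` of `dataset`.

    Hypothetical helper: a fixed seed makes the subsets reproducible
    across the Non-Linear, Linear, and Random-Init prompt runs.
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)
```

Usage would be, e.g., `subsets = {f: subsample_dataset(alpaca, f) for f in FRACTIONS}`, fine-tuning each prompt variant on every subset.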
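The hyperparameters reported in the Experiment Setup row can be gathered into a single configuration for reference. Key names below are illustrative, not taken from the paper's code; the layer counts (32 for LLaMA-7B, 40 for LLaMA-13B) are standard LLaMA architecture facts used to locate the last K adapted layers.

```python
# Reported fine-tuning hyperparameters (illustrative key names).
TRAIN_CONFIG = {
    "epochs": 5,
    "warmup_epochs": 2,
    "total_batch_size": 64,
    "learning_rate": 0.009,
    "weight_decay": 0.02,
    "num_gpus": 4,  # A100-80GB
}

# Prompt length L and number of adapted layers K per model.
MODEL_CONFIG = {
    "LLaMA-7B":  {"prompt_length": 10, "adapted_layers": 30, "total_layers": 32},
    "LLaMA-13B": {"prompt_length": 10, "adapted_layers": 38, "total_layers": 40},
}

def adapted_layer_indices(model: str) -> range:
    """Indices of the last K transformer layers that receive adaptation prompts."""
    cfg = MODEL_CONFIG[model]
    return range(cfg["total_layers"] - cfg["adapted_layers"], cfg["total_layers"])
```

For LLaMA-7B this yields layers 2 through 31, i.e., prompts are inserted everywhere except the first two layers.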