An Analysis for Reasoning Bias of Language Models with Small Initialization
Authors: Junjie Yao, Zhongwang Zhang, Zhi-Qin John Xu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. |
| Researcher Affiliation | Collaboration | 1School of Mathematical Sciences, Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, P.R. China 2Institute of Natural Sciences, School of Mathematical Sciences, MOE-LSC, School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, P.R. China 3Center for LLM, Institute for Advanced Algorithms Research, Shanghai, P.R. China 4Shanghai Seres Information Technology Co., Ltd, Shanghai 200040, China. Correspondence to: Zhongwang Zhang <EMAIL>, Zhi-Qin John Xu <EMAIL>. |
| Pseudocode | No | The paper describes methods and definitions mathematically, including propositions and lemmas, but does not present any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link to source code or an explicit statement about its public availability. |
| Open Datasets | Yes | The first dataset, PrOntoQA (Saparov & He, 2023), consists of question-answering examples that include chains of thought, which explicitly describe the reasoning necessary to answer the questions correctly. The second dataset, TinyStories (Eldan & Li, 2023), is a synthetic corpus of short stories containing only words typically understood by children aged 3 to 4 years. |
| Dataset Splits | Yes | For memory mapping, all data are contained within the training set D_mem, and no test set is employed, as generalization is not considered in this framework. For reasoning mapping, we define a set of masked anchor combinations M = {(a_{p+1}, a_{p+2}, ..., a_{p+q}) \| a_{p+i} ∈ A_rsn, i = 1, ..., q}, designate all sequences containing any masked combination (a_{p+1}, a_{p+2}, ..., a_{p+q}) ∈ M as the test set D_rsn,test, and designate the remaining sequences as D_rsn,train. The training set is D_train = D_mem ∪ D_rsn,train. |
| Hardware Specification | No | The paper mentions usage of 'HPC of School of Mathematical Sciences and the Student Innovation Center, and the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University' but does not specify exact GPU/CPU models or detailed computer specifications. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer' and the 'GPT-2 model' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For the experiments on the Transformer structure, we train three Transformer models on a dataset of 200,000 samples, with each input sequence having a fixed length of 9 tokens. The vocabulary size d_vob is set to 200, and the model architecture includes an embedding dimension d_m of 200, a feedforward dimension d_f of 512, and a query-key-value projection dimension d_k of 64. The Transformer-based model uses 2 decoder layers with 1 attention head per layer. Training is conducted for 1000 epochs with a batch size of 100, and gradient clipping is applied with a maximum norm of 1. The AdamW optimizer is employed with an initial learning rate of 1 × 10^-5. The initialization rates of the three models are γ = 0.3, 0.5, 0.8. |
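The reasoning-mapping split in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the anchor token IDs, the offset `p`, the combination length `q`, and the number of masked combinations are all hypothetical placeholders; only the routing rule (any sequence whose anchor slice matches a masked combination in M goes to the test set, everything else trains) follows the paper's description.

```python
import itertools
import random

random.seed(0)

A_rsn = list(range(100, 110))  # hypothetical reasoning-anchor token IDs
p, q = 2, 2                    # assumed offset and combination length

# Sample a small masked set M of anchor combinations (a_{p+1}, ..., a_{p+q}).
all_combos = list(itertools.product(A_rsn, repeat=q))
M = set(random.sample(all_combos, 5))

def split(sequences):
    """Route a sequence to D_rsn,test iff its anchor slice lies in M."""
    train, test = [], []
    for seq in sequences:
        combo = tuple(seq[p:p + q])  # tokens at positions p+1, ..., p+q
        (test if combo in M else train).append(seq)
    return train, test
```

A full training set per the paper would then be D_mem plus the `train` portion returned here.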
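The hyperparameters in the Experiment Setup row can be collected into a config sketch. The dictionary below only restates values reported in the table; the `init_std` mapping from the initialization rate γ to a parameter standard deviation of d_m^(-γ) is an assumption based on common small-initialization conventions and is not confirmed by the review.

```python
# Configuration reported in the reproducibility table (values from the paper).
config = {
    "vocab_size": 200,       # d_vob
    "d_model": 200,          # embedding dimension d_m
    "d_ff": 512,             # feedforward dimension d_f
    "d_qkv": 64,             # query-key-value projection dimension d_k
    "n_layers": 2,           # decoder layers
    "n_heads": 1,            # attention heads per layer
    "seq_len": 9,            # fixed input length in tokens
    "n_samples": 200_000,
    "epochs": 1000,
    "batch_size": 100,
    "grad_clip_norm": 1.0,
    "optimizer": "AdamW",
    "lr": 1e-5,
}

def init_std(gamma, d_model=config["d_model"]):
    """Assumed small-initialization scaling: std = d_model ** -gamma."""
    return d_model ** -gamma

# The three models use initialization rates gamma = 0.3, 0.5, 0.8;
# larger gamma gives a smaller initialization scale under this assumption.
stds = {g: init_std(g) for g in (0.3, 0.5, 0.8)}
```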