On the Emergence of Position Bias in Transformers
Authors: Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. |
| Researcher Affiliation | Academia | 1MIT IDSS & LIDS 2MIT CSAIL 3TU Munich. Correspondence to: Xinyi Wu <EMAIL>. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured, code-like algorithmic descriptions. |
| Open Source Code | Yes | Our code is available at github.com/xinyiwu98/position-bias-in-attention. |
| Open Datasets | No | To ensure a controlled setup that enables precise manipulation of positional biases in the data, we adopt the synthetic data-generating process and simplified self-attention network framework proposed in Reddy (2024). ... Each xi is sampled from a Gaussian mixture model... |
| Dataset Splits | No | The paper describes generating 'training sequences' and 'test sequences' with specific structures, and mentions 'three pairs of test sets, each containing 10,000 sequences'. However, it does not provide explicit percentages or sample counts for the training/validation/test splits from a single dataset, nor does it refer to predefined standard splits for reproducibility. |
| Hardware Specification | Yes | We trained all of our models on a Tesla V100 GPU. |
| Software Dependencies | No | All models were implemented with PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | Following Reddy (2024), we set n = 8 and d = 64. ... we set γ = 0.75, K = 2048, L = 32, and B = 4. ... For the decay mask, we set m = log(0.8) ≈ −0.223. For RoPE, we set θi = 10000^(−2(i − 1)/d)... we used the AdamW optimizer... with a learning rate of 10^−3, a weight decay of 10^−6, a batch size of 128, and trained for 100,000 iterations. |
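The experiment-setup row above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the RoPE frequencies θi = 10000^(−2(i − 1)/d) and the AdamW hyperparameters are taken from the table, while the `torch.nn.Linear` model is a stand-in placeholder for the simplified self-attention network the authors actually train.

```python
import torch

d = 64  # embedding dimension reported in the experiment setup

# RoPE rotary frequencies: theta_i = 10000^(-2(i-1)/d) for i = 1, ..., d/2
i = torch.arange(1, d // 2 + 1, dtype=torch.float32)
theta = 10000.0 ** (-2.0 * (i - 1) / d)

# Placeholder model; the paper uses a simplified self-attention network instead.
model = torch.nn.Linear(d, d)

# Optimizer settings as reported: AdamW, lr = 1e-3, weight decay = 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-6)
```

Note that with this convention θ1 = 1 and the frequencies decay monotonically toward 10000^(−(d − 2)/d) for the last rotary pair.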