On the Emergence of Position Bias in Transformers

Authors: Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs.
Researcher Affiliation | Academia | MIT IDSS & LIDS; MIT CSAIL; TU Munich. Correspondence to: Xinyi Wu <EMAIL>.
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured, code-like algorithmic descriptions.
Open Source Code | Yes | Our code is available at github.com/xinyiwu98/position-biasin-attention.
Open Datasets | No | To ensure a controlled setup that enables precise manipulation of positional biases in the data, we adopt the synthetic data-generating process and simplified self-attention network framework proposed in Reddy (2024). ... Each xi is sampled from a Gaussian mixture model...
Dataset Splits | No | The paper describes generating 'training sequences' and 'test sequences' with specific structures, and mentions 'three pairs of test sets, each containing 10,000 sequences'. However, it does not provide explicit percentages or sample counts for a training/validation/test split of a single dataset, nor does it refer to predefined standard splits for reproducibility.
Hardware Specification | Yes | We trained all of our models on a Tesla V100 GPU.
Software Dependencies | No | All models were implemented with PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | Following Reddy (2024), we set n = 8 and d = 64. ... we set γ = 0.75, K = 2048, L = 32, and B = 4. ... For the decay mask, we set m = −log(0.8) ≈ 0.223. For RoPE, we set θi = 10000^(−2(i−1)/d)... we used the AdamW optimizer... with a learning rate of 10^−3, a weight decay of 10^−6, a batch size of 128, and trained for 100,000 iterations.
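The Open Datasets row quotes a synthetic data-generating process in which each token xi is drawn from a Gaussian mixture. A minimal sketch of that kind of sampler is below; the number of components, mixing weights, component means, and noise scale are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_gmm_sequence(n, d, means, weights, sigma=0.1, seed=None):
    """Draw a sequence of n tokens, each sampled from a Gaussian mixture.

    means:   (K, d) array of component means
    weights: (K,) mixing probabilities summing to 1
    """
    rng = np.random.default_rng(seed)
    # Pick one mixture component per token position
    ks = rng.choice(len(weights), size=n, p=weights)
    # Add isotropic Gaussian noise around the chosen component mean
    return means[ks] + sigma * rng.standard_normal((n, d))

# Illustrative use with n = 8 and d = 64, matching the quoted setup;
# 4 equally weighted components are an assumption for this sketch.
means = np.random.default_rng(0).standard_normal((4, 64))
weights = np.full(4, 0.25)
seq = sample_gmm_sequence(8, 64, means, weights, seed=1)
```

Each row of `seq` is one token embedding; a full reproduction would follow the component and label structure specified in Reddy (2024).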
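The quoted hyperparameters can be collected into a short PyTorch sketch. The RoPE frequencies follow θi = 10000^(−2(i−1)/d) over d/2 frequency pairs (the standard RoPE convention, assumed here), and `model` is only a placeholder module standing in for the paper's simplified self-attention network.

```python
import math
import torch

d = 64  # embedding dimension from the quoted setup

# RoPE frequencies: theta_i = 10000^(-2(i-1)/d) for i = 1, ..., d/2
theta = torch.tensor([10000.0 ** (-2 * (i - 1) / d)
                      for i in range(1, d // 2 + 1)])

# Decay-mask slope: m = -log(0.8), approximately 0.223
m = -math.log(0.8)

# AdamW configuration as quoted; the model itself is a placeholder
model = torch.nn.Linear(d, d)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3, weight_decay=1e-6)
```

The batch size of 128 and 100,000 training iterations from the quoted setup would then apply in the training loop, which is omitted here.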