On the Emergence of Position Bias in Transformers
Authors: Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. |
| Researcher Affiliation | Academia | 1MIT IDSS & LIDS 2MIT CSAIL 3TU Munich. Correspondence to: Xinyi Wu <EMAIL>. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured, code-like algorithmic descriptions. |
| Open Source Code | Yes | Our code is available at github.com/xinyiwu98/position-bias-in-attention. |
| Open Datasets | No | To ensure a controlled setup that enables precise manipulation of positional biases in the data, we adopt the synthetic data-generating process and simplified self-attention network framework proposed in Reddy (2024). ... Each xi is sampled from a Gaussian mixture model... |
| Dataset Splits | No | The paper describes generating 'training sequences' and 'test sequences' with specific structures, and mentions 'three pairs of test sets, each containing 10,000 sequences'. However, it does not provide explicit percentages or sample counts for the training/validation/test splits from a single dataset, nor does it refer to predefined standard splits for reproducibility. |
| Hardware Specification | Yes | We trained all of our models on a Tesla V100 GPU. |
| Software Dependencies | No | All models were implemented with PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | Following Reddy (2024), we set n = 8 and d = 64. ... we set γ = 0.75, K = 2048, L = 32, and B = 4. ... For the decay mask, we set m = log(0.8) ≈ −0.223. For RoPE, we set θi = 10000^(−2(i − 1)/d)... we used the AdamW optimizer... with a learning rate of 10^−3, a weight decay of 10^−6, a batch size of 128, and trained for 100,000 iterations. |
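The experiment-setup row above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the RoPE frequencies θi = 10000^(−2(i − 1)/d) and the AdamW hyperparameters are taken from the table, while the `torch.nn.Linear` model is a stand-in placeholder for the simplified self-attention network the authors actually train.

```python
import torch

d = 64  # embedding dimension reported in the experiment setup

# RoPE rotary frequencies: theta_i = 10000^(-2(i-1)/d) for i = 1, ..., d/2
i = torch.arange(1, d // 2 + 1, dtype=torch.float32)
theta = 10000.0 ** (-2.0 * (i - 1) / d)

# Placeholder model; the paper uses a simplified self-attention network instead.
model = torch.nn.Linear(d, d)

# Optimizer settings as reported: AdamW, lr = 1e-3, weight decay = 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-6)
```

Note that with this convention θ1 = 1 and the frequencies decay monotonically toward 10000^(−(d − 2)/d) for the last rotary pair.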