Transformers Handle Endogeneity in In-Context Linear Regression

Authors: Haodong Liang, Krishna Balasubramanian, Lifeng Lai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method in the presence of endogeneity. We conduct a simulation study to evaluate the performance of the ICL-pretrained transformer model in handling endogeneity.
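For context on the 2SLS baseline the paper compares against, here is a minimal numpy sketch of the classic two-stage least squares estimator. This is not the paper's code; the function name and data-generating setup below are illustrative.

```python
import numpy as np

def two_stage_least_squares(X, Z, y):
    """Classic 2SLS: project X onto the instruments Z, then regress y on the projection."""
    # First stage: fitted values X_hat = Z (Z'Z)^{-1} Z'X
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Second stage: beta_hat = (X_hat'X)^{-1} X_hat'y
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Illustrative synthetic data with endogeneity (dimensions p = 5, q = 10 as in the paper):
# a shared confounder u enters both X and y, so OLS is biased but 2SLS is consistent.
rng = np.random.default_rng(0)
n, p, q = 50_000, 5, 10
beta = np.arange(1.0, p + 1.0)            # true coefficients (hypothetical)
Pi = rng.normal(size=(q, p))              # first-stage coefficients
u = rng.normal(size=n)                    # unobserved confounder
Z = rng.normal(size=(n, q))               # instruments: independent of u
X = Z @ Pi + np.outer(u, np.ones(p)) + rng.normal(size=(n, p))
y = X @ beta + 2.0 * u + rng.normal(size=n)
beta_hat = two_stage_least_squares(X, Z, y)
```

With strong instruments and large n, `beta_hat` lands close to the true `beta` despite the endogenous regressors.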
Researcher Affiliation | Academia | Haodong Liang, UC Davis; Krishnakumar Balasubramanian, UC Davis; Lifeng Lai, UC Davis
Pseudocode | Yes | Algorithm 1 (In-Context Distribution P); Algorithm 2 (Extracting the regression coefficients)
Open Source Code | No | The paper does not contain any explicit statements about code release or links to code repositories. Phrases like "We release our code" or "The source code is available at" are absent.
Open Datasets | Yes | We use the dataset from the study of Angrist & Evans (1998).
Dataset Splits | Yes | We set the maximum input sample size to 51 (n = 50 training samples and one query sample)... For each run we randomly select 50 samples from the dataset, and make the boxplot of the estimated β over 500 runs.
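The evaluation protocol quoted here (draw 50 samples per run, collect the estimated β over 500 runs for a boxplot) can be sketched as a simple resampling loop. This is a generic sketch, not the paper's pipeline; the estimator passed in is a placeholder.

```python
import numpy as np

def resample_estimates(X, y, estimator, n_runs=500, n_samples=50, seed=0):
    """Repeatedly subsample (X, y) without replacement and collect one
    coefficient estimate per run, e.g. for a boxplot of beta-hat."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=n_samples, replace=False)
        estimates.append(estimator(X[idx], y[idx]))
    return np.asarray(estimates)  # shape: (n_runs, p)

# Illustrative usage with an OLS estimator on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=1000)
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
est = resample_estimates(X, y, ols)  # 500 runs of 50 samples each
```

The resulting `(500, 3)` array is what one would feed to a boxplot routine, one box per coefficient.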
Hardware Specification | Yes | The training of the transformer in our experiment was conducted on a Windows 11 machine with the following specifications: GPU: NVIDIA GeForce RTX 4090; CPU: Intel Core i9-14900KF; Memory: 32 GB DDR5-5600.
Software Dependencies | No | The paper mentions "GPT-2 settings" for the transformer backbone but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We set the maximum input sample size to 51 (n = 50 training samples and one query sample), the dimension of endogenous variable p = 5, and the dimension of instrument q = 10. The backbone of the transformer block is initialized using GPT-2 settings, with 12 attention heads (M = 12), 80-dimensional embedding space (D = 80) and 2 layers (L0 = 2)... We employ the looped transformer architecture, consisting of 10 identical cascading transformer blocks. The transformer model is trained under the ICL loss (11) with a batch size of N = 64, over a total of 300,000 training steps. The noise level σϵ is set to 1.
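The reported hyperparameters can be collected into a single configuration object, which makes the setup easier to scan and reuse. The values below are exactly those stated in the paper; the dataclass and its field names are illustrative, not the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ICLTrainingConfig:
    """Experiment hyperparameters as reported; field names are illustrative."""
    n_context: int = 50          # training samples per prompt (plus one query sample)
    p: int = 5                   # dimension of the endogenous variable
    q: int = 10                  # dimension of the instrument
    n_heads: int = 12            # M, attention heads
    d_embed: int = 80            # D, embedding dimension
    n_layers_per_block: int = 2  # L0, GPT-2-style layers per block
    n_loops: int = 10            # identical cascading blocks (looped transformer)
    batch_size: int = 64         # N
    train_steps: int = 300_000
    noise_level: float = 1.0     # sigma_epsilon

cfg = ICLTrainingConfig()
```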