Transformers Handle Endogeneity in In-Context Linear Regression
Authors: Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method in the presence of endogeneity. We conduct a simulation study to evaluate the performance of the ICL-pretrained transformer model in handling endogeneity. |
| Researcher Affiliation | Academia | Haodong Liang (UC Davis), Krishnakumar Balasubramanian (UC Davis), Lifeng Lai (UC Davis) |
| Pseudocode | Yes | Algorithm 1 In-Context Distribution P Algorithm 2 Extracting the regression coefficients |
| Open Source Code | No | The paper does not contain any explicit statements about code release or links to code repositories. Phrases like "We release our code" or "The source code is available at" are absent. |
| Open Datasets | Yes | We use the dataset from the study of Angrist & Evans (1998). |
| Dataset Splits | Yes | We set the maximum input sample size to 51 (n = 50 training samples and one query sample)... For each run we randomly select 50 samples from the dataset, and make the boxplot of the estimated β over 500 runs. |
| Hardware Specification | Yes | The training of the transformer in our experiment was conducted on a Windows 11 machine with the following specifications: GPU: NVIDIA GeForce RTX 4090 CPU: Intel Core i9-14900KF Memory: 32 GB DDR5, 5600MHz |
| Software Dependencies | No | The paper mentions "GPT-2 settings" for the transformer backbone but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We set the maximum input sample size to 51 (n = 50 training samples and one query sample), the dimension of endogenous variable p = 5, and the dimension of instrument q = 10. The backbone of the transformer block is initialized using GPT-2 settings, with 12 attention heads (M = 12), 80-dimensional embedding space (D = 80) and 2 layers (L0 = 2)... We employ the looped transformer architecture, consisting of 10 identical cascading transformer blocks. The transformer model is trained under the ICL loss (11) with a batch size of N = 64, over a total of 300,000 training steps. The noise level σϵ is set to 1. |
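The 2SLS baseline and the resampling scheme from the Dataset Splits row can be sketched together. The sketch below uses the dimensions quoted in the table (p = 5, q = 10, n = 50, σϵ = 1, 500 runs), but the linear-IV data-generating process, the synthetic pool standing in for the Angrist & Evans (1998) data, and all variable names (`Z_pool`, `Pi`, `two_stage_ls`, ...) are assumptions for illustration, not the paper's exact Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions quoted from the table; the DGP itself is hypothetical.
p, q, n, n_runs = 5, 10, 50, 500
sigma_eps = 1.0

# Large synthetic pool standing in for the Angrist & Evans (1998) dataset.
N_pool = 10_000
Pi = rng.normal(size=(q, p)) / np.sqrt(q)   # first-stage coefficients (assumed)
beta = rng.normal(size=p)                   # true structural coefficients
Z_pool = rng.normal(size=(N_pool, q))       # instruments
u = rng.normal(size=(N_pool, p))            # regressor noise, reused in eps
eps = sigma_eps * rng.normal(size=N_pool) + u.sum(axis=1)  # correlated with u -> endogeneity
X_pool = Z_pool @ Pi + u                    # endogenous regressors
y_pool = X_pool @ beta + eps

def two_stage_ls(Z, X, y):
    """Two-stage least squares: project X onto the instrument span, then run OLS."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]    # second stage

# Resampling scheme from the Dataset Splits row: 500 runs, each on 50 random
# samples; the collected estimates are what the paper's boxplots summarize.
estimates = np.empty((n_runs, p))
for r in range(n_runs):
    idx = rng.choice(N_pool, size=n, replace=False)
    estimates[r] = two_stage_ls(Z_pool[idx], X_pool[idx], y_pool[idx])
```

Because `eps` shares the noise `u` with `X_pool`, plain OLS on `(X_pool, y_pool)` would be biased; instrumenting through `Z_pool` is what makes the estimates recover `beta`, which is the baseline the paper's transformer is compared against.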
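The Experiment Setup row's input format (51 tokens: n = 50 labeled samples plus one query, embedded in D = 80 dimensions) can be sketched as a prompt-packing step. The packing layout below — one token per sample holding `[z_i, x_i, y_i]` zero-padded to width D, with the query's label slot left at zero — is a plausible assumption, not the paper's exact Algorithm 1.

```python
import numpy as np

# Sizes quoted from the Experiment Setup row: n = 50 in-context samples plus
# one query (51 tokens), p = 5, q = 10, embedding width D = 80.
n, p, q, D = 50, 5, 10, 80

rng = np.random.default_rng(1)
Z = rng.normal(size=(n + 1, q))   # instruments; last row is the query point
X = rng.normal(size=(n + 1, p))   # endogenous regressors
y = rng.normal(size=n)            # labels for the n training samples only

# Assumed packing: each sample becomes one token [z_i, x_i, y_i, 0, ...],
# zero-padded to D; the query token's label slot stays zero for the model
# to fill in.
prompt = np.zeros((n + 1, D))
prompt[:, :q] = Z
prompt[:, q:q + p] = X
prompt[:n, q + p] = y

# In the paper's looped architecture, the same L0 = 2-layer, 12-head block
# would then be applied 10 times to `prompt`; here we only build the input.
assert prompt.shape == (51, 80)
```

A batch for the ICL loss would stack N = 64 such prompts, each drawn from a fresh in-context distribution, over the 300,000 training steps the table reports.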