Towards Auto-Regressive Next-Token Prediction: In-context Learning Emerges from Generalization

Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets. ... We perform experiments on numerical linear dynamic system, synthetic GINC and real-word language datasets (Section 5 and Appendix D). |
| Researcher Affiliation | Academia | Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. |
| Pseudocode | No | The paper describes several algorithms (Stochastic Gradient Descent (SGD), Gradient Langevin Dynamics (GLD), and Continuous Langevin Dynamics (CLD)) in the text and proofs, but it does not present any of them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/zx-gong/ICL-Emerge. |
| Open Datasets | Yes | Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets. ... Experiments on Synthetic Language Dataset GINC. Inspired by Xie et al. (2021), we first perform experiments on the synthetic language dataset GINC to verify our theory. ... Experiments on Real-world Language Dataset. We further perform experiments on real-world language datasets, inspired by (Min et al., 2021; Wang et al., 2023). ... All the datasets are obtained from Hugging Face. |
| Dataset Splits | No | The paper describes the organization of pre-training data by number of topics (K), sequences per topic (N), and sequence length (T), and mentions an 'ICL phase' for testing. However, it does not provide percentages or absolute counts for conventional train/validation/test splits, nor does it refer to standard predefined splits for the datasets used. |
| Hardware Specification | Yes | All ICL experiments are trained and evaluated using the same GPT-2 architecture with 12 layers, 8 attention heads, and 256-dimensional embeddings, on NVIDIA 3090 GPUs. ... All experiments on GINC are conducted using a single 24GB NVIDIA GeForce RTX 3090. ... All experiments are conducted using four 24GB NVIDIA GeForce RTX 3090 and 40GB A100 GPUs. |
| Software Dependencies | No | The paper mentions using a 'GPT-2 architecture' and the 'AdamW optimizer', and adopts code from other papers (Xie et al., 2021; Wang et al., 2023). However, it does not specify version numbers for any programming language (e.g., Python), library (e.g., PyTorch, TensorFlow, Hugging Face Transformers), or other key software component. |
| Experiment Setup | Yes | We train the GPT-2 model with GINC dataset using ... the AdamW optimizer with a batch size of 8 and a linear learning rate schedule. The schedule includes a warmup phase of 1000 steps, up to the learning rate of 8e-4. ... We train the GPT2-large model with a batch size of 16 and a learning rate of 1e-4 for total 30,000 iterations. |
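The Experiment Setup row quotes a linear learning rate schedule with a 1000-step warmup to a peak of 8e-4. A minimal sketch of such a schedule is shown below; the peak rate and warmup length come from the paper's reported GINC setup, while the decay horizon (`total_steps`) and the decay-to-zero behavior are illustrative assumptions, since the paper excerpt does not state them for this run.

```python
def linear_schedule_lr(step, peak_lr=8e-4, warmup_steps=1000, total_steps=30_000):
    """Linear warmup to peak_lr, then linear decay back to zero.

    peak_lr and warmup_steps follow the paper's reported GINC setup;
    total_steps and the decay shape are assumptions for illustration.
    """
    if step < warmup_steps:
        # warmup: ramp linearly from 0 to peak_lr over warmup_steps
        return peak_lr * step / warmup_steps
    # decay: ramp linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

In a typical PyTorch setup this function would be handed to a `LambdaLR`-style scheduler (as a multiplier of the base rate) alongside an AdamW optimizer; the pure-Python form above just makes the shape of the schedule explicit.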