Towards Auto-Regressive Next-Token Prediction: In-context Learning Emerges from Generalization

Authors: Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets. ... We perform experiments on numerical linear dynamic system, synthetic GINC and real-word language datasets (Section 5 and Appendix D). |
| Researcher Affiliation | Academia | Zixuan Gong, Xiaolin Hu, Huayi Tang, Yong Liu. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. |
| Pseudocode | No | The paper describes several algorithms (Stochastic Gradient Descent (SGD), Gradient Langevin Dynamics (GLD), and Continuous Langevin Dynamics (CLD)) in the text and proofs, but it does not present any of them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/zx-gong/ICL-Emerge. |
| Open Datasets | Yes | Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets. ... Experiments on Synthetic Language Dataset GINC. Inspired by Xie et al. (2021), we first perform experiments on the synthetic language dataset GINC to verify our theory. ... Experiments on Real-world Language Dataset. We further perform experiments on real-world language datasets, inspired by (Min et al., 2021; Wang et al., 2023). ... All the datasets are obtained from Hugging Face. |
| Dataset Splits | No | The paper describes the organization of pre-training data by number of topics (K), sequences per topic (N), and sequence length (T), and mentions an 'ICL phase' for testing. However, it does not provide percentages or absolute counts for conventional train/validation/test splits, nor does it refer to standard predefined splits for the datasets used. |
| Hardware Specification | Yes | All ICL experiments are trained and evaluated using the same GPT-2 architecture with 12 layers, 8 attention heads, and 256-dimensional embeddings, on NVIDIA 3090 GPUs. ... All experiments on GINC are conducted using a single 24GB NVIDIA GeForce RTX 3090. ... All experiments are conducted using four 24GB NVIDIA GeForce RTX 3090 and 40GB A100 GPUs. |
| Software Dependencies | No | The paper mentions using a 'GPT-2 architecture' and the 'AdamW optimizer', and adopts code from other papers (Xie et al., 2021; Wang et al., 2023). However, it does not specify version numbers for any programming language (e.g., Python), library (e.g., PyTorch, TensorFlow, Hugging Face Transformers), or other key software component. |
| Experiment Setup | Yes | We train the GPT-2 model with GINC dataset using ... the AdamW optimizer with a batch size of 8 and a linear learning rate schedule. The schedule includes a warmup phase of 1000 steps, up to the learning rate of 8e-4. ... We train the GPT2-large model with a batch size of 16 and a learning rate of 1e-4 for total 30,000 iterations. |
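The Experiment Setup row quotes a linear learning rate schedule with a 1000-step warmup to a peak of 8e-4. A minimal sketch of such a schedule is shown below; the peak rate and warmup length come from the paper's reported GINC setup, while the decay horizon (`total_steps`) and the decay-to-zero behavior are illustrative assumptions, since the paper excerpt does not state them for this run.

```python
def linear_schedule_lr(step, peak_lr=8e-4, warmup_steps=1000, total_steps=30_000):
    """Linear warmup to peak_lr, then linear decay back to zero.

    peak_lr and warmup_steps follow the paper's reported GINC setup;
    total_steps and the decay shape are assumptions for illustration.
    """
    if step < warmup_steps:
        # warmup: ramp linearly from 0 to peak_lr over warmup_steps
        return peak_lr * step / warmup_steps
    # decay: ramp linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

In a typical PyTorch setup this function would be handed to a `LambdaLR`-style scheduler (as a multiplier of the base rate) alongside an AdamW optimizer; the pure-Python form above just makes the shape of the schedule explicit.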