Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification
Authors: Shang Liu, Zhongze Cai, Guanting Chen, Xiaocheng Li
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to developing theories, we empirically demonstrate the effectiveness of Transformers at in-context prediction of the mean and quantification of the variance in regression tasks. We design a series of out-of-distribution (OOD) experiments, a topic that has generated significant interest within the community (Garg et al. (2022); Raventós et al. (2024); Singh et al. (2024)). These experiments provide insights into designing the pretraining process and understanding the ICL capabilities of transformers. |
| Researcher Affiliation | Academia | Shang Liu (Imperial College Business School, Imperial College London); Zhongze Cai (Imperial College Business School, Imperial College London); Guanting Chen (Department of Statistics and Operations Research, University of North Carolina); Xiaocheng Li (Imperial College Business School, Imperial College London) |
| Pseudocode | No | The paper describes methods and derivations in textual format and mathematical equations, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | No | The training data is generated synthetically based on specified statistical distributions (e.g., P_X: x_t^(i) i.i.d. ∼ N(0, I_d); P_ε: ε_t^(i) i.i.d. ∼ N(0, 1)), rather than using a pre-existing public dataset. No public access information for any dataset is provided. |
| Dataset Splits | No | The validation and testing sets are randomly generated for each evaluation, and the training data is generated afresh for each batch. The paper does not specify fixed or reproducible training/test/validation splits with exact percentages or sample counts, nor does it reference standard predefined splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers for the implementation of their work. It only mentions 'transformers package of Hugging Face' in the context of other works, without a version number. |
| Experiment Setup | Yes | Throughout the paper, we consider the dimension d = 8. The batch size is b = 64. All the numerical experiments in our paper run for 200,000 batches. For the basic setup, the two noise-intensity parameters (subscripts not recoverable from the extracted text) are both set to 20. For the OOD experiments, the parameter pairs are: S-OOD (80, 20); M-OOD (100, 400); L-OOD (100, 1600). For the length-shift experiments, models are trained on prompts with lengths ranging from 1 to 44 or from 45 to 100. |
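The synthetic data generation summarized in the table (x drawn from N(0, I_d), noise from N(0, 1), d = 8, batch size 64, prompt lengths up to 44) can be sketched as follows. Note this is a minimal illustration under assumptions: the paper specifies only the covariate and noise distributions here, so the linear task with weights w ~ N(0, I_d) and the function name `sample_batch` are our own choices, not the authors' implementation.

```python
import numpy as np

def sample_batch(batch_size=64, d=8, prompt_len=44, seed=None):
    """Sketch one batch of synthetic in-context regression prompts.

    Covariates x ~ N(0, I_d) and noise eps ~ N(0, 1) follow the report;
    the per-task linear weights w ~ N(0, I_d) are an assumption, since
    the excerpt does not state the task prior.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((batch_size, d))               # assumed task weights
    X = rng.standard_normal((batch_size, prompt_len, d))   # covariates: N(0, I_d)
    eps = rng.standard_normal((batch_size, prompt_len))    # noise: N(0, 1)
    y = np.einsum("bld,bd->bl", X, w) + eps                # responses per prompt
    return X, y

X, y = sample_batch(seed=0)
print(X.shape, y.shape)  # (64, 44, 8) (64, 44)
```

Because each batch is drawn afresh (as the Dataset Splits row notes), reproducing the training data exactly would additionally require the authors' random seeds, which the paper does not report.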