Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

Authors: Andrew Jesson, Nicolas Beltran-Velez, David Blei

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using large language models, we empirically evaluate our method on tasks involving tabular data, imaging data, and natural language data.
Researcher Affiliation | Academia | Correspondence to EMAIL. Department of Statistics, Columbia University. Department of Computer Science, Columbia University.
Pseudocode | Yes | Algorithm 1 (gpc), Algorithm 2 (ppc), Algorithm 3 (lite gpc)
Open Source Code | No | The paper contains no explicit statement that its code is publicly available, and it provides no links to code repositories for the described methodology.
Open Datasets | Yes | The in-capability data are the SST2 (Socher et al., 2013) sentiment analysis (positive vs. negative) and AG News (Zhang et al., 2015) topic classification (World, Sports, Business, Sci/Tech) datasets. The out-of-capability data are the Medical Questions Pairs (MQP) (McCreery et al., 2020) differentiation (similar vs. different) and RTE (Dagan et al., 2006) natural language inference (entailment vs. not entailment) datasets. For imaging ICL experiments, we use SVHN for in-distribution data (Netzer et al., 2011), MNIST as near OOD data (LeCun et al., 1998), and CIFAR-10 as far OOD data (Krizhevsky et al., 2009).
Dataset Splits | Yes | For tabular tasks, ... The training data comprise 8000 unique in-distribution datasets with 2000 (z, y) examples each. ... In-distribution test data comprise a set of 200 new random datasets with 500 (z, y) examples each. The OOD test data comprise 200 random datasets with 500 (z, y) examples each. For imaging ICL experiments, we use SVHN for in-distribution data (Netzer et al., 2011). Our Llama-2 regression model ... is fit to random sequences of 16 images from the SVHN "extra" split, which has over 500k examples.
Hardware Specification | No | The paper mentions the use of a 'Llama-2 regression model', a 'pre-trained Llama-2 7B', and 'Gemma-2 9B LLMs', but does not specify the underlying hardware (e.g., GPU models, CPU types) on which these models were run or trained for the experiments.
Software Dependencies | No | The paper mentions using specific models like "Llama-2 7B" and "Gemma-2 9B" but does not provide version numbers for any software libraries, frameworks, or programming languages used in the implementation of the method.
Experiment Setup | Yes | The parameters for Algorithm 1 are M = 40 replications and N_n = 200 generated examples. The ICL dataset x_n size is varied from n = 2 to n = 200. Next, we evaluate whether the generative predictive p-value effectively predicts out-of-capability natural language tasks. The parameters for Algorithm 1 are M = 20 replications and N_n = 10 generated examples. The ICL dataset x_n size is varied from n = 4 to n = 64. For generative fill... The parameters for Algorithm 1 are M = 100 replications and N_n = 8 generated examples. The ICL dataset x_n size is varied from n = 2 to n = 8. Selecting a significance level α yields a binary predictor of model capability 1{p_gpc < α}; a model is predicted incapable if the estimated generative predictive p-value is less than the significance level. We report results for significance levels α ∈ {0.01, 0.05, 0.1, 0.2, 0.5}.
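The decision rule in the last row, 1{p_gpc < α}, can be sketched as a generic Monte Carlo predictive check. This is not a reproduction of the paper's Algorithm 1; it is a minimal sketch under stated assumptions: `sample_from_model` is a hypothetical callable that draws N generated examples conditioned on the ICL dataset, `statistic` is a placeholder discrepancy measure, and the p-value uses the standard smoothed Monte Carlo estimator with the reported default of M = 40 replications.

```python
import numpy as np

def generative_predictive_check(statistic, x_obs, sample_from_model,
                                M=40, N=200, rng=None):
    """Smoothed Monte Carlo estimate of a generative predictive p-value.

    Sketch only: `sample_from_model(x_obs, N, rng)` is a hypothetical
    callable standing in for the paper's generative model; `statistic`
    is a placeholder test statistic, not the one used in the paper.
    """
    rng = np.random.default_rng(rng)
    t_obs = statistic(x_obs)
    # M replications: generate N examples from the model, compute the
    # statistic on each replicated dataset.
    t_rep = np.array([statistic(sample_from_model(x_obs, N, rng))
                      for _ in range(M)])
    # Smoothed p-value: fraction of replications at least as extreme
    # as the observed statistic, with add-one smoothing.
    return (1 + np.sum(t_rep >= t_obs)) / (M + 1)

def predict_incapable(p_value, alpha=0.05):
    # Binary capability predictor 1{p_gpc < alpha}: the model is
    # predicted incapable when the p-value falls below level alpha.
    return p_value < alpha
```

Sweeping `alpha` over the reported levels {0.01, 0.05, 0.1, 0.2, 0.5} then trades off how aggressively the predictor flags a model as incapable.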