Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

Authors: Andrew Jesson, Nicolas Beltran-Velez, David Blei

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using large language models, we empirically evaluate our method on tasks involving tabular data, imaging data, and natural language data.
Researcher Affiliation | Academia | Correspondence to EMAIL. Department of Statistics, Columbia University. Department of Computer Science, Columbia University.
Pseudocode | Yes | Algorithm 1 (gpc), Algorithm 2 (ppc), Algorithm 3 (lite gpc)
Open Source Code | No | The paper contains no explicit statement that its code is publicly available, and it provides no links to code repositories for the described methodology.
Open Datasets | Yes | The in-capability data are the SST2 (Socher et al., 2013) sentiment analysis (positive vs. negative) and AG News (Zhang et al., 2015) topic classification (World, Sports, Business, Sci/Tech) datasets. The out-of-capability data are the Medical Questions Pairs (MQP) (McCreery et al., 2020) differentiation (similar vs. different) and RTE (Dagan et al., 2006) natural language inference (entailment vs. not entailment) datasets. For imaging ICL experiments, we use SVHN for in-distribution data (Netzer et al., 2011), MNIST as near OOD data (LeCun et al., 1998), and CIFAR-10 as far OOD data (Krizhevsky et al., 2009).
Dataset Splits | Yes | For tabular tasks, ... The training data comprise 8000 unique in-distribution datasets with 2000 (z, y) examples each. ... In-distribution test data comprise a set of 200 new random datasets with 500 (z, y) examples each. The OOD test data comprise 200 random datasets with 500 (z, y) examples each. For imaging ICL experiments, we use SVHN for in-distribution data (Netzer et al., 2011). Our Llama-2 regression model ... is fit to random sequences of 16 images from the SVHN "extra" split, which has over 500k examples.
Hardware Specification | No | The paper mentions the use of a 'Llama-2 regression model', a 'pre-trained Llama-2 7B', and 'Gemma-2 9B LLMs', but does not specify the underlying hardware (e.g., GPU models, CPU types) on which these models were run or trained for the experiments.
Software Dependencies | No | The paper mentions using specific models like "Llama-2 7B" and "Gemma-2 9B" but does not provide version numbers for any software libraries, frameworks, or programming languages used in the implementation of the method.
Experiment Setup | Yes | The parameters for Algorithm 1 are M = 40 replications and N_n = 200 generated examples. The ICL dataset x_n size is varied from n = 2 to n = 200. Next, we evaluate whether the generative predictive p-value effectively predicts out-of-capability natural language tasks. The parameters for Algorithm 1 are M = 20 replications and N_n = 10 generated examples. The ICL dataset x_n size is varied from n = 4 to n = 64. For generative fill... The parameters for Algorithm 1 are M = 100 replications and N_n = 8 generated examples. The ICL dataset x_n size is varied from n = 2 to n = 8. Selecting a significance level α yields a binary predictor of model capability 1{p_gpc < α}; a model is predicted incapable if the estimated generative predictive p-value is less than the significance level. We report results for significance levels α ∈ {0.01, 0.05, 0.1, 0.2, 0.5}.
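The decision rule in the last row, 1{p_gpc < α}, can be sketched as a generic Monte Carlo predictive check. This is not a reproduction of the paper's Algorithm 1; it is a minimal sketch under stated assumptions: `sample_from_model` is a hypothetical callable that draws N generated examples conditioned on the ICL dataset, `statistic` is a placeholder discrepancy measure, and the p-value uses the standard smoothed Monte Carlo estimator with the reported default of M = 40 replications.

```python
import numpy as np

def generative_predictive_check(statistic, x_obs, sample_from_model,
                                M=40, N=200, rng=None):
    """Smoothed Monte Carlo estimate of a generative predictive p-value.

    Sketch only: `sample_from_model(x_obs, N, rng)` is a hypothetical
    callable standing in for the paper's generative model; `statistic`
    is a placeholder test statistic, not the one used in the paper.
    """
    rng = np.random.default_rng(rng)
    t_obs = statistic(x_obs)
    # M replications: generate N examples from the model, compute the
    # statistic on each replicated dataset.
    t_rep = np.array([statistic(sample_from_model(x_obs, N, rng))
                      for _ in range(M)])
    # Smoothed p-value: fraction of replications at least as extreme
    # as the observed statistic, with add-one smoothing.
    return (1 + np.sum(t_rep >= t_obs)) / (M + 1)

def predict_incapable(p_value, alpha=0.05):
    # Binary capability predictor 1{p_gpc < alpha}: the model is
    # predicted incapable when the p-value falls below level alpha.
    return p_value < alpha
```

Sweeping `alpha` over the reported levels {0.01, 0.05, 0.1, 0.2, 0.5} then trades off how aggressively the predictor flags a model as incapable.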