Large (Vision) Language Models are Unsupervised In-Context Learners

Authors: Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, Amir Zamir, Maria Brbic

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and the API-only access GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including 39% absolute improvement on the challenging GSM8K math reasoning dataset.
Researcher Affiliation Academia Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, Amir Zamir, Maria Brbic — Swiss Federal Institute of Technology (EPFL)
Pseudocode Yes Algorithm B1 (Amortized Approach)
1: Input: dataset D, foundation model p_FM(·), hyperparameter N, LoRA task encoder τ_θ(·) with parameters θ, regularization strength γ, number of iterations T, learning rate α, batch size B
2: Initialize θ_0 such that τ_{θ_0} = p_FM
3: for t = 0 to T−1 do
4:   Sample mini-batch x_1^b, …, x_N^b ∼ D, for b = 1, …, B
5:   Sample answers y_n^b ∼ τ_{θ_t}(·|x_n^b), for n = 1, …, N; b = 1, …, B
6:   Estimate τ_{θ_t}^prior(·) = (1/(N·B)) Σ_{b=1}^B Σ_{n=1}^N τ_{θ_t}(·|x_n^b)
7:   Compute the objective O_t = (1/B) Σ_{b=1}^B Σ_{n=1}^N J_n^N(y_1, …, y_n) + γ·R(τ_{θ_t}^prior)
8:   Compute the gradient estimator g_t via Eq. (17)
9:   Update the parameters: θ_{t+1} = θ_t + α·g_t
10: end for
11: Produce answers y_n = argmax_{y ∈ Y} τ_{θ_T}(y|x) for all x ∈ D
12: Output: answers for D

Algorithm B2 (Multi-Turn Approach)
1: Input: dataset D, foundation model p_FM(·), hyperparameter N, number of turns T, number of repeats N_r
2: Initialize answers with zero-shot predictions: D_0 = {(x, y) | x ∈ D, y ∼ p_FM(·|x)}
3: for t = 1 to T do
4:   Initialize D_t = ∅
5:   for x ∈ D do
6:     for n = 1 to N_r do
7:       Sample support examples labeled in the previous turn: (x_1, y_1^{t−1}), …, (x_N, y_N^{t−1}) ∼ D_{t−1}
8:       Obtain answer: y_n^x ∼ p_FM(·|x, (x_1, y_1^{t−1}), …, (x_N, y_N^{t−1}))
9:     end for
10:    Take the majority vote over the N_r options: y^x = MAJ(y_1^x, …, y_{N_r}^x)
11:    Update answers: D_t = D_t ∪ {(x, y^x)}
12:   end for
13: end for
14: Take the answers from the last turn: {y_n | (x_n, y_n) ∈ D_T}
15: Output: answers for D
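Algorithm B2 can be sketched as a short, runnable loop. Below is a minimal illustration (names and the `sample_answer` stub are ours, not from the released code): `sample_answer(x, support, rng)` stands in for sampling from p_FM conditioned on the query and the support examples labeled in the previous turn.

```python
import random
from collections import Counter

def multi_turn(dataset, sample_answer, n_support=4, n_turns=2, n_repeats=5, seed=0):
    """Sketch of the multi-turn approach: each turn conditions the model on
    examples labeled in the previous turn, samples n_repeats answers per
    query, and takes a majority vote."""
    rng = random.Random(seed)
    # Turn 0: zero-shot predictions (empty support set).
    labeled = [(x, sample_answer(x, [], rng)) for x in dataset]
    for _ in range(n_turns):
        new_labeled = []
        for x in dataset:
            votes = []
            for _ in range(n_repeats):
                # Support examples carry labels produced in the previous turn.
                support = rng.sample(labeled, min(n_support, len(labeled)))
                votes.append(sample_answer(x, support, rng))
            # Majority vote over the n_repeats sampled answers.
            new_labeled.append((x, Counter(votes).most_common(1)[0][0]))
        labeled = new_labeled
    return dict(labeled)
```

With a real model, `sample_answer` would format the support pairs into an in-context prompt; here any callable with that signature works.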
Open Source Code Yes Code is publicly available at https://github.com/mlbio-epfl/joint-inference.
Open Datasets Yes We evaluate our methods on 13 benchmark datasets, spanning various NLP tasks. For sentiment analysis, we use SST2 (Socher et al., 2013), ... and Amazon (McAuley & Leskovec, 2013), ... For topic classification, we use AG-News (Zhang et al., 2015), ... TREC (Voorhees & Tice, 2000), and DBpedia-14 (Lehmann et al., 2015), ... SUBJ (Pang & Lee, 2004) is used for classifying sentences as subjective or objective. For natural language inference, we use RTE (Wang et al., 2018) ... QNLI (Rajpurkar et al., 2016) ... and MNLI (Williams et al., 2018), ... We also include COPA (Roemmele et al., 2011) and HellaSwag (Zellers et al., 2019) for story completion, BoolQ (Clark et al., 2019) for yes/no question answering, and PIQA (Bisk et al., 2020) for physical commonsense reasoning. For open-ended questions, GSM8K (Cobbe et al., 2021) ... MMLU (Hendrycks et al., 2021) and MMLU-Pro (Wang et al., 2024) ... We evaluate our method on ten vision datasets, including four image classification tasks and six visual question-answering tasks. For image classification, we use CIFAR10 (Krizhevsky et al., 2009), ... CIFAR100 (Krizhevsky et al., 2009) and ImageNet-100 (Deng et al., 2009), ... We also include Food101 (Bossard et al., 2014), ... For visual question answering, we use COCO-Color and COCO-Number, both derived from VQAv2 (Goyal et al., 2018), VQAv2 itself and VizWiz (Gurari et al., 2018). Furthermore, we use the challenging MMMU (Yue et al., 2024a) and MMMU-Pro (Yue et al., 2024b) datasets.
Dataset Splits Yes For each dataset, we randomly sample 2,000 examples as the train split for unsupervised learning, and 1,000 examples as the test split for evaluation (except for COPA where there are only 500 examples in total). We balance labels in both train split and test split. For GSM8K (Cobbe et al., 2021), we use the whole test set which contains 1319 examples for the evaluation.
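The label-balanced sampling described above can be reproduced with a short helper. This is a sketch under our own naming conventions, not the paper's released code:

```python
import random
from collections import defaultdict

def balanced_sample(examples, labels, n_total, seed=0):
    """Sample n_total examples with an equal count per label
    (each class capped by its pool size)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)
    per_class = n_total // len(by_label)
    split = []
    for y in sorted(by_label):
        pool = by_label[y]
        # Sample without replacement from each class's pool.
        split.extend(rng.sample(pool, min(per_class, len(pool))))
    rng.shuffle(split)
    return split
```

Drawing the 2,000-example train split and the 1,000-example test split would amount to two disjoint calls of this kind over the dataset's label set.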
Hardware Specification Yes The typical training time is 12h for text tasks and 4h for vision tasks, on one NVIDIA H100 GPU.
Software Dependencies No For NLP tasks with Llama-3.1, we also use flash-attention (Dao et al., 2022) and 4-bit quantization of the model provided by the Unsloth library to improve efficiency. We found that with the improved gradient estimator, training is less sensitive to the hyperparameters. Thus we do not customize hyperparameters for each dataset, and instead use a learning rate of 1e-5 with the Adam optimizer for all datasets. The model is fine-tuned for 6,000 iterations and training usually converges at around 2,000 iterations. We train our model with 64 examples in each mini-batch. We use context length N = 16 for the main experiments and provide an ablation study on the effect of N in Appendix D.1. Similarly, for vision experiments, we train our model for 3,000 iterations with a learning rate of 1e-4 and 256 examples at each iteration. The typical training time is 12h for text tasks and 4h for vision tasks, on one NVIDIA H100 GPU.
Experiment Setup Yes Thus we do not customize hyperparameters for each dataset, and instead use a learning rate of 1e-5 with the Adam optimizer for all datasets. The model is fine-tuned for 6,000 iterations and training usually converges at around 2,000 iterations. We train our model with 64 examples in each mini-batch. We use context length N = 16 for the main experiments and provide an ablation study on the effect of N in Appendix D.1. Similarly, for vision experiments, we train our model for 3,000 iterations with a learning rate of 1e-4 and 256 examples at each iteration.
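The reported hyperparameters can be collected into one place for reference. The packaging below is ours (the values are from the quoted setup; the function name and dictionary keys are illustrative, not from the released code):

```python
def training_config(task):
    """Return the reported amortized-training hyperparameters for a task family."""
    base = {"optimizer": "Adam"}
    if task == "nlp":
        # Llama-3.1 NLP tasks: shared settings across all 13 datasets.
        base.update(lr=1e-5, iterations=6000, batch_size=64, context_length=16)
    elif task == "vision":
        # Vision tasks: higher learning rate, larger per-iteration batch.
        base.update(lr=1e-4, iterations=3000, batch_size=256)
    else:
        raise ValueError(f"unknown task family: {task}")
    return base
```

Keeping one shared configuration per task family mirrors the paper's claim that the improved gradient estimator removes the need for per-dataset tuning.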