From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation

Authors: Xingchen Wan, Han Zhou, Ruoxi Sun, Sercan Arik

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental On Gemini, Claude, and Mistral LLMs of different sizes, we show BRIDGE led to significant improvements across a diverse set of tasks including symbolic reasoning, numerical reasoning and code generation.
Researcher Affiliation Collaboration Xingchen Wan¹, Han Zhou¹,³, Ruoxi Sun², Sercan Ö. Arık¹ (¹Google Cloud AI Research, ²Google DeepMind, ³University of Cambridge; work done at Google Cloud AI Research)
Pseudocode Yes Algorithm 1 BRIDGE. 1: Input: train set D_t, validation set D_v (can be the same as the train set), number of iteration rounds K ∈ N (outer-loop), evaluation budget for BO per iteration n_eval (inner-loop).
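Since the authors' code is not released, the alternating structure of Algorithm 1 can only be sketched. The following is a minimal, hypothetical skeleton of the outer optimize-and-generate loop; the helper functions are trivial placeholders standing in for the Bayesian-optimization subset selection and the LLM generation step, not the paper's implementation.

```python
def optimize_subset(examples, val_set, budget):
    # Placeholder for the inner-loop Bayesian optimization over example
    # subsets, limited to `budget` evaluations on the validation set.
    return examples[: max(1, len(examples) // 2)]

def generate_examples(train_set, subset):
    # Placeholder for prompting the LLM, conditioned on the optimized
    # subset, to regenerate rationales for the full train set.
    return [(query, f"rationale-for-{query}") for query, _ in train_set]

def bridge(train_set, val_set, K=3, n_eval=32):
    """Hypothetical skeleton of BRIDGE's Algorithm 1 (not the authors' code)."""
    examples = list(train_set)          # initial demonstrations
    for _ in range(K):                  # outer loop: K iteration rounds
        subset = optimize_subset(examples, val_set, budget=n_eval)
        examples = generate_examples(train_set, subset)
    return examples
```

The key point the sketch captures is the nesting: each of the K outer rounds spends an inner-loop budget of n_eval validation evaluations before regenerating examples.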
Open Source Code No The paper does not contain an explicit statement by the authors providing access to their source code for the BRIDGE methodology. It only provides links to datasets used for evaluation.
Open Datasets Yes The BBH dataset is publicly available at https://github.com/suzgunmirac/BIG-Bench-Hard under an MIT license. For all BBH tasks, we use the prompt templates below: ... The MATH dataset is available at https://github.com/hendrycks/math and GSM-Hard is available at https://huggingface.co/datasets/reasoning-machines/gsm-hard. Both datasets are licensed under an MIT license. ... All data, including the databases, schemas and ground-truth gold SQL are available at the official repo: https://bird-bench.github.io under a CC BY-SA 4.0 licence.
Dataset Splits Yes For all tasks, we randomize the data points and reserve 40% (usually 100 samples, but some sub-tasks of the BBH benchmark have fewer data points) as held-out sets for testing, whose inputs and labels are not revealed to the model except for final evaluation. For the rest of the dataset, in Sec. 2, we use 50% (30% of all available data points including the held-out test set) as the train set from which the examples are generated and the other 50% for validation (i.e., the split from which the results in Fig. 4 are generated). In Sec. 4, we do not use the aforementioned validation set and use performance on the same set that generates the examples as the optimization objective. ... On BIRD, we randomly sample 128 samples from the train split as the unified train and validation set and use the official test set (of 1534 data points) for testing.
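The split scheme quoted above (shuffle, hold out 40% for testing, split the remainder 50/50 into train and validation) can be written down concretely. The function name and seed below are illustrative assumptions, not from the paper:

```python
import random

def split_dataset(data, seed=0):
    """Illustrative sketch of the paper's split: 40% test, rest 50/50."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)                      # randomize the data points
    n_test = int(0.4 * len(data))          # 40% held-out test set
    test, rest = data[:n_test], data[n_test:]
    n_train = len(rest) // 2               # remaining 60% split 50/50
    return rest[:n_train], rest[n_train:], test  # train, val, test
```

Under this scheme, the train set is 30% of all available data points, matching the parenthetical in the quote.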
Hardware Specification No The paper mentions using specific LLM models (Gemini 1.5 Pro, Gemini 1.5 Flash, Mistral NeMo, Mistral Large, Claude 3.5 Sonnet) but does not specify the underlying hardware (e.g., GPU models, CPU types, or TPU versions) on which these models were run for the experiments.
Software Dependencies No The paper mentions using specific packages like 'gpytorch (Gardner et al., 2018) or botorch (Balandat et al., 2020)' but does not provide version numbers for these or any other software dependencies crucial for replication.
Experiment Setup Yes For all tasks, we run BRIDGE with K = 3 rounds (i.e., the number of outer-loop iterations in Algorithm 1) and within each round, we allow for n_eval = 32 evaluations on the validation set (i.e., the number of inner-loop iterations in Algorithm 2), and we report the results at the end of each optimize and generate step to visualize the iteration process. For baselines, we consider 1) using all provided examples, in three variants: a) using query-target only without any generated rationales (Direct); b) first prompting the LLM to generate rationales and answers, and using the concatenation of query-rationale-target as demonstrations, regardless of whether the rationale led to the correct answer (CoT); and c) prompting the LLM with both the query and the final, ground-truth answer to fill in the rationale; this technique has been variously referred to as, e.g., infilling (Hu et al., 2023), rationalization (Zelikman et al., 2022), or more generally, teacher forcing (Chen et al., 2025) due to its conceptual similarity to teacher forcing in recurrent neural network (RNN) training (Lamb et al., 2016) (Infill); 2) reinforced ICL (Agarwal et al., 2024), where all available input-output pairs from the correct predictions on the train set with zero-shot prompting are used; and 3) an iterative variant of reinforced ICL, which can also be seen as BRIDGE without the optimize step: while we repeat the generation process on the train set K = 3 times, we do not first select the optimized subset but instead use all generated examples from the previous step as demonstrations, E_k ← f_LLM(D_t, E_{k−1}). ... We let {β_LB, β_UB} denote the lower and upper bounds of the weight for g(·), which are set to {0.25, 1} by default.
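The iterative reinforced-ICL baseline (BRIDGE without the optimize step) reduces to repeatedly applying the generation map E_k ← f_LLM(D_t, E_{k−1}). A minimal sketch, where `f_llm` is a hypothetical stand-in for the LLM generation call rather than any real API:

```python
def iterative_reinforced_icl(train_set, f_llm, K=3):
    """Sketch of baseline 3): repeat generation K times, feeding all
    examples from the previous round back in as demonstrations."""
    examples = []                              # E_0: zero-shot generation
    for _ in range(K):
        examples = f_llm(train_set, examples)  # E_k = f_LLM(D_t, E_{k-1})
    return examples
```

Unlike BRIDGE proper, no subset selection happens between rounds, so the demonstration pool passed to each generation step is the entire previous output.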