An Architecture Search Framework for Inference-Time Techniques
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, E. Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ARCHON architectures across a diverse set of instruction-following, reasoning, and coding benchmarks (Table 1): MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MATH, and CodeContests (Zheng et al., 2023; Li et al., 2024b; 2023; Ni et al., 2024; Hendrycks et al., 2021; Li et al., 2022). Our best ARCHON architectures surpass both frontier models (e.g., OpenAI's o1, GPT-4o, and Claude-3.5 Sonnet) and prior top-performing inference-time architectures (e.g., ADAS, AFlow, and MoA), boosting state-of-the-art (SOTA) performance by 15.1%, on average. |
| Researcher Affiliation | Academia | Jon Saad-Falcon 1 Adrian Gamarra Lafuente 1 Shlok Natarajan 1 Nahum Maru 1 Hristo Todorov 1 Etash Guha 2 E. Kelly Buchanan 1 Mayee Chen 1 Neel Guha 1 Christopher Ré 1 Azalia Mirhoseini 1 ... 1Stanford University, Stanford, CA, USA 2University of Washington, Seattle, WA, USA. Correspondence to: Jon Saad-Falcon <EMAIL>. |
| Pseudocode | No | The paper describes the components and rules for ARCHON's construction and architecture search in detail (Sections 3.1, 3.2, 3.3, and Appendix A.4). It includes figures illustrating the framework's flow and example architectures (Figure 2, Figure 3, Figures 13-16). However, it does not contain a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like procedural steps in that format within the main text or appendices. |
| Open Source Code | Yes | Overall, we present ARCHON as an open-source inference-time framework, readily extensible to new inference-time techniques, models, and tasks via user-friendly interfaces. |
| Open Datasets | Yes | We evaluate our models with several benchmarks for instruction-following, reasoning, and coding: MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Li et al., 2023), Arena-Hard-Auto (Li et al., 2024b), MixEval (Ni et al., 2024), MixEval-Hard, MATH (Hendrycks et al., 2021), and CodeContests (Li et al., 2022). |
| Dataset Splits | Yes | Since we perform automatic architecture search on a randomly sampled 20% subset of each benchmark, we evaluate on the remaining held-out 80% subset of the benchmark (Table 1)... For MATH, we evaluate a random sample of 200 problems from the dataset's test set. For CodeContests, we evaluate on the 140 test set questions that do not include image tags in the problem description. |
| Hardware Specification | No | The paper discusses the use of LLMs with varying parameter counts (e.g., "70B+ parameters", "7B open-source models") and analyzes compute efficiency in terms of inference calls, input/output tokens, and FLOPs. It also touches on costs. However, it does not provide specific details about the underlying hardware (e.g., particular GPU models like NVIDIA A100s, specific CPU types, or cloud instance configurations) used to execute the experiments. |
| Software Dependencies | No | The paper mentions utilizing the "Bayesian Optimization python package for global optimization with Gaussian processes" in Appendix A.4. However, it does not specify the version number of this package, nor does it list specific versions for other software dependencies like Python itself or any other libraries used in their implementation. |
| Experiment Setup | Yes | Guided by the trends found in our analysis in Section 3.2, we establish six axes of hyperparameters for the search space: 1. Top-K Generators for Ensemble: The top-K models for the initial Generator ensemble, ranging from 1 to 10 (T1). 2. Top-K Generator Samples: The number of samples gathered from each ensemble generator (same for all the models), ranging from 1 to 5 (T1). For CodeContests, we explore high-sample settings: [1, 10, 100, 500, 1000]. 3. Number of Fusion Layers: Ranges from 1 to 4. The last fusion layer will always have a single Fuser (T2). 4. Top-K Fusers: Number of models used for each fusion layer, ranging from 2 to 10 in increments of 2 (T2,3). 5. Critic and Ranker Layers: We add critic and ranker layers before each fuser layer since we find they provide added benefits across the benchmarks explored (T3) (Section 3.2; Figure 4; Figure 7). 6. Evaluation Layer: Option to add Verifier, Unit Test Gen./Eval., or neither before the last Fuser layer (T4). ... For all the LLMs utilized and every ARCHON component, we set the generation temperature to 0.7. |
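The six-axis search space quoted above can be sketched as a small configuration sampler. This is a hypothetical illustration only, not the actual ARCHON implementation: all names (`SEARCH_SPACE`, `sample_config`, the axis keys) are invented for this sketch, and only the value ranges come from the paper's description.

```python
import random

# Hypothetical encoding of the six hyperparameter axes described in the paper.
# Names and structure are illustrative; only the ranges reflect the source text.
SEARCH_SPACE = {
    "top_k_generators": list(range(1, 11)),        # axis 1: 1-10 ensemble generators
    "top_k_generator_samples": list(range(1, 6)),  # axis 2: 1-5 samples per generator
    "num_fusion_layers": list(range(1, 5)),        # axis 3: 1-4 fusion layers
    "top_k_fusers": list(range(2, 11, 2)),         # axis 4: 2, 4, 6, 8, 10 per layer
    "add_critic_ranker_layers": [True],            # axis 5: always added before fusers
    "evaluation_layer": ["verifier", "unit_test_gen_eval", None],  # axis 6
}

# CodeContests swaps in the high-sample settings for generator samples.
CODE_CONTESTS_SAMPLES = [1, 10, 100, 500, 1000]

def sample_config(rng: random.Random, task: str = "default") -> dict:
    """Draw one random architecture configuration from the search space."""
    space = dict(SEARCH_SPACE)
    if task == "code_contests":
        space["top_k_generator_samples"] = CODE_CONTESTS_SAMPLES
    config = {axis: rng.choice(choices) for axis, choices in space.items()}
    # Per the paper, the last fusion layer always has a single Fuser.
    config["last_layer_fusers"] = 1
    return config

if __name__ == "__main__":
    print(sample_config(random.Random(0)))
```

A search procedure (the paper mentions Bayesian optimization in Appendix A.4) would score each sampled configuration on the 20% search split and keep the best-performing architecture for held-out evaluation.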