An Architecture Search Framework for Inference-Time Techniques
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, E. Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ARCHON architectures across a diverse set of instruction-following, reasoning, and coding benchmarks (Table 1): MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MATH, and CodeContests (Zheng et al., 2023; Li et al., 2024b; 2023; Ni et al., 2024; Hendrycks et al., 2021; Li et al., 2022). Our best ARCHON architectures surpass both frontier models (e.g., OpenAI's o1, GPT-4o, and Claude-3.5 Sonnet) and prior top-performing inference-time architectures (e.g., ADAS, AFlow, and MoA), boosting state-of-the-art (SOTA) performance by 15.1%, on average. |
| Researcher Affiliation | Academia | Jon Saad-Falcon 1 Adrian Gamarra Lafuente 1 Shlok Natarajan 1 Nahum Maru 1 Hristo Todorov 1 Etash Guha 2 E. Kelly Buchanan 1 Mayee Chen 1 Neel Guha 1 Christopher Ré 1 Azalia Mirhoseini 1 ... 1Stanford University, Stanford, CA, USA 2University of Washington, Seattle, WA, USA. Correspondence to: Jon Saad-Falcon <EMAIL>. |
| Pseudocode | No | The paper describes the components and rules for ARCHON's construction and architecture search in detail (Sections 3.1, 3.2, 3.3, and Appendix A.4). It includes figures illustrating the framework's flow and example architectures (Figure 2, Figure 3, Figures 13-16). However, it does not contain a dedicated section or block explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like procedural steps in that format within the main text or appendices. |
| Open Source Code | Yes | Overall, we present ARCHON as an open-source inference-time framework, readily extensible to new inference-time techniques, models, and tasks via user-friendly interfaces. |
| Open Datasets | Yes | We evaluate our models with several benchmarks for instruction-following, reasoning, and coding: MT-Bench (Zheng et al., 2023), AlpacaEval 2.0 (Li et al., 2023), Arena-Hard-Auto (Li et al., 2024b), MixEval (Ni et al., 2024), MixEval-Hard, MATH (Hendrycks et al., 2021), and CodeContests (Li et al., 2022). |
| Dataset Splits | Yes | Since we perform automatic architecture search on a randomly sampled 20% subset of each benchmark, we evaluate on the remaining held-out 80% subset of the benchmark (Table 1)... For MATH, we evaluate a random sample of 200 problems from the dataset's test set. For CodeContests, we evaluate on the 140 test set questions that do not include image tags in the problem description. |
| Hardware Specification | No | The paper discusses the use of LLMs with varying parameter counts (e.g., "70B+ parameters", "7B open-source models") and analyzes compute efficiency in terms of inference calls, input/output tokens, and FLOPs. It also touches on costs. However, it does not provide specific details about the underlying hardware (e.g., particular GPU models like NVIDIA A100s, specific CPU types, or cloud instance configurations) used to execute the experiments. |
| Software Dependencies | No | The paper mentions utilizing the "Bayesian Optimization python package for global optimization with Gaussian processes" in Appendix A.4. However, it does not specify the version number of this package, nor does it list specific versions for other software dependencies like Python itself or any other libraries used in their implementation. |
| Experiment Setup | Yes | Guided by the trends found in our analysis in Section 3.2, we establish six axes of hyperparameters for the search space: 1. Top-K Generators for Ensemble: The top-K models for the initial Generator ensemble, ranging from 1 to 10 (T1). 2. Top-K Generator Samples: The number of samples gathered from each ensemble generator (same for all the models), ranging from 1 to 5 (T1). For CodeContests, we explore high-sample settings: [1, 10, 100, 500, 1000]. 3. Number of Fusion Layers: Ranges from 1 to 4. The last fusion layer will always have a single Fuser (T2). 4. Top-K Fusers: Number of models used for each fusion layer, ranging from 2 to 10 in increments of 2 (T2,3). 5. Critic and Ranker Layers: We add critic and ranker layers before each fuser layer since we find they provide added benefits across the benchmarks explored (T3) (Section 3.2; Figure 4; Figure 7). 6. Evaluation Layer: Option to add Verifier, Unit Test Gen./Eval., or neither before the last Fuser layer (T4). ... For all the LLMs utilized and every ARCHON component, we set the generation temperature to 0.7. |
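The six-axis search space quoted above can be sketched as a small configuration sampler. This is a hypothetical illustration only, not the actual ARCHON implementation: all names (`SEARCH_SPACE`, `sample_config`, the axis keys) are invented for this sketch, and only the value ranges come from the paper's description.

```python
import random

# Hypothetical encoding of the six hyperparameter axes described in the paper.
# Names and structure are illustrative; only the ranges reflect the source text.
SEARCH_SPACE = {
    "top_k_generators": list(range(1, 11)),        # axis 1: 1-10 ensemble generators
    "top_k_generator_samples": list(range(1, 6)),  # axis 2: 1-5 samples per generator
    "num_fusion_layers": list(range(1, 5)),        # axis 3: 1-4 fusion layers
    "top_k_fusers": list(range(2, 11, 2)),         # axis 4: 2, 4, 6, 8, 10 per layer
    "add_critic_ranker_layers": [True],            # axis 5: always added before fusers
    "evaluation_layer": ["verifier", "unit_test_gen_eval", None],  # axis 6
}

# CodeContests swaps in the high-sample settings for generator samples.
CODE_CONTESTS_SAMPLES = [1, 10, 100, 500, 1000]

def sample_config(rng: random.Random, task: str = "default") -> dict:
    """Draw one random architecture configuration from the search space."""
    space = dict(SEARCH_SPACE)
    if task == "code_contests":
        space["top_k_generator_samples"] = CODE_CONTESTS_SAMPLES
    config = {axis: rng.choice(choices) for axis, choices in space.items()}
    # Per the paper, the last fusion layer always has a single Fuser.
    config["last_layer_fusers"] = 1
    return config

if __name__ == "__main__":
    print(sample_config(random.Random(0)))
```

A search procedure (the paper mentions Bayesian optimization in Appendix A.4) would score each sampled configuration on the 20% search split and keep the best-performing architecture for held-out evaluation.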