Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
Authors: Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this approach in two decoding procedures: first, an adaptive best-of-k procedure that dynamically selects the number of samples to generate as input to a reranker; second, a routing procedure that dynamically responds to a query using a decoding procedure that is expensive but accurate, or one that is cheaper but less capable. Across a suite of programming, mathematics, and dialog tasks, we show that accurate computation-allocation procedures can be learned, and reduce computation by up to 50% at no cost to response quality, or improve quality by up to 10% at a fixed computational budget. ... For the best-of-k setting, we present results across 3 domains: Math, Code and Chat. ... Figures 4a and 4b present the results for the full and tranches variant respectively. ... Figure 5 shows that for both experiments, the learned predictor effectively routes the more challenging queries to the stronger model, while simpler queries are handled by the weaker model. |
| Researcher Affiliation | Academia | Mehul Damani Idan Shenfeld Andi Peng Andreea Bobu Jacob Andreas Massachusetts Institute of Technology Correspondence to EMAIL. |
| Pseudocode | No | Eq. (5) is an integer linear program. But under the assumption of decreasing marginal benefits from additional computation, it has a special form that allows it to be solved in O(B_max · n) time using a greedy algorithm that incrementally turns on the c_ij for which Δ_ij is largest. We exploit this property in two ways: Online allocation: If queries {x_i} are known a priori, we simply replace each Δ_ij in Eq. (5) with the corresponding estimate from r̂(x_i; θ), solve for c_ij, then finally set each b_i = max_j c_ij. Offline allocation: If allocations must be made without access to a full batch of samples, we construct a fixed allocation policy as follows: (1) Hold out a subset of the data used for training the reward estimator r̂, then use it to label queries in this held-out set (e.g. based on the first-sample prediction r̂(x_i)_1). Divide these queries into a fixed set of bins according to their predicted marginal rewards. (2) Solve the allocation problem for this held-out set as in Eq. (5), with the additional constraint that all queries in a bin receive the same budget allocation. For each bin, store the assigned budget. (3) During deployment, compute a reward prediction for each x_i, map it to a bin, and return the budget associated with that bin. |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their code or provide a link to a code repository. The mention of 'The official open-source evaluation framework released with TACO' refers to a third-party tool, not the authors' own implementation of their methodology. |
| Open Datasets | Yes | For the coding experiments, we adopt a subset of TACO, a dataset focused on algorithmic code generation with problems sourced from various programming contest platforms (Li et al., 2023). ... We use a subset of the Numina-COT Dataset (Li et al., 2024), which contains Math problems obtained from diverse sources. ... For chat, we use a subset of the LMSYS-Chat dataset, which contains real-world conversations with 25 LLMs (Zheng et al., 2024). ... We use the harmless subset of the popular Anthropic HH dataset (Bai et al., 2022). ... We implement our adaptive best-of-k pipeline on the popular MATH benchmark (Hendrycks et al., 2021). ... We implement our routing setting on the popular GSM8K dataset (Cobbe et al., 2021). |
| Dataset Splits | Yes | Our final dataset contained 15K training samples, 2K validation samples and 3K test samples. (Appendix A.1, Math) ... Our final dataset consisted of 10K training samples, 1K validation samples and 1K test samples. (Appendix A.2, Code) ... Our final dataset consisted of 50K training samples and 5K test samples. The tranches experiment uses a subset of 20% from the full test set. (Appendix A.3, Chat) ... We sampled 8K samples randomly from the train set, and 400 samples from the test set. (Appendix A.5, Routing: VAS) |
| Hardware Specification | No | The paper mentions various models (e.g., Starcoder-15B, Mistral AI's Mathstral-7B, Gemma-7b-it) and the concept of 'computational resources', but it does not provide any specific details about the hardware (CPU, GPU, memory, etc.) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions several language models and reward models by name (e.g., Starcoder-15B, Mistral AI's Mathstral-7B, Gemma-7b-it, Llama-2 7B, Llama-3.1-8B-Instruct, NCSOFT/Llama-3-Offset Bias-RM-8B, Open Assistant/reward-model-deberta-v3-large-v2) and an 'evaluation framework of Hendrycks et al.'. However, it does not specify version numbers for any of these software components, nor for any programming languages or general libraries used. |
| Experiment Setup | Yes | We generate 128 responses for every query (temperature=0.7) and label them using the verification pipeline. ... Our adaptive allocation algorithms are assigned a maximum budget of Bmax = 128 samples per query (that is the allocation for any query is capped at 128 samples). (Appendix A.1) ... We generate 100 responses for every query (temperature=0.7) and label them using the unit-test verifier... Our adaptive allocation algorithms are assigned a maximum budget of Bmax = 100 samples per query... (Appendix A.2) ... We generate 8 responses (temperature=0.7) for every query... Our adaptive allocation algorithm was assigned a maximum budget of Bmax = 8 samples per query. (Appendix A.3) |
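The online allocation described in the Pseudocode row can be sketched as follows. This is a reader's illustration, not the authors' released code (no implementation is published): the function name and the representation of Δ_ij as a matrix of predicted marginal rewards are assumptions.

```python
import numpy as np

def greedy_allocate(marginal_rewards: np.ndarray, total_budget: int) -> np.ndarray:
    """Greedy solver sketch for the special-case allocation ILP (Eq. 5).

    marginal_rewards: (n, B_max) array whose entry [i, j] estimates the
    marginal reward of giving query i its (j+1)-th sample. Rows are assumed
    non-increasing (decreasing marginal benefit), which is what makes the
    greedy choice optimal.
    Returns an integer budget b_i per query, summing to total_budget.
    """
    n, b_max = marginal_rewards.shape
    # With non-increasing rows, the `total_budget` largest cells overall
    # form a prefix of each row, so counting the selected cells per row
    # yields the greedy allocation. (A selection-based variant achieves
    # the O(B_max * n) bound quoted above; full sorting is simpler here.)
    top_cells = np.argsort(marginal_rewards, axis=None)[::-1][:total_budget]
    return np.bincount(top_cells // b_max, minlength=n)
```

For example, with two queries whose marginal-reward rows are [0.9, 0.5, 0.1] and [0.8, 0.7, 0.2] and a total budget of 3 samples, the three largest cells are 0.9, 0.8, and 0.7, so the first query receives 1 sample and the second receives 2.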
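The offline variant in the Pseudocode row (bin held-out queries by predicted reward, store one budget per bin, look up the bin at deployment) can likewise be sketched. This is a minimal illustration under assumptions not stated in the paper: quantile-based bin edges, per-bin budgets taken as the rounded mean of a precomputed held-out allocation, and hypothetical function names.

```python
import numpy as np

def fit_bin_policy(heldout_pred: np.ndarray, heldout_budgets: np.ndarray,
                   n_bins: int = 5):
    """Fit a fixed allocation policy from a held-out set (sketch).

    heldout_pred: predicted first-sample reward r_hat(x_i)_1 per query.
    heldout_budgets: budget each held-out query received when Eq. (5) was
    solved with the equal-budget-per-bin constraint (assumed precomputed).
    Returns quantile bin edges and one stored budget per bin.
    """
    # Interior quantiles of the predictions define the bin boundaries.
    edges = np.quantile(heldout_pred, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(heldout_pred, edges)
    # Store one budget per bin (rounded mean of its queries' budgets).
    per_bin = np.array([int(round(heldout_budgets[bins == b].mean()))
                        for b in range(n_bins)])
    return edges, per_bin

def deploy_budget(pred: float, edges: np.ndarray, per_bin: np.ndarray) -> int:
    """Deployment step: map a reward prediction to its bin's stored budget."""
    return int(per_bin[np.digitize(pred, edges)])
```

At deployment only a single reward prediction and a table lookup are needed, which is what makes the policy usable without access to a full batch of queries.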