Optimizing Temperature for Language Models with Multi-Sample Inference
Authors: Weihua Du, Yiming Yang, Sean Welleck
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Through extensive experiments, TURN has demonstrated strong generalizability across diverse tasks (e.g., mathematical problem-solving, code generation), model sizes, and aggregation strategies (e.g., majority voting, best-of-N). It consistently outperforms baseline methods using a fixed temperature, yielding significant performance improvements. We evaluated 13 models on two tasks, MATH (with majority voting) and MBPP (with Best-of-N), and present the results in Table 1. |
| Researcher Affiliation | Academia | 1Language Technologies Institute, Carnegie Mellon University. Correspondence to: Weihua Du <EMAIL>, Yiming Yang <EMAIL>, Sean Welleck <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Turning Point Temperature Selection (TURN). 1: Input: language model M, task T = (X1, ..., Xk), temperature interval Δt, maximum temperature t_max, sample size N, aggregation method A. 2: Output: predicted temperature T_pred. 3: Compute J = t_max/Δt {number of choices}. 4: Initialize entropy lists E = []. 5: for n = 1 to N do 6: randomly select X_i from T; 7: for j = 0 to J do 8: generate a sample Y using M with temperature T = j·Δt; 9: compute the token-level entropy of Y and add it to E[j]; 10: end for; 11: end for. 12: Compute H(j) = mean(E[j]) for all j. 13: Compute ℓ(j) = log H(j) for all j. 14: Find j* = arg min_j { d²ℓ/dt² > 0 }. 15: Compute t* = j*·Δt. 16: Add adaptation factor β_A: T_pred = t* + β_A. 17: Return T_pred. |
| Open Source Code | Yes | Our code is available at https://github.com/StigLidu/dualdistill. |
| Open Datasets | Yes | Math Problem Solving: We assess language models' reasoning abilities using the MATH dataset (Hendrycks et al., 2021), which consists of competition-level math problems. ... Code Generation: For code generation, we use the MBPP dataset (Austin et al., 2021), selecting the first 100 programming problems. |
| Dataset Splits | Yes | To accommodate multiple models, we randomly select 200 test problems (40 per difficulty level). ... For code generation, we use the MBPP dataset (Austin et al., 2021), selecting the first 100 programming problems. |
| Hardware Specification | Yes | All experiments can be reproduced on a single L40S or A6000 GPU. |
| Software Dependencies | No | Our experiments build upon two open-source projects: Easy-to-Hard Generalization (Sun et al., 2024) for the MATH dataset and bigcode-evaluation-harness (Ben Allal et al., 2022) for the MBPP dataset. We employ vLLM (Kwon et al., 2023) to accelerate inference. The text does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For both tasks, we sample 256 times per question at each temperature level and compute accuracy across different sampling sizes. For temperature prediction in TURN, we use an interval of Δt = 0.1 and set N = 8 × dataset size (an excessive sample size, see Section 5.4 for discussion). The maximum output length is set to 1024 tokens for all tasks. For the MATH dataset, we use top-k sampling with k = 20. |
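The turning-point selection in Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_entropy` is a hypothetical stand-in for drawing one completion from the model at a given temperature and measuring its mean token-level entropy, and a central finite difference approximates the second derivative of the log-entropy curve.

```python
import math
import random

def predict_temperature(sample_entropy, tasks, dt=0.1, t_max=1.5, n=8, beta=0.0):
    """Sketch of TURN: pick the temperature where log mean entropy
    changes from concave to convex (its "turning point")."""
    num_steps = int(round(t_max / dt))            # J = t_max / dt
    entropies = [[] for _ in range(num_steps + 1)]
    for _ in range(n):
        task = random.choice(tasks)               # randomly select X_i from T
        for j in range(num_steps + 1):
            # one sample per temperature level; entropy goes into E[j]
            entropies[j].append(sample_entropy(task, j * dt))
    # H(j) = mean entropy at each temperature; l(j) = log H(j)
    log_h = [math.log(sum(e) / len(e)) for e in entropies]
    # discrete second derivative of l with respect to temperature
    second = [(log_h[j - 1] - 2 * log_h[j] + log_h[j + 1]) / dt ** 2
              for j in range(1, num_steps)]
    # first grid index where the curvature turns positive
    j_star = next(j + 1 for j, d in enumerate(second) if d > 0)
    return j_star * dt + beta                     # aggregation-specific offset
```

With a synthetic entropy curve H(t) = exp((t - 0.95)^3), whose log is concave below t = 0.95 and convex above it, the first grid point with positive curvature is t = 1.0.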
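The two aggregation strategies evaluated in the paper (majority voting on MATH, Best-of-N on MBPP) reduce the 256 samples per question to a single prediction. A minimal sketch, where `score` is a hypothetical verifier or reward function supplied by the caller:

```python
from collections import Counter

def majority_vote(answers):
    # most frequent final answer wins; Counter breaks ties by first appearance
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    # keep the candidate the verifier scores highest
    return max(candidates, key=score)
```

Accuracy at a given sampling size is then estimated by applying the reduction to that many samples per question and checking the aggregated answer.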