Optimizing Temperature for Language Models with Multi-Sample Inference
Authors: Weihua Du, Yiming Yang, Sean Welleck
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Through extensive experiments, TURN has demonstrated strong generalizability across diverse tasks (e.g., mathematical problem-solving, code generation), model sizes, and aggregation strategies (e.g., majority voting, best-of-N). It consistently outperforms baseline methods using a fixed temperature, yielding significant performance improvements. We evaluated 13 models on two tasks, MATH (with majority voting) and MBPP (with Best-of-N), and present the results in Table 1. |
| Researcher Affiliation | Academia | 1Language Technologies Institute, Carnegie Mellon University. Correspondence to: Weihua Du <EMAIL>, Yiming Yang <EMAIL>, Sean Welleck <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Turning Point Temperature Selection (TURN). 1: Input: language model M, task T = (X1, ..., Xk), temperature interval Δt, maximum temperature t_max, sample size N, aggregation method A. 2: Output: predicted temperature T_pred. 3: Compute J = t_max/Δt {number of choices}. 4: Initialize entropy lists E = []. 5: for n = 1 to N do 6: randomly select X_i from T; 7: for j = 0 to J do 8: generate a sample Y using M with temperature T = j·Δt; 9: compute the token-level entropy of Y and add it to E[j]; 10: end for; 11: end for. 12: Compute H(j) = mean(E[j]) for all j. 13: Compute ℓ(j) = log H(j) for all j. 14: Find j* = arg min_j { d²ℓ/dt² > 0 }. 15: Compute t* = j*·Δt. 16: Add adaptation factor β_A: T_pred = t* + β_A. 17: Return T_pred. |
| Open Source Code | Yes | Our code is available at https://github.com/StigLidu/dualdistill. |
| Open Datasets | Yes | Math Problem Solving: We assess language models' reasoning abilities using the MATH dataset (Hendrycks et al., 2021), which consists of competition-level math problems. ... Code Generation: For code generation, we use the MBPP dataset (Austin et al., 2021), selecting the first 100 programming problems. |
| Dataset Splits | Yes | To accommodate multiple models, we randomly select 200 test problems (40 per difficulty level). ... For code generation, we use the MBPP dataset (Austin et al., 2021), selecting the first 100 programming problems. |
| Hardware Specification | Yes | All experiments can be reproduced on a single L40S or A6000 GPU. |
| Software Dependencies | No | Our experiments build upon two open-source projects: Easy-to-Hard Generalization (Sun et al., 2024) for the MATH dataset and bigcode-evaluation-harness (Ben Allal et al., 2022) for the MBPP dataset. We employ vLLM (Kwon et al., 2023) to accelerate inference. The text does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For both tasks, we sample 256 times per question at each temperature level and compute accuracy across different sampling sizes. For temperature prediction in TURN, we use an interval of Δt = 0.1 and set N = 8 × dataset size (an excessive sample size, see Section 5.4 for discussion). The maximum output length is set to 1024 tokens for all tasks. For the MATH dataset, we use top-k sampling with k = 20. |
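The turning-point selection in Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_entropy` is a hypothetical stand-in for drawing one completion from the model at a given temperature and measuring its mean token-level entropy, and a central finite difference approximates the second derivative of the log-entropy curve.

```python
import math
import random

def predict_temperature(sample_entropy, tasks, dt=0.1, t_max=1.5, n=8, beta=0.0):
    """Sketch of TURN: pick the temperature where log mean entropy
    changes from concave to convex (its "turning point")."""
    num_steps = int(round(t_max / dt))            # J = t_max / dt
    entropies = [[] for _ in range(num_steps + 1)]
    for _ in range(n):
        task = random.choice(tasks)               # randomly select X_i from T
        for j in range(num_steps + 1):
            # one sample per temperature level; entropy goes into E[j]
            entropies[j].append(sample_entropy(task, j * dt))
    # H(j) = mean entropy at each temperature; l(j) = log H(j)
    log_h = [math.log(sum(e) / len(e)) for e in entropies]
    # discrete second derivative of l with respect to temperature
    second = [(log_h[j - 1] - 2 * log_h[j] + log_h[j + 1]) / dt ** 2
              for j in range(1, num_steps)]
    # first grid index where the curvature turns positive
    j_star = next(j + 1 for j, d in enumerate(second) if d > 0)
    return j_star * dt + beta                     # aggregation-specific offset
```

With a synthetic entropy curve H(t) = exp((t - 0.95)^3), whose log is concave below t = 0.95 and convex above it, the first grid point with positive curvature is t = 1.0.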
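The two aggregation strategies evaluated in the paper (majority voting on MATH, Best-of-N on MBPP) reduce the 256 samples per question to a single prediction. A minimal sketch, where `score` is a hypothetical verifier or reward function supplied by the caller:

```python
from collections import Counter

def majority_vote(answers):
    # most frequent final answer wins; Counter breaks ties by first appearance
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    # keep the candidate the verifier scores highest
    return max(candidates, key=score)
```

Accuracy at a given sampling size is then estimated by applying the reduction to that many samples per question and checking the aggregated answer.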