DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we propose two variations of the method that aim to correct subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.
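The core idea, truncating the vocabulary at the steepest drop between consecutive sorted probabilities, can be illustrated on a toy distribution (the probability values below are made up for illustration, not taken from the paper):

```python
import numpy as np

# Toy token probabilities, already sorted in descending order (illustrative values only).
sorted_probs = np.array([0.35, 0.30, 0.28, 0.04, 0.03])

# Difference between each probability and the next one (always <= 0);
# the most negative entry marks the steepest drop in the distribution.
deltas = np.append(sorted_probs[1:], 0.0) - sorted_probs

cut_idx = int(np.argmin(deltas))   # index of the steepest drop
kept = sorted_probs[:cut_idx + 1]  # tokens that survive truncation
print(cut_idx, kept)               # cut falls after the third token here
```

Here the three head tokens have similar mass, so the cut lands after the third one rather than keeping only the top token, which is the behavior a fixed top-k would not adapt to.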
Researcher Affiliation Academia Giorgio Franceschelli EMAIL Alma Mater Studiorum Università di Bologna, Bologna, Italy Mirco Musolesi EMAIL University College London, London, United Kingdom Alma Mater Studiorum Università di Bologna, Bologna, Italy
Pseudocode Yes Algorithm 1 DiffSampling
    Input: probabilities probs = [p_t^[1], ..., p_t^[N]], lower bound p_lb, upper bound p_min, temperature tau.
    sorted_probs, indices = sort(probs)
    fwd_probs = sorted_probs[1:] + [0.0]
    delta_probs = fwd_probs - sorted_probs
    if p_min > 0.0 then
        lim = argmin(sorted_probs > p_min * sorted_probs[0]) - 1
        delta_probs[:lim] = 0.0
    else
        nucleus = cumsum(sorted_probs) < p_lb
        delta_probs[nucleus] = 0.0
    end if
    cut_idx = argmin(delta_probs)
    sorted_probs[cut_idx+1:] = 0.0
    probs = sort_by_idx(sorted_probs, indices)
    logits = log(probs / sum(probs)) / tau
    probs = softmax(logits)
    Output: probs
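The pseudocode above can be sketched in NumPy as follows. This is an illustrative reimplementation following Algorithm 1, not the authors' released code; the handling of the edge case where no token falls below p_min * max-prob is an assumption:

```python
import numpy as np

def diff_sampling_probs(probs, p_lb=0.0, p_min=0.0, tau=1.0):
    """Truncate a token distribution at the steepest drop between
    consecutive sorted probabilities, then renormalize with temperature.

    p_lb > 0 gives the -lb variant (protect the p_lb nucleus);
    p_min > 0 gives the -minp variant (protect tokens above p_min * max prob);
    both zero gives the plain -cut variant.
    """
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)          # indices sorting probs descending
    sorted_probs = probs[order].copy()
    # Difference between each probability and the next one (<= 0 everywhere).
    fwd_probs = np.append(sorted_probs[1:], 0.0)
    delta_probs = fwd_probs - sorted_probs
    if p_min > 0.0:
        # minp variant: forbid cutting before the last token above p_min * max.
        keep = sorted_probs > p_min * sorted_probs[0]
        lim = (int(np.argmin(keep)) - 1) if not keep.all() else len(probs) - 1
        delta_probs[:max(lim, 0)] = 0.0
    else:
        # lb variant: forbid cutting inside the p_lb nucleus (empty if p_lb = 0).
        nucleus = np.cumsum(sorted_probs) < p_lb
        delta_probs[nucleus] = 0.0
    cut_idx = int(np.argmin(delta_probs))   # steepest remaining drop
    sorted_probs[cut_idx + 1:] = 0.0
    # Undo the sort, renormalize, and apply the temperature via a softmax.
    truncated = np.empty_like(probs)
    truncated[order] = sorted_probs
    logits = np.log(truncated / truncated.sum() + 1e-12) / tau
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

For example, `diff_sampling_probs([0.5, 0.3, 0.15, 0.05])` cuts at the largest gap (after the first token), while passing `p_min=0.1` protects the three head tokens and only removes the tail one.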
Open Source Code Yes The code and results are available at: https://github.com/giorgiofranceschelli/DiffSampling-tmlr
Open Datasets Yes For the math problem-solving tasks, we use the Llama2-based MetaMath-7B-V1.0 model trained with self-supervised learning on MetaMathQA (Yu et al., 2024). For extreme text summarization and story generation, we utilize the Llama-3.2-3B model (Grattafiori et al., 2024), with both original and -Instruct versions. Finally, for the divergent association task, we consider Meta-Llama-3-8B (Grattafiori et al., 2024), using both pre-trained and DPO-tuned -Instruct versions. We study the performances of our three methods: DiffSampling-cut; DiffSampling-lb with p_lb = 0.9; and DiffSampling-minp with p_min = 0.1.
Dataset Splits Yes In particular, we consider the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) test sets; the corresponding prompts are reported in Appendix D. To avoid wasting resources, we focus on entries with a problem and a solution of no more than 512 tokens. ... we consider the eXtreme Summarization (XSum) dataset (Narayan et al., 2018), which contains pairs of documents and one-sentence summaries. In particular, we use the test partition (11334 entries) and exclude all entries with a tokenized document longer than 768 tokens, obtaining 9815 entries; then, we limit our experiment to 1000 random samples... generating stories of up to 1024 tokens using inputs from the Writing Prompts dataset (Fan et al., 2018)... In particular, we sample 500 test prompts among those labeled as standard prompts (i.e., that start with [WP]), and we generate 5 outputs for each sampling scheme.
Hardware Specification Yes All experiments were carried out on a Linux-based local server equipped with 2 80GB NVIDIA H100 GPUs running Python 3.11.9.
Software Dependencies Yes The experiments were run with Python 3.11.9. All the trainings were repeated, varying the random seed among 1, 42, and 121 (set through the set_seed method from the Hugging Face transformers library).
Experiment Setup Yes The hyperparameters governing the sampling strategies adopted as baselines were selected according to the best results reported in their original papers for similar tasks and model sizes. We study the performances of our three methods: DiffSampling-cut; DiffSampling-lb with p_lb = 0.9; and DiffSampling-minp with p_min = 0.1. ... We repeat the experiment 100 times for non-greedy strategies to mitigate the sampling stochasticity.