DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we propose two variations of the method that aim to correct subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.
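The core idea, truncating the vocabulary at the steepest drop between consecutive sorted probabilities, can be illustrated on a toy distribution (the probability values below are made up for illustration, not taken from the paper):

```python
import numpy as np

# Toy token probabilities, already sorted in descending order (illustrative values only).
sorted_probs = np.array([0.35, 0.30, 0.28, 0.04, 0.03])

# Difference between each probability and the next one (always <= 0);
# the most negative entry marks the steepest drop in the distribution.
deltas = np.append(sorted_probs[1:], 0.0) - sorted_probs

cut_idx = int(np.argmin(deltas))   # index of the steepest drop
kept = sorted_probs[:cut_idx + 1]  # tokens that survive truncation
print(cut_idx, kept)               # cut falls after the third token here
```

Here the three head tokens have similar mass, so the cut lands after the third one rather than keeping only the top token, which is the behavior a fixed top-k would not adapt to.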
Researcher Affiliation Academia Giorgio Franceschelli EMAIL Alma Mater Studiorum Università di Bologna, Bologna, Italy Mirco Musolesi EMAIL University College London, London, United Kingdom Alma Mater Studiorum Università di Bologna, Bologna, Italy
Pseudocode Yes Algorithm 1 DiffSampling
    Input: probabilities probs = [p_t^[1], ..., p_t^[N]], lower bound p_lb, upper bound p_min, temperature tau.
    sorted_probs, indices = sort(probs)
    fwd_probs = sorted_probs[1:] + [0.0]
    delta_probs = fwd_probs - sorted_probs
    if p_min > 0.0 then
        lim = argmin(sorted_probs > p_min * sorted_probs[0]) - 1
        delta_probs[:lim] = 0.0
    else
        nucleus = cumsum(sorted_probs) < p_lb
        delta_probs[nucleus] = 0.0
    end if
    cut_idx = argmin(delta_probs)
    sorted_probs[cut_idx+1:] = 0.0
    probs = sort_by_idx(sorted_probs, indices)
    logits = log(probs / sum(probs)) / tau
    probs = softmax(logits)
    Output: probs
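The pseudocode above can be sketched in NumPy as follows. This is an illustrative reimplementation following Algorithm 1, not the authors' released code; the handling of the edge case where no token falls below p_min * max-prob is an assumption:

```python
import numpy as np

def diff_sampling_probs(probs, p_lb=0.0, p_min=0.0, tau=1.0):
    """Truncate a token distribution at the steepest drop between
    consecutive sorted probabilities, then renormalize with temperature.

    p_lb > 0 gives the -lb variant (protect the p_lb nucleus);
    p_min > 0 gives the -minp variant (protect tokens above p_min * max prob);
    both zero gives the plain -cut variant.
    """
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)          # indices sorting probs descending
    sorted_probs = probs[order].copy()
    # Difference between each probability and the next one (<= 0 everywhere).
    fwd_probs = np.append(sorted_probs[1:], 0.0)
    delta_probs = fwd_probs - sorted_probs
    if p_min > 0.0:
        # minp variant: forbid cutting before the last token above p_min * max.
        keep = sorted_probs > p_min * sorted_probs[0]
        lim = (int(np.argmin(keep)) - 1) if not keep.all() else len(probs) - 1
        delta_probs[:max(lim, 0)] = 0.0
    else:
        # lb variant: forbid cutting inside the p_lb nucleus (empty if p_lb = 0).
        nucleus = np.cumsum(sorted_probs) < p_lb
        delta_probs[nucleus] = 0.0
    cut_idx = int(np.argmin(delta_probs))   # steepest remaining drop
    sorted_probs[cut_idx + 1:] = 0.0
    # Undo the sort, renormalize, and apply the temperature via a softmax.
    truncated = np.empty_like(probs)
    truncated[order] = sorted_probs
    logits = np.log(truncated / truncated.sum() + 1e-12) / tau
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

For example, `diff_sampling_probs([0.5, 0.3, 0.15, 0.05])` cuts at the largest gap (after the first token), while passing `p_min=0.1` protects the three head tokens and only removes the tail one.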
Open Source Code Yes The code and results are available at: https://github.com/giorgiofranceschelli/DiffSampling-tmlr
Open Datasets Yes For the math problem-solving tasks, we use the Llama2-based MetaMath-7B-V1.0 model trained with self-supervised learning on MetaMathQA (Yu et al., 2024). For extreme text summarization and story generation, we utilize the Llama-3.2-3B model (Grattafiori et al., 2024), with both original and -Instruct versions. Finally, for the divergent association task, we consider Meta-Llama-3-8B (Grattafiori et al., 2024), using both pre-trained and DPO-tuned -Instruct versions. We study the performances of our three methods: DiffSampling-cut; DiffSampling-lb with p_lb = 0.9; and DiffSampling-minp with p_min = 0.1.
Dataset Splits Yes In particular, we consider the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) test sets; the corresponding prompts are reported in Appendix D. To avoid wasting resources, we focus on entries with a problem and a solution of no more than 512 tokens. ... we consider the eXtreme Summarization (XSum) dataset (Narayan et al., 2018), which contains pairs of documents and one-sentence summaries. In particular, we use the test partition (11334 entries) and exclude all entries with a tokenized document longer than 768 tokens, obtaining 9815 entries; then, we limit our experiment to 1000 random samples... generating stories of up to 1024 tokens using inputs from the Writing Prompts dataset (Fan et al., 2018)... In particular, we sample 500 test prompts among those labeled as standard prompts (i.e., that start with [WP]), and we generate 5 outputs for each sampling scheme.
Hardware Specification Yes All experiments were carried out on a Linux-based local server equipped with 2 80GB NVIDIA H100 GPUs running Python 3.11.9.
Software Dependencies Yes The experiments were run with Python 3.11.9. All the trainings were repeated, varying the random seed among 1, 42, and 121 (set through the set_seed method from the Hugging Face transformers library).
Experiment Setup Yes The hyperparameters governing the sampling strategies adopted as baselines were selected according to the best results reported in their original papers for similar tasks and model sizes. We study the performances of our three methods: DiffSampling-cut; DiffSampling-lb with p_lb = 0.9; and DiffSampling-minp with p_min = 0.1. ... We repeat the experiment 100 times for non-greedy strategies to mitigate the sampling stochasticity.