On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to backtrack (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g., (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity. We perform a thorough empirical investigation into the applicability of tokenwise rejection sampling with backtracking in constrained language generation, and benchmark it against common baselines, including rejection sampling (Definition 4), nucleus sampling (Holtzman et al., 2020), temperature scaling, and block best-of-N sampling (Appendix B.2.3), on both synthetic data (Section 5.1) and more realistic data (Section 5.2).
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Microsoft Research NYC. Correspondence to: Yuchen Li <EMAIL>, Andrej Risteski <EMAIL>.
Pseudocode | Yes | Algorithm 1: Tokenwise rejection sampling with backtracking.
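Algorithm 1 itself is not reproduced here; as a rough illustration of the idea, the sketch below implements tokenwise rejection sampling with backtracking against a prefix verifier. The names `sample_next_token`, `is_valid_prefix`, and `max_rejections` are hypothetical placeholders, not the paper's notation.

```python
def tokenwise_rejection_with_backtracking(sample_next_token, is_valid_prefix,
                                          prompt, max_len=64, max_rejections=8):
    """Sketch of tokenwise rejection sampling with backtracking.

    At each step, sample a candidate token and ask the verifier whether the
    extended prefix can still be completed into a valid string. After too
    many rejected candidates in a row, erase the last accepted token and
    resample from the shorter prefix (the "backtracking" step).
    """
    seq = list(prompt)
    rejections = 0
    while len(seq) < max_len:
        tok = sample_next_token(seq)
        if is_valid_prefix(seq + [tok]):
            seq.append(tok)          # accept the token
            rejections = 0
        else:
            rejections += 1
            if rejections > max_rejections and len(seq) > len(prompt):
                seq.pop()            # backtrack: erase the last accepted token
                rejections = 0
    return seq
```

For a Dyck-style verifier (no prefix may close more brackets than it opens), this loop keeps extending the sequence while the verifier confirms the prefix is still completable, and retreats when it gets stuck.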
Open Source Code | Yes | Our code is released at https://github.com/YuchenLi01/LM_Query_Complexity.
Open Datasets | No | The paper uses synthetic data generated from the Dyck grammar (Section 5.1.1) and generates task prompts for testing Code Llama (Section 5.2.1). There is no explicit statement that these generated datasets are made publicly available through a link, DOI, or repository: the paper describes the *process* of data generation, but not *access* to the generated data itself, so reproducing it requires re-running the generation process.
Dataset Splits | Yes | Among the 882 strings in X_error ∪ X_correct, we use 792 samples for training and 90 samples for validation. We independently sampled 10,000 such out-of-distribution prompts (Dyck_unseen^OOD). The test prompts include 10 different target function names that are unseen during training; each target function name is independently tested 10 times.
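To make the quoted split sizes concrete (792 training / 90 validation out of 882 strings), here is a minimal shuffle-then-split sketch. Only the sizes come from the paper; the split procedure, function name, and seed are assumptions for illustration.

```python
import random

def train_val_split(samples, n_train=792, n_val=90, seed=0):
    """Deterministic shuffle-then-split sized to a 792/90 partition of 882
    samples. The paper reports only the sizes; this procedure is assumed."""
    assert n_train + n_val == len(samples)
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:]]
    return train, val
```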
Hardware Specification | No | The paper mentions general terms like "LLM", "GPUs", and "TPUs" and states that computational resources were sponsored by the Simons Foundation and Google, but it does not specify the hardware used in the experiments (e.g., "NVIDIA A100" or "Intel Xeon").
Software Dependencies | No | The paper mentions using a pretrained Code Llama (Roziere et al., 2023) as the generator and autoregressive Transformer (Vaswani et al., 2017) models, and it mentions nucleus sampling. However, it does not specify software dependencies with version numbers, such as the Python version, the deep learning framework (e.g., PyTorch, TensorFlow) and its version, or other libraries.
Experiment Setup | Yes | In our experiments, we pretrain autoregressive Transformer (Vaswani et al., 2017) language models (6 layers, 8 heads per layer, hidden dimension 512) from scratch on data sampled from D_Dyck with D = 32, p = 0.2, q = 0.5. We use batch size 32, weight decay 0.1, and learning rate 3e-4 with 100 warmup steps, and follow Block et al. (2024) in using an exponential moving average to stabilize training. We extensively tuned the hyperparameters of common baseline decoding algorithms: for nucleus sampling (Holtzman et al., 2020), we grid-searched top-p ∈ {0.0, 0.7, 0.8, 0.9, 0.95, 1.0}; for temperature scaling (Ackley et al., 1985), we grid-searched temperature ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2} (for each top-p).
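The grid-searched baselines combine temperature scaling with nucleus (top-p) filtering. Below is a minimal sketch of that sampling step, assuming raw logits as input; it is a generic textbook implementation of the two techniques, not the paper's code, and the edge-case conventions (temperature 0 as greedy, top-p 0 as keeping only the argmax) are assumptions.

```python
import math
import random

def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample a token index with temperature scaling + nucleus (top-p) filtering.

    temperature=0 is treated as greedy decoding; top_p=1 disables filtering;
    top_p=0 keeps only the single most probable token.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and sample from it.
    total = sum(probs[i] for i in keep)
    r = rng.random() * total
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]
```

A grid search in the quoted style would simply loop over the top-p and temperature values, calling this sampler inside the decoding loop for each combination.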