On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to backtrack (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g., (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity. We perform a thorough empirical investigation into the applicability of tokenwise rejection sampling with backtracking in constrained language generation, and benchmark it against common baselines, including rejection sampling (Definition 4), nucleus sampling (Holtzman et al., 2020), temperature scaling, and block best-of-N sampling (Appendix B.2.3), on both synthetic data (Section 5.1) and more realistic data (Section 5.2).
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Microsoft Research NYC. Correspondence to: Yuchen Li <EMAIL>, Andrej Risteski <EMAIL>.
Pseudocode | Yes | Algorithm 1: Tokenwise rejection sampling with backtracking.
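Algorithm 1 itself is not reproduced here; as a rough illustration of the idea, the sketch below implements tokenwise rejection sampling with backtracking against a prefix verifier. The names `sample_next_token`, `is_valid_prefix`, and `max_rejections` are hypothetical placeholders, not the paper's notation.

```python
def tokenwise_rejection_with_backtracking(sample_next_token, is_valid_prefix,
                                          prompt, max_len=64, max_rejections=8):
    """Sketch of tokenwise rejection sampling with backtracking.

    At each step, sample a candidate token and ask the verifier whether the
    extended prefix can still be completed into a valid string. After too
    many rejected candidates in a row, erase the last accepted token and
    resample from the shorter prefix (the "backtracking" step).
    """
    seq = list(prompt)
    rejections = 0
    while len(seq) < max_len:
        tok = sample_next_token(seq)
        if is_valid_prefix(seq + [tok]):
            seq.append(tok)          # accept the token
            rejections = 0
        else:
            rejections += 1
            if rejections > max_rejections and len(seq) > len(prompt):
                seq.pop()            # backtrack: erase the last accepted token
                rejections = 0
    return seq
```

For a Dyck-style verifier (no prefix may close more brackets than it opens), this loop keeps extending the sequence while the verifier confirms the prefix is still completable, and retreats when it gets stuck.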
Open Source Code | Yes | Our code is released at https://github.com/YuchenLi01/LM_Query_Complexity.
Open Datasets | No | The paper uses synthetic data generated from the Dyck grammar (Section 5.1.1) and generates task prompts for testing Code Llama (Section 5.2.1). There is no explicit statement that these generated datasets are made publicly available through a link, DOI, or repository: the paper describes the *process* of data generation, but not *access* to the generated data itself, so reproducing it requires re-running the generation process.
Dataset Splits | Yes | Among the 882 strings in X_error ∪ X_correct, we use 792 samples for training and 90 samples for validation. We independently sampled 10,000 such out-of-distribution prompts (Dyck_unseen^OOD). The test prompts include 10 different target function names that are unseen during training; each target function name is independently tested 10 times.
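To make the quoted split sizes concrete (792 training / 90 validation out of 882 strings), here is a minimal shuffle-then-split sketch. Only the sizes come from the paper; the split procedure, function name, and seed are assumptions for illustration.

```python
import random

def train_val_split(samples, n_train=792, n_val=90, seed=0):
    """Deterministic shuffle-then-split sized to a 792/90 partition of 882
    samples. The paper reports only the sizes; this procedure is assumed."""
    assert n_train + n_val == len(samples)
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:]]
    return train, val
```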
Hardware Specification | No | The paper mentions general terms like "LLM", "GPUs", and "TPUs" and states that computational resources were sponsored by the Simons Foundation and Google, but it does not specify the hardware used in the experiments (e.g., "NVIDIA A100" or "Intel Xeon").
Software Dependencies | No | The paper mentions using a pretrained Code Llama (Roziere et al., 2023) as the generator and autoregressive Transformer (Vaswani et al., 2017) models, and it mentions nucleus sampling. However, it does not specify software dependencies with version numbers, such as the Python version, the deep learning framework (e.g., PyTorch, TensorFlow) and its version, or other libraries.
Experiment Setup | Yes | In our experiments, we pretrain autoregressive Transformer (Vaswani et al., 2017) language models (6 layers, 8 heads per layer, hidden dimension 512) from scratch on data sampled from D_Dyck with D = 32, p = 0.2, q = 0.5. We use batch size 32, weight decay 0.1, and learning rate 3e-4 with 100 warmup steps, and follow Block et al. (2024) in using an exponential moving average to stabilize training. We extensively tuned the hyperparameters of common baseline decoding algorithms: for nucleus sampling (Holtzman et al., 2020), we grid-searched top-p ∈ {0.0, 0.7, 0.8, 0.9, 0.95, 1.0}; for temperature scaling (Ackley et al., 1985), we grid-searched temperature ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2} (for each top-p).
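The grid-searched baselines combine temperature scaling with nucleus (top-p) filtering. Below is a minimal sketch of that sampling step, assuming raw logits as input; it is a generic textbook implementation of the two techniques, not the paper's code, and the edge-case conventions (temperature 0 as greedy, top-p 0 as keeping only the argmax) are assumptions.

```python
import math
import random

def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample a token index with temperature scaling + nucleus (top-p) filtering.

    temperature=0 is treated as greedy decoding; top_p=1 disables filtering;
    top_p=0 keeps only the single most probable token.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and sample from it.
    total = sum(probs[i] for i in keep)
    r = rng.random() * total
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]
```

A grid search in the quoted style would simply loop over the top-p and temperature values, calling this sampler inside the decoding loop for each combination.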