On the Query Complexity of Verifier-Assisted Language Generation
Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to backtrack (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g., (blockwise) rejection sampling, nucleus sampling) in terms of computational efficiency, accuracy, and diversity. We perform a thorough empirical investigation into the applicability of tokenwise rejection sampling with backtracking in constrained language generation, and benchmark it against common baselines, including rejection sampling (Definition 4), nucleus sampling (Holtzman et al., 2020), temperature scaling, and block best-of-N sampling (Appendix B.2.3), on both synthetic data (Section 5.1) and more realistic data (Section 5.2). |
| Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Microsoft Research NYC. Correspondence to: Yuchen Li <EMAIL>, Andrej Risteski <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Tokenwise rejection sampling with backtracking |
| Open Source Code | Yes | Our code is released at https://github.com/YuchenLi01/LM_Query_Complexity |
| Open Datasets | No | The paper uses synthetic data generated based on the Dyck grammar (Section 5.1.1) and generates task prompts for testing Code Llama (Section 5.2.1). There is no explicit statement about making these generated datasets publicly available through a link, DOI, or repository. The paper describes the *process* of data generation, but not *access* to the generated data itself for reproducibility outside of re-running the generation process. |
| Dataset Splits | Yes | Among the 882 strings in X_error ∪ X_correct, we use 792 samples for training and 90 samples for validation. We independently sampled 10,000 such out-of-distribution prompts (Dyck_unseen-OOD). The test prompts include 10 different target function names that are unseen during training. Each target function name is independently tested 10 times. |
| Hardware Specification | No | The paper mentions general terms like "LLM", "GPUs", and "TPUs" and states that computational resources were sponsored by the Simons Foundation and Google. However, it does not provide specific models or specifications for the hardware used in the experiments (e.g., "NVIDIA A100", "Intel Xeon"). |
| Software Dependencies | No | The paper mentions using a "pretrained Code Llama (Roziere et al., 2023)" as the generator and "autoregressive Transformer (Vaswani et al., 2017)" models. It also mentions nucleus sampling. However, it does not specify software dependencies with version numbers, such as Python versions, specific deep learning frameworks (e.g., PyTorch, TensorFlow) and their versions, or other libraries. |
| Experiment Setup | Yes | In our experiments, we pretrain autoregressive Transformer (Vaswani et al., 2017) language models (6 layers, 8 heads per layer, hidden dimension 512) from scratch on data sampled from D_Dyck with D = 32, p = 0.2, q = 0.5. We use batch size 32, weight decay 0.1, learning rate 3e-4 with 100 warmup steps, and follow Block et al. (2024) in using an exponential moving average to stabilize training. We extensively tuned the hyperparameters of common baseline decoding algorithms, including nucleus sampling (Holtzman et al., 2020), for which we grid-searched top-p ∈ {0.0, 0.7, 0.8, 0.9, 0.95, 1.0}, and temperature scaling (Ackley et al., 1985), for which we grid-searched temperature ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2} (for each top-p). |