Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

Authors: João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander Lew, Tim Vieira, Timothy O'Donnell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our approach and six alternatives to four challenging problem domains: Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis (§3.1). We find that, with little overhead, our approach significantly improves performance across domains, allowing small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. Table 2: Comparison of method performance across domains with bootstrapped 95% confidence intervals.
Researcher Affiliation | Academia | João Loula 1, Benjamin LeBrun 5, Li Du 6, Ben Lipkin 1, Clemente Pasti 2, Gabriel Grand 1, Tianyu Liu 2, Yahya Emara 2, Marjorie Freedman 8, Jason Eisner 6, Ryan Cotterell 2, Vikash Mansinghka 1, Alexander K. Lew 1,7, Tim Vieira 2, Timothy J. O'Donnell 3,4,5. 1 MIT, 2 ETH Zürich, 3 McGill, 4 Canada CIFAR AI Chair, 5 Mila, 6 Johns Hopkins, 7 Yale, 8 ISI
Pseudocode | Yes | Algorithm 1 Character proposal: This procedure implements a properly weighted proposal distribution for the unnormalized version of the locally constrained distribution ℓ_{ϕG}(· | x).
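The core idea behind a properly weighted constrained proposal can be sketched generically: sample the next symbol from the language model's distribution restricted to the symbols the constraint allows, and record the allowed probability mass as an incremental importance weight. This is a minimal illustration of the technique, not the paper's Algorithm 1; `probs` and `allowed` are hypothetical inputs standing in for the LM's next-symbol distribution and a constraint mask.

```python
import numpy as np

def constrained_step(probs, allowed, rng=None):
    """Sample one symbol from a masked distribution, with importance weight.

    probs:   next-symbol probabilities from the language model
    allowed: boolean mask of symbols the constraint (e.g. a grammar) permits
    Returns (symbol index, log incremental weight), or (None, -inf) on a dead end.
    """
    rng = rng or np.random.default_rng()
    masked = np.where(allowed, probs, 0.0)
    z = masked.sum()                    # probability mass the constraint allows
    if z == 0.0:
        return None, -np.inf            # no valid continuation from this prefix
    symbol = rng.choice(len(probs), p=masked / z)
    return symbol, np.log(z)            # weight corrects for the renormalization
```

Multiplying the per-step weights along a sampled sequence yields an unbiased estimate of the sequence's unnormalized probability under the constrained distribution, which is what makes the proposal "properly weighted."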
Open Source Code | Yes | https://github.com/probcomp/genlm-control
Open Datasets | Yes | Goal inference (Planetarium). ... Data: Blocksworld tasks with up to 10 objects from the Planetarium benchmark (Zuo et al., 2024). ... Python for data science (DS-1000). ... Data: DS-1000 benchmark (Lai et al., 2023). ... Text-to-SQL (Spider). ... Data: Spider development split (Yu et al., 2018). ... Molecular synthesis (GDB-17). ... Data: Few-shot prompts constructed by repeatedly choosing 20 random examples from the GDB-17 dataset (Ruddigkeit et al., 2012).
Dataset Splits | Yes | Data: Blocksworld tasks with up to 10 objects from the Planetarium benchmark (Zuo et al., 2024). ... Data: Few-shot prompts constructed by repeatedly choosing 20 random examples from the GDB-17 dataset (Ruddigkeit et al., 2012). ... Data: Spider development split (Yu et al., 2018).
Hardware Specification | Yes | We ran experiments on GCP instances with 1 A100 GPU and 12 vCPUs (our CFG parser is implemented for CPU and is parallelized across particles), with the exception of the Data Science domain, for which we used 4 H100 GPUs and 64 vCPUs.
Software Dependencies | No | The paper mentions software such as the Python RDKit library (Landrum, 2024), the Python partialsmiles library (O'Boyle, 2024), and the VAL plan validator (Howey et al., 2004). While these are specific tools, the paper gives no explicit version numbers for the libraries, which is required for a 'Yes' classification. It also names specific Llama model versions, but these are pre-trained models rather than the ancillary software dependencies needed to replicate the code environment.
Experiment Setup | Yes | We report results using N = 10 particles; see Appendix A.2 and Fig. 2 for downstream accuracy results for a varying number of particles. ... We ran the without-replacement baseline (SMC Steering) with N = 5 particles and a beam size of 3, alongside our approach using multinomial resampling with N = 10 particles (and an ESS threshold of 0.9).
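The resampling scheme quoted above (multinomial resampling triggered by an effective-sample-size threshold) is a standard SMC component and can be sketched as follows. This is a generic illustration consistent with the stated setup (N = 10 particles, ESS threshold 0.9), not the authors' implementation; `particles` here is an arbitrary list of partial sequences.

```python
import numpy as np

def maybe_resample(particles, log_weights, ess_threshold=0.9, rng=None):
    """Multinomial resampling when the normalized ESS drops below a threshold.

    ESS = 1 / sum(w_i^2) for normalized weights w; resampling fires when
    ESS / N < ess_threshold, after which weights are reset to uniform (log 0).
    """
    rng = rng or np.random.default_rng()
    n = len(particles)
    w = np.exp(log_weights - np.max(log_weights))   # stabilize before normalizing
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                      # effective sample size
    if ess / n < ess_threshold:
        idx = rng.choice(n, size=n, p=w)            # multinomial resampling
        particles = [particles[i] for i in idx]
        log_weights = np.zeros(n)                   # weights reset after resampling
    return particles, log_weights
```

With a high threshold like 0.9, resampling fires as soon as the weights become even moderately uneven, keeping the particle population concentrated on promising partial sequences.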