Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

Authors: João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander Lew, Tim Vieira, Timothy O'Donnell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our approach and six alternatives to four challenging problem domains: Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis (§3.1). We find that, with little overhead, our approach significantly improves performance across domains, allowing small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. Table 2: Comparison of method performance across domains with bootstrapped 95% confidence intervals.
Researcher Affiliation | Academia | João Loula 1, Benjamin LeBrun 5, Li Du 6, Ben Lipkin 1, Clemente Pasti 2, Gabriel Grand 1, Tianyu Liu 2, Yahya Emara 2, Marjorie Freedman 8, Jason Eisner 6, Ryan Cotterell 2, Vikash Mansinghka 1, Alexander K. Lew 1,7, Tim Vieira 2, Timothy J. O'Donnell 3,4,5. 1 MIT, 2 ETH Zürich, 3 McGill, 4 Canada CIFAR AI Chair, 5 Mila, 6 Johns Hopkins, 7 Yale, 8 ISI
Pseudocode | Yes | Algorithm 1 Character proposal: This procedure implements a properly weighted proposal distribution for the unnormalized version of the locally constrained distribution ℓ_{ϕG}(· | x).
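The core idea behind a properly weighted constrained proposal can be sketched generically: sample the next symbol from the language model's distribution restricted to the symbols the constraint allows, and record the allowed probability mass as an incremental importance weight. This is a minimal illustration of the technique, not the paper's Algorithm 1; `probs` and `allowed` are hypothetical inputs standing in for the LM's next-symbol distribution and a constraint mask.

```python
import numpy as np

def constrained_step(probs, allowed, rng=None):
    """Sample one symbol from a masked distribution, with importance weight.

    probs:   next-symbol probabilities from the language model
    allowed: boolean mask of symbols the constraint (e.g. a grammar) permits
    Returns (symbol index, log incremental weight), or (None, -inf) on a dead end.
    """
    rng = rng or np.random.default_rng()
    masked = np.where(allowed, probs, 0.0)
    z = masked.sum()                    # probability mass the constraint allows
    if z == 0.0:
        return None, -np.inf            # no valid continuation from this prefix
    symbol = rng.choice(len(probs), p=masked / z)
    return symbol, np.log(z)            # weight corrects for the renormalization
```

Multiplying the per-step weights along a sampled sequence yields an unbiased estimate of the sequence's unnormalized probability under the constrained distribution, which is what makes the proposal "properly weighted."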
Open Source Code | Yes | https://github.com/probcomp/genlm-control
Open Datasets | Yes | Goal inference (Planetarium). ... Data: Blocksworld tasks with up to 10 objects from the Planetarium benchmark (Zuo et al., 2024). ... Python for data science (DS-1000). ... Data: DS-1000 benchmark (Lai et al., 2023). ... Text-to-SQL (Spider). ... Data: Spider development split (Yu et al., 2018). ... Molecular synthesis (GDB-17). ... Data: Few-shot prompts constructed by repeatedly choosing 20 random examples from the GDB-17 dataset (Ruddigkeit et al., 2012).
Dataset Splits | Yes | Data: Blocksworld tasks with up to 10 objects from the Planetarium benchmark (Zuo et al., 2024). ... Data: Few-shot prompts constructed by repeatedly choosing 20 random examples from the GDB-17 dataset (Ruddigkeit et al., 2012). ... Data: Spider development split (Yu et al., 2018).
Hardware Specification | Yes | We ran experiments on GCP instances with 1 A100 GPU and 12 vCPUs (our CFG parser is implemented for CPU and is parallelized across particles), with the exception of the Data Science domain, for which we used 4 H100 GPUs and 64 vCPUs.
Software Dependencies | No | The paper mentions software such as the Python RDKit library (Landrum, 2024), the Python partialsmiles library (O'Boyle, 2024), and the VAL plan validator (Howey et al., 2004). While these are specific tools, the paper gives no explicit version numbers for the libraries, which is required for a 'Yes' classification. It also names specific Llama model versions, but these are pre-trained models rather than the ancillary software dependencies needed to replicate the code environment.
Experiment Setup | Yes | We report results using N = 10 particles; see Appendix A.2 and Fig. 2 for downstream accuracy results for a varying number of particles. ... We ran the without-replacement baseline (SMC Steering) with N = 5 particles and a beam size of 3, alongside our approach using multinomial resampling with N = 10 particles (and an ESS threshold of 0.9).
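The resampling scheme quoted above (multinomial resampling triggered by an effective-sample-size threshold) is a standard SMC component and can be sketched as follows. This is a generic illustration consistent with the stated setup (N = 10 particles, ESS threshold 0.9), not the authors' implementation; `particles` here is an arbitrary list of partial sequences.

```python
import numpy as np

def maybe_resample(particles, log_weights, ess_threshold=0.9, rng=None):
    """Multinomial resampling when the normalized ESS drops below a threshold.

    ESS = 1 / sum(w_i^2) for normalized weights w; resampling fires when
    ESS / N < ess_threshold, after which weights are reset to uniform (log 0).
    """
    rng = rng or np.random.default_rng()
    n = len(particles)
    w = np.exp(log_weights - np.max(log_weights))   # stabilize before normalizing
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                      # effective sample size
    if ess / n < ess_threshold:
        idx = rng.choice(n, size=n, p=w)            # multinomial resampling
        particles = [particles[i] for i in idx]
        log_weights = np.zeros(n)                   # weights reset after resampling
    return particles, log_weights
```

With a high threshold like 0.9, resampling fires as soon as the weights become even moderately uneven, keeping the particle population concentrated on promising partial sequences.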