SynCode: LLM Generation with Grammar Augmentation

Authors: Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, Gagandeep Singh

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our experiments evaluating the effectiveness of SynCode for JSON generation demonstrate that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines. Furthermore, our results underscore how SynCode significantly reduces 96.07% of syntax errors in generated Python and Go code, showcasing its substantial impact on enhancing syntactical precision in LLM generation.
Researcher Affiliation | Collaboration | Shubham Ugare (University of Illinois Urbana-Champaign, USA); Tarun Suresh (University of Illinois Urbana-Champaign, USA); Hangoo Kang (University of Illinois Urbana-Champaign, USA); Sasa Misailovic (University of Illinois Urbana-Champaign, USA); Gagandeep Singh (University of Illinois Urbana-Champaign and VMware Research, USA)
Pseudocode | Yes | Algorithm 1: Masked LLM Generation; Algorithm 2: Computing Grammar Mask; Algorithm 3: SynCode Generation; Algorithm 4: Incremental Parsing Algorithm
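The listed algorithms center on grammar-masked decoding: at each step, only tokens that keep the partial output a valid prefix of the grammar are allowed, and the highest-scoring allowed token is chosen. The toy sketch below illustrates that idea only; the vocabulary, the `is_valid_prefix` check, and both function names are illustrative stand-ins, not SynCode's actual implementation (which uses an incremental parser and precomputed DFA mask stores).

```python
def compute_grammar_mask(partial_output, vocab, is_valid_prefix):
    """Boolean mask over the vocabulary: True where appending the token
    keeps the partial output a valid prefix under the (toy) grammar."""
    return [is_valid_prefix(partial_output + tok) for tok in vocab]

def masked_greedy_step(next_token_scores, partial_output, vocab, is_valid_prefix):
    """One greedy decoding step restricted to grammar-allowed tokens."""
    mask = compute_grammar_mask(partial_output, vocab, is_valid_prefix)
    best_tok, best_score = None, float("-inf")
    for tok, score, allowed in zip(vocab, next_token_scores, mask):
        if allowed and score > best_score:
            best_tok, best_score = tok, score
    return best_tok

# Toy JSON-like setup: braces must stay balanced (never close below zero).
vocab = ["{", "}", '"key"', ":", '"value"']

def is_valid_prefix(s):
    depth = 0
    for ch in s:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return True
```

For example, at the start of generation the mask rules out `}` even if the model scores it highest, so decoding can never leave the toy grammar.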
Open Source Code | Yes | Our code is available at https://github.com/uiuc-focal-lab/syncode
Open Datasets | Yes | We consider the JSON-Mode-Eval (Nous Research, 2024) dataset for text-to-JSON generation and the HumanEval and MBXP (Athiwaratkun et al., 2023) datasets for evaluating Python and Go code generation. We display examples of prompts from these datasets in Appendix A.7. JSON-Mode-Eval (Nous Research, 2024): consists of 100 zero-shot problems. Spider text-to-SQL (Yu et al., 2018): consists of 1,034 problems of varying difficulty levels: easy (250), medium (440), hard (174), and extra hard (170). Multilingual HumanEval (Athiwaratkun et al., 2023): an extension of the original HumanEval collection (Chen et al., 2021)... MBXP (Athiwaratkun et al., 2023): extended from the MBPP (Austin et al., 2021) dataset for Python to support other languages such as Go.
Dataset Splits | No | The paper lists dataset characteristics and sizes (e.g., the Spider difficulty counts: easy (250), medium (440), hard (174), and extra hard (170)) and mentions generating "n = 20 and n = 1 samples per problem" for code-completion tasks. However, it does not specify how these datasets are split into training, validation, or test sets for the models evaluated, nor give percentages or counts for such splits.
Hardware Specification | Yes | We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVidia RTX A5000 GPUs.
Software Dependencies | No | SynCode is implemented using PyTorch (Paszke et al., 2019a), the Hugging Face transformers library (Wolf et al., 2020), and the Lark library. While these libraries are mentioned, specific version numbers for PyTorch, transformers, and Lark are not provided in the text.
Experiment Setup | Yes | We set max new tokens nmax = 400. Greedy decoding is used, with \n\n as an additional stopping condition, for all experiments. We use the hyperparameters temperature = 0.2 and top p = 0.95.
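The reported decoding setup can be sketched as keyword arguments for a Hugging Face `model.generate(...)` call. This is an illustrative reconstruction, not taken from the paper's code: the model and tokenizer are omitted, and the split into a greedy configuration versus a sampling configuration (temperature/top-p) is an assumption about when each set of hyperparameters applies.

```python
# Greedy decoding with n_max = 400 and "\n\n" as an extra stop condition.
# (stop_strings in transformers requires passing tokenizer=... to generate.)
greedy_kwargs = dict(
    max_new_tokens=400,     # n_max = 400
    do_sample=False,        # greedy decoding
    stop_strings=["\n\n"],  # additional stopping condition
)

# Sampling hyperparameters as reported in the paper.
sampling_kwargs = dict(
    max_new_tokens=400,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
```

A call would then look like `model.generate(**inputs, **greedy_kwargs, tokenizer=tokenizer)`, with the sampling variant used when drawing multiple samples per problem.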