SynCode: LLM Generation with Grammar Augmentation

Authors: Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, Gagandeep Singh

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our experiments evaluating the effectiveness of SynCode for JSON generation demonstrate that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines. Furthermore, our results underscore how SynCode significantly reduces 96.07% of syntax errors in generated Python and Go code, showcasing its substantial impact on enhancing syntactical precision in LLM generation.
Researcher Affiliation | Collaboration | Shubham Ugare (University of Illinois Urbana-Champaign, USA); Tarun Suresh (University of Illinois Urbana-Champaign, USA); Hangoo Kang (University of Illinois Urbana-Champaign, USA); Sasa Misailovic (University of Illinois Urbana-Champaign, USA); Gagandeep Singh (University of Illinois Urbana-Champaign and VMware Research, USA)
Pseudocode | Yes | Algorithm 1: Masked LLM Generation; Algorithm 2: Computing Grammar Mask; Algorithm 3: SynCode Generation; Algorithm 4: Incremental Parsing Algorithm
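The listed algorithms center on grammar-masked decoding: at each step, only tokens that keep the partial output a valid prefix of the grammar are allowed, and the highest-scoring allowed token is chosen. The toy sketch below illustrates that idea only; the vocabulary, the `is_valid_prefix` check, and both function names are illustrative stand-ins, not SynCode's actual implementation (which uses an incremental parser and precomputed DFA mask stores).

```python
def compute_grammar_mask(partial_output, vocab, is_valid_prefix):
    """Boolean mask over the vocabulary: True where appending the token
    keeps the partial output a valid prefix under the (toy) grammar."""
    return [is_valid_prefix(partial_output + tok) for tok in vocab]

def masked_greedy_step(next_token_scores, partial_output, vocab, is_valid_prefix):
    """One greedy decoding step restricted to grammar-allowed tokens."""
    mask = compute_grammar_mask(partial_output, vocab, is_valid_prefix)
    best_tok, best_score = None, float("-inf")
    for tok, score, allowed in zip(vocab, next_token_scores, mask):
        if allowed and score > best_score:
            best_tok, best_score = tok, score
    return best_tok

# Toy JSON-like setup: braces must stay balanced (never close below zero).
vocab = ["{", "}", '"key"', ":", '"value"']

def is_valid_prefix(s):
    depth = 0
    for ch in s:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return True
```

For example, at the start of generation the mask rules out `}` even if the model scores it highest, so decoding can never leave the toy grammar.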
Open Source Code | Yes | Our code is available at https://github.com/uiuc-focal-lab/syncode
Open Datasets | Yes | We consider the JSON-Mode-Eval (Nous Research, 2024) dataset for text-to-JSON generation and the HumanEval and MBXP (Athiwaratkun et al., 2023) datasets for evaluating Python and Go code generation. We display examples of prompts from these datasets in Appendix A.7. JSON-Mode-Eval (Nous Research, 2024): consists of 100 zero-shot problems. Spider text-to-SQL (Yu et al., 2018): consists of 1,034 problems of varying difficulty levels: easy (250), medium (440), hard (174), and extra hard (170). Multilingual HumanEval (Athiwaratkun et al., 2023): an extension of the original HumanEval collection (Chen et al., 2021)... MBXP (Athiwaratkun et al., 2023): extended from the MBPP (Austin et al., 2021) dataset for Python to support other languages such as Go.
Dataset Splits | No | The paper lists dataset characteristics and sizes (e.g., the Spider difficulty counts: easy (250), medium (440), hard (174), and extra hard (170)) and mentions generating "n = 20 and n = 1 samples per problem" for code-completion tasks. However, it does not specify how these datasets are split into training, validation, or test sets for the models evaluated, nor give percentages or counts for such splits.
Hardware Specification | Yes | We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVidia RTX A5000 GPUs.
Software Dependencies | No | SynCode is implemented using PyTorch (Paszke et al., 2019a), the Hugging Face transformers library (Wolf et al., 2020), and the Lark library. While these libraries are mentioned, specific version numbers for PyTorch, transformers, and Lark are not provided in the text.
Experiment Setup | Yes | We set max new tokens nmax = 400. Greedy decoding is used, with \n\n as an additional stopping condition, for all experiments. We use the hyperparameters temperature = 0.2 and top p = 0.95.
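The reported decoding setup can be sketched as keyword arguments for a Hugging Face `model.generate(...)` call. This is an illustrative reconstruction, not taken from the paper's code: the model and tokenizer are omitted, and the split into a greedy configuration versus a sampling configuration (temperature/top-p) is an assumption about when each set of hyperparameters applies.

```python
# Greedy decoding with n_max = 400 and "\n\n" as an extra stop condition.
# (stop_strings in transformers requires passing tokenizer=... to generate.)
greedy_kwargs = dict(
    max_new_tokens=400,     # n_max = 400
    do_sample=False,        # greedy decoding
    stop_strings=["\n\n"],  # additional stopping condition
)

# Sampling hyperparameters as reported in the paper.
sampling_kwargs = dict(
    max_new_tokens=400,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
```

A call would then look like `model.generate(**inputs, **greedy_kwargs, tokenizer=tokenizer)`, with the sampling variant used when drawing multiple samples per problem.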