Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

Authors: William H English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average."
Researcher Affiliation | Academia | "1) Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida; 2) Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, Florida."
Pseudocode | Yes | "Algorithm 1: Temporal Logic Logits Processor"
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology: there is no repository link, no explicit code-release statement, and no indication of code in the supplementary materials.
Open Datasets | Yes | "Our evaluation datasets include Navigation (Wang et al., 2021), GLTL (Gopalan et al., 2018), and CW (MacGlashan et al., 2015). Some statistics on these datasets are given in Appendix A.1."
Dataset Splits | No | "We perform our evaluation of the translation models and end-to-end approaches using 1000 examples from each dataset." No explicit train/validation/test splits are reported.
Hardware Specification | Yes | "We conducted our evaluation on a machine with one NVIDIA RTX 4070 Ti Super GPU, one Intel i9-14900KF 32 Core CPU, and 64GB of RAM."
Software Dependencies | No | The paper mentions models such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), and states "The T5 checkpoint provided at the Hugging Face-hosted repository (Raffel et al., 2020)", but it does not specify concrete version numbers for any software libraries, frameworks, or environments used to implement or run the experiments.
Experiment Setup | Yes | "Each AP grounding model was trained for 3 epochs at a learning rate of 1e-5. Each translation model was trained for 3 epochs at a learning rate of 2e-5."
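The "Temporal Logic Logits Processor" named in the pseudocode row suggests grammar-forced decoding: at each generation step, tokens that would violate the temporal-logic grammar are masked out before the next token is chosen. The paper's actual algorithm is not reproduced here; the toy vocabulary, token ids, and helper names below are illustrative assumptions, shown only to sketch the general masking idea.

```python
# Minimal sketch of grammar-forced decoding (assumed mechanism, not the
# paper's implementation): logits of grammar-disallowed tokens are set to
# -inf, so a greedy step can only pick a grammar-legal token.
import math

def mask_logits(logits, allowed_ids):
    """Return a copy of `logits` with disallowed token logits set to -inf."""
    return [l if i in allowed_ids else -math.inf for i, l in enumerate(logits)]

def greedy_constrained_step(logits, allowed_ids):
    """Pick the highest-scoring token among those the grammar allows."""
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=lambda i: masked[i])

# Toy vocabulary (hypothetical): 0="G", 1="F", 2="(", 3=")", 4="p"
logits = [0.1, 2.0, 0.5, 1.5, 0.3]
# Suppose the grammar only permits "(" or an atomic proposition next:
allowed = {2, 4}
print(greedy_constrained_step(logits, allowed))  # picks token 2, i.e. "("
```

In a real decoder this masking would run once per step inside the sampling loop (e.g. as a Hugging Face `LogitsProcessor`), with the allowed-token set derived from the grammar's current parser state.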