Breaking the Self-Evaluation Barrier: Reinforced Neuro-Symbolic Planning with Large Language Models
Authors: Jie-Jing Shao, Hong-Jie You, Guohao Cai, Quanyu Dai, Zhenhua Dong, Lan-Zhe Guo
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach significantly improves planning accuracy and constraint satisfaction across various domains, outperforming traditional self-evaluation methods. It highlights the potential of hybrid neuro-symbolic systems to address complex constrained planning tasks. (Section 4.1, Experimental Setup) We evaluate our proposal on diverse tasks, including Game of 24, Game of 28, Game of 30, Constrained Knapsack, and Travel Planning. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2School of Artificial Intelligence, Nanjing University, Nanjing, China 3Huawei Noah's Ark Lab, Shenzhen, China 4School of Intelligence Science and Technology, Nanjing University, Nanjing, China EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 The proposed RNSP |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing code, nor does it include a link to a code repository. The 'Conclusion and Discussion' section mentions future work related to a reward model but does not refer to the current work's code release. |
| Open Datasets | Yes | For the game of 24, we follow the [Yao et al., 2023a] and collect the data from 4nums.com, a website hosting mathematical games, specifically selecting 1,362 games sorted by human solving time from easy to hard. We further conduct the experiments on a real-world planning benchmark Travel Planner [Xie et al., 2024]. |
| Dataset Splits | Yes | The samples indexed 800-900 are utilized to train, and the samples indexed 901-1000 are utilized to test. For the Game of 28 and Game of 30... The scale of the training and testing data is the same as that for the Game of 24, with each consisting of 100 problems. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using specific Large Language Models (LLMs) such as DeepSeek-V3, GPT-4o, GPT-4o-mini, and GPT-3.5-turbo, but it does not specify version numbers for any ancillary software, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | We set a beam width $B$ to control the complexity of the search. This structured search ensures that computational resources are efficiently used to explore feasible solutions. Given a state $s_t$ in the current candidates, the LLMs $\phi$ are employed to generate $K$ action candidates $[a_t^1, a_t^2, \ldots, a_t^K] \sim g(a_t \mid s_t, \phi)$. |
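The search loop described in the Experiment Setup row can be sketched as a generic beam search. This is a minimal illustration, not the authors' RNSP implementation: in the paper the `propose` callable would be an LLM sampling $K$ action candidates and `score` a learned or symbolic evaluator, whereas the toy arithmetic scorer in the usage below is purely hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type


def beam_search(
    initial_state: S,
    propose: Callable[[S], List[A]],    # stands in for LLM sampling g(a_t | s_t, phi)
    apply_action: Callable[[S, A], S],  # state transition
    score: Callable[[S], float],        # state evaluation; higher is better
    beam_width: int = 3,                # B: states kept after each step
    num_candidates: int = 5,            # K: action candidates per state
    depth: int = 3,                     # search horizon
) -> S:
    """Plain beam search; returns the best state reached at the final depth."""
    beam: List[Tuple[float, S]] = [(score(initial_state), initial_state)]
    for _ in range(depth):
        expanded: List[Tuple[float, S]] = []
        for _, state in beam:
            for action in propose(state)[:num_candidates]:
                nxt = apply_action(state, action)
                expanded.append((score(nxt), nxt))
        if not expanded:
            break
        # Prune to the top-B states to bound the search complexity.
        expanded.sort(key=lambda pair: pair[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]


# Toy usage: reach a value close to a target of 10 by adding 1, 2, or 3
# per step (an assumed stand-in task, unrelated to the paper's benchmarks).
best = beam_search(
    initial_state=0,
    propose=lambda s: [1, 2, 3],
    apply_action=lambda s, a: s + a,
    score=lambda s: -abs(10 - s),
    beam_width=2,
    num_candidates=3,
    depth=3,
)
```

With three steps of at most +3 each, the closest reachable value to 10 is 9, which the search finds by keeping the two highest-scoring states at each depth.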