Learn to Think: Bootstrapping LLM Logic Through Graph Representation Learning
Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. |
| Researcher Affiliation | Academia | Hang Gao (1,2), Chenhao Zhang (1,2,3), Tie Wang (4), Junsuo Zhao (1,2,3), Fengge Wu (1,2,3), Changwen Zheng (1,2,3), Huaping Liu (5). 1: Institute of Software, Chinese Academy of Sciences; 2: National Key Laboratory of Space Integrated Information System; 3: University of Chinese Academy of Sciences; 4: Peking University; 5: Tsinghua University. |
| Pseudocode | No | The paper describes the method in prose and flow diagrams (Figure 3) but does not include a dedicated pseudocode block or algorithm listing. |
| Open Source Code | Yes | Code can be found in https://github.com/zch65458525/L2T. |
| Open Datasets | Yes | Tasks We evaluated our method on four distinct tasks: Sudoku, the Game of 24, Truth Quest [Mondorf and Plank, 2024], and Creative Writing. |
| Dataset Splits | No | Min and Max represent the best and worst performances achieved by a method, respectively, in terms of the number of correct solutions out of 13 total puzzle sets. The paper describes problem sets for evaluation but does not specify train/validation/test splits for model training or evaluation. |
| Hardware Specification | No | We utilized the GPT-4o API to conduct all the experiments, including those for the baselines. The paper does not specify any particular hardware used for running experiments or the GNN module. |
| Software Dependencies | No | We utilized the GPT-4o API to conduct all the experiments, including those for the baselines. No specific software versions for frameworks (e.g., PyTorch, TensorFlow) or libraries used for the GNN are mentioned. |
| Experiment Setup | Yes | For the implementation of g(·), we utilize a one-layer Graph Convolutional Network (GCN) [Kipf and Welling, 2017] followed by a two-layer Multi-Layer Perceptron (MLP). We adopt the widely used PPO framework [Schulman et al., 2017] for LLM training as the specific implementation of the Actor-Critic algorithm, optimizing and updating the Actor and Critic that we have constructed. The reward r_k is set to 100 if the generated thought represents the final result. Otherwise, it is an integer between 0 and 10, determined by the LLM based on G(k) and X_eva. |
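The setup row describes a thought scorer g(·) built from a one-layer GCN followed by a two-layer MLP, plus a simple reward rule. A minimal NumPy sketch of that architecture follows; the function names, dimensions, readout choice (mean pooling), and random weights are illustrative assumptions, not the authors' implementation (which is in the linked repository):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

def mlp2(h, W1, b1, W2, b2):
    """Two-layer MLP head applied to pooled graph features."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

def score_thought_graph(A, X, params):
    """g(.): one-layer GCN -> mean-pool readout -> two-layer MLP -> scalar."""
    H = gcn_layer(A, X, params["Wg"])
    pooled = H.mean(axis=0)                 # graph-level readout (assumed)
    return mlp2(pooled, params["W1"], params["b1"],
                params["W2"], params["b2"]).item()

def reward(is_final, llm_score):
    """Paper's reward rule: 100 for a final result, else an int in [0, 10]."""
    return 100 if is_final else int(np.clip(round(llm_score), 0, 10))

# Toy example: a 4-node chain of thoughts with random features/weights.
rng = np.random.default_rng(0)
n, d_in, d_g, d_h = 4, 8, 16, 32
params = {
    "Wg": rng.normal(size=(d_in, d_g)),
    "W1": rng.normal(size=(d_g, d_h)), "b1": np.zeros(d_h),
    "W2": rng.normal(size=(d_h, 1)),   "b2": np.zeros(1),
}
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(n, d_in))
print(score_thought_graph(A, X, params))
print(reward(True, 0), reward(False, 7.4))
```

The scalar output would serve as the Critic's value estimate inside the PPO loop mentioned above; the reward function mirrors the quoted rule directly.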