RuAG: Learned-rule-augmented Generation for Large Language Models
Authors: Yudi Zhang, Pei Xiao, Lu Wang, Chaoyun Zhang, Meng Fang, Yali Du, Yevgeniy Puzyrev, Randolph Yao, Si Qin, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravanakumar Rajmohan, Qi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | 1 Eindhoven University of Technology, 2 Peking University, 3 Microsoft, 4 University of Liverpool, 5 King's College London |
| Pseudocode | No | The paper describes the MCTS process in Section 3.2 by outlining its phases (selection, expansion, simulation, backpropagation) and providing the UCT formula, but it does not present this information in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Project link: https://github.com/microsoft/RuAG. |
| Open Datasets | Yes | We evaluate our framework across diverse scenarios, including public tasks in NLP (relation extraction on DWIE), time-series (log anomaly detection on HDFS), decision-making (the cooperative game Alice and Bob), and an industrial task in abuse detection, demonstrating its effectiveness in enhancing LLMs' capability over diverse tasks. Project link: https://github.com/microsoft/RuAG. |
| Dataset Splits | Yes | We conduct experiments on the DWIE dataset (Zaporojets et al., 2021), which contains 802 documents and 23,130 entities. After excluding irrelevant articles, 700 documents are used for training and 97 for testing. Also: The dataset is split chronologically into training, validation, and test sets with a ratio of 8:1:1. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running the experiments, such as exact GPU or CPU models, memory, or detailed computer specifications. |
| Software Dependencies | No | The paper mentions using GPT-3.5 (gpt-35-turbo-16k-20230613) and GPT-4 (gpt-4-20230613) as LLM backbones, but does not provide specific version numbers for ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch), or other solvers. |
| Experiment Setup | Yes | We provide detailed implementation for the three public tasks and the hyperparameters in Table A5. |
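The Pseudocode row above notes that the paper describes the MCTS phases (selection, expansion, simulation, backpropagation) and the UCT formula in prose rather than as an algorithm block. As a point of reference, a minimal sketch of the two phases that the UCT formula governs is shown below; this is an illustrative reconstruction of generic UCT-based MCTS, not the authors' implementation, and all class and function names are hypothetical.

```python
import math

class Node:
    """Hypothetical MCTS search-tree node (illustrative only)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct_score(node, c=1.414):
    # UCT = exploitation (mean value) + exploration bonus.
    # Unvisited nodes score infinity so they are always tried first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def select(node):
    # Selection phase: descend the tree by maximal UCT until a leaf.
    while node.children:
        node = max(node.children, key=uct_score)
    return node

def backpropagate(node, reward):
    # Backpropagation phase: update visit counts and values up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

With expansion (adding children to a leaf) and simulation (a rollout returning a reward) plugged in between `select` and `backpropagate`, these pieces form the standard MCTS loop the paper's Section 3.2 describes.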