GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation

Authors: Tao Feng, Yihang Sun, Jiaxuan You

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on two datasets show GraphEval improves F1 scores by at least 14% with low computation and API costs. Additionally, GraphEval can effectively detect plagiarized ideas.
Researcher Affiliation Academia Tao Feng (1*), Yihang Sun (2*), Jiaxuan You (1). 1: University of Illinois at Urbana-Champaign; 2: Peking University.
Pseudocode Yes Algorithm 1 Training of GraphEval. Require: dataset D_train = {(x, y)}; a weighted GNN f_φ; edge weights w_v; number of GNN layers L.
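The quantities named in Algorithm 1 (node features, per-edge weights w_v, and L message-passing layers) can be illustrated with a minimal sketch of one forward pass of a weighted GNN. Plain Python stands in for the paper's PyTorch/PyG implementation; the function name, aggregation rule, and toy graph below are illustrative assumptions, not the authors' code.

```python
def weighted_gnn_forward(features, edges, weights, num_layers=2):
    """features: {node: float}; edges: list of (u, v) pairs;
    weights: {(u, v): float edge weight}; num_layers: L."""
    h = dict(features)
    for _ in range(num_layers):
        new_h = {}
        for node in h:
            # Weighted aggregation over incoming neighbors, plus a self-loop.
            total, norm = h[node], 1.0
            for (u, v) in edges:
                if v == node:
                    w = weights[(u, v)]
                    total += w * h[u]
                    norm += w
            new_h[node] = total / norm  # normalized (mean-style) update
        h = new_h
    return h

# Toy viewpoint graph: three nodes, two weighted edges.
out = weighted_gnn_forward(
    features={0: 1.0, 1: 0.0, 2: 0.0},
    edges=[(0, 1), (1, 2)],
    weights={(0, 1): 0.5, (1, 2): 0.5},
)
```

With L = 2 layers, information from node 0 reaches node 2, which is the point of stacking message-passing layers; the real model additionally learns the layer parameters by gradient descent on (x, y) pairs from D_train.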
Open Source Code Yes Our code for GraphEval is released at https://github.com/ulab-uiuc/GraphEval.
Open Datasets Yes ICLR Papers: We collect abstracts and review decisions from paper submissions to the ICLR conferences between 2021 and 2023. From this, we randomly select 300 papers as the training set for learning-based methods and 50 papers as the test set. AI Researcher Dataset: We use the dataset collected by Si et al. (2024) in AI Researcher as an additional test set, which contains academic papers focusing on the domain of novel prompting methods.
Dataset Splits Yes ICLR Papers: From this, we randomly select 300 papers as the training set for learning-based methods and 50 papers as the test set. AI Researcher Dataset: For testing other methods, we split the dataset into training and testing sets in an 85%:15% ratio and conduct multiple experiments to average the results, thereby reducing bias. ASAP-Review dataset: We divided the dataset into training, validation, and test sets in the proportions of 70%, 10%, and 20%, respectively.
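The split protocols quoted above (85%:15% for the AI Researcher set, 70%/10%/20% for ASAP-Review) amount to a shuffle-then-chunk procedure. A hedged sketch, assuming a simple seeded shuffle; the helper name and rounding convention are hypothetical, not taken from the paper:

```python
import random

def split_dataset(items, fractions, seed=0):
    """Shuffle items, then cut them into consecutive chunks by fraction.
    The last chunk absorbs any rounding remainder."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(round(frac * len(items)))
        splits.append(items[start:end])
        start = end
    splits.append(items[start:])
    return splits

# 70%/10%/20% split as described for the ASAP-Review dataset.
papers = list(range(1000))
train, val, test = split_dataset(papers, [0.70, 0.10, 0.20])
```

Re-running with different seeds and averaging, as the paper describes for the 85%:15% case, reduces the variance introduced by any single random partition.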
Hardware Specification Yes Our proposed method is implemented using PyTorch and PyTorch Geometric (PyG), with all experiments conducted on a single NVIDIA A100 Tensor Core GPU.
Software Dependencies No The paper mentions using "PyTorch" and "PyTorch Geometric (PyG)" but does not provide specific version numbers for these software components, nor for the Adam optimizer.
Experiment Setup Yes During the training phase, we configured the graph neural network as a two-layer weighted GNN with a hidden dimension of 64. The batch size is set to 64, and the maximum number of training epochs is limited to 1000. We employ the Adam optimizer (Diederik, 2014) for training and gradually reduce the learning rate from 1e-3 to 0 using a LambdaLR scheduler.
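The reported schedule, linear decay from 1e-3 to 0 over 1000 epochs, can be sketched as a standalone multiplier function. In PyTorch this multiplier would be passed to `torch.optim.lr_scheduler.LambdaLR`; it is shown here without the framework, and the exact decay shape is an assumption consistent with, but not stated in, the quoted setup.

```python
# Linear learning-rate decay from BASE_LR to 0 over MAX_EPOCHS epochs.
BASE_LR = 1e-3
MAX_EPOCHS = 1000

def lr_lambda(epoch):
    """Multiplicative factor applied to the base learning rate;
    this is the function a LambdaLR scheduler would receive."""
    return max(0.0, 1.0 - epoch / MAX_EPOCHS)

def learning_rate(epoch):
    return BASE_LR * lr_lambda(epoch)
```

At epoch 0 the rate is the full 1e-3, at epoch 500 it has halved, and by epoch 1000 it reaches 0, matching the "1e-3 to 0" range in the setup description.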