GraphArena: Evaluating and Exploring Large Language Models on Graph Computation

Authors: Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Evaluation of over 10 LLMs reveals that even top-performing LLMs struggle with larger, more complex graph problems and exhibit hallucination issues. We further explore four potential solutions to address this issue and improve LLMs on graph computation..." "In our experiments, we extend the comparative analysis beyond LLMs by incorporating a diverse set of baseline methods..."
Researcher Affiliation | Academia | "Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li; Hong Kong University of Science and Technology (Guangzhou); EMAIL, EMAIL"
Pseudocode | No | The paper describes various tasks and algorithms (e.g., exact algorithms for ground truth, greedy algorithms, approximation algorithms, GNNs) but does not present any pseudocode or algorithm blocks for its own methodology or proposed solutions.
Open Source Code | Yes | "GraphArena complements the existing LLM benchmarks and is open-sourced at https://github.com/squareRoot3/GraphArena."
Open Datasets | Yes | "GraphArena distinguishes itself from previous benchmarks by utilizing real-world graphs... Graphs are collected from five sources: DBLP (Ley, 2002), an academic collaboration network... Social Network (Rossi & Ahmed, 2015)... DBpedia (Bizer et al., 2009)... OpenFlights (OpenFlights)... and PubChemQC (Nakata & Shimazaki, 2017)..."
Dataset Splits | No | "For each of the 10 tasks, we randomly sample 500 small and 500 large graphs to create two distinct subsets, yielding a total of 10,000 graphs. For difficulty calibration, we use task-specific graph scales..." "Additionally, we fine-tuned Llama3-8b and Qwen2-7b on an additional 10,000 GraphArena problems, using ground-truth solution paths as supervision." The paper mentions 10,000 problems for fine-tuning and 10,000 for evaluation, but does not specify a train/validation split for the fine-tuning process.
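The quoted sampling scheme (500 small plus 500 large graphs for each of the 10 tasks, 10,000 in total) can be sketched with the standard library. The task names, graph pool, and `is_small` difficulty predicate below are hypothetical placeholders; the paper's actual task-specific size calibration is not reproduced here.

```python
import random

TASKS = [f"task_{i}" for i in range(10)]  # placeholder names for the 10 GraphArena tasks


def build_benchmark(graph_pool, is_small, per_bucket=500, seed=0):
    """Sample `per_bucket` small and `per_bucket` large graphs per task.

    graph_pool: a list of graph objects (any hashable stand-in works here).
    is_small:   predicate (task, graph) -> bool implementing the
                task-specific difficulty calibration (hypothetical).
    """
    rng = random.Random(seed)
    benchmark = {}
    for task in TASKS:
        small = [g for g in graph_pool if is_small(task, g)]
        large = [g for g in graph_pool if not is_small(task, g)]
        benchmark[task] = {
            "small": rng.sample(small, per_bucket),
            "large": rng.sample(large, per_bucket),
        }
    return benchmark
```

With 500 graphs in each of two buckets across 10 tasks, the benchmark contains exactly 10,000 sampled problems, matching the count the paper reports.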
Hardware Specification | Yes | "Smaller open-source models were run on our local infrastructure equipped with four NVIDIA H800 PCIe 80GB GPUs."
Software Dependencies | No | The paper mentions LLMs such as GPT-4o, Claude-3.5-sonnet, Llama3, Qwen2.5, DeepSeek-V2, Mixtral-8x7b, and Gemma-7b, along with some training parameters, but does not provide version numbers for the software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | "We utilized a low temperature setting of 0.1, imposed no constraints on output length, and kept all other configurations at their defaults for all LLM models. The full-parameter fine-tuning process was conducted with a learning rate of 0.0001, using a cosine learning rate scheduler with a warmup ratio of 0.1, and a batch size of 4."
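The reported fine-tuning schedule (learning rate 0.0001, cosine scheduler, warmup ratio 0.1) can be made concrete with the common linear-warmup-then-cosine-decay formulation. The paper does not specify which scheduler implementation was used, so this is a sketch of the standard variant, not the authors' exact code.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Defaults mirror the reported setup: base_lr=1e-4, warmup_ratio=0.1.
    The decay-to-zero endpoint is an assumption of this sketch.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from (base_lr / warmup_steps) up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For a 1,000-step run this warms up over the first 100 steps, peaks at 1e-4, and decays toward zero by the final step.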