GraphArena: Evaluating and Exploring Large Language Models on Graph Computation

Authors: Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Evaluation of over 10 LLMs reveals that even top-performing LLMs struggle with larger, more complex graph problems and exhibit hallucination issues. We further explore four potential solutions to address this issue and improve LLMs on graph computation..." "In our experiments, we extend the comparative analysis beyond LLMs by incorporating a diverse set of baseline methods..."
Researcher Affiliation | Academia | "Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li; Hong Kong University of Science and Technology (Guangzhou); EMAIL, EMAIL"
Pseudocode | No | The paper describes various tasks and algorithms (e.g., exact algorithms for ground truth, greedy algorithms, approximation algorithms, GNNs) but does not present any pseudocode or algorithm blocks for its own methodology or proposed solutions.
Open Source Code | Yes | "GraphArena complements the existing LLM benchmarks and is open-sourced at https://github.com/squareRoot3/GraphArena."
Open Datasets | Yes | "GraphArena distinguishes itself from previous benchmarks by utilizing real-world graphs... Graphs are collected from five sources: DBLP (Ley, 2002), an academic collaboration network... Social Network (Rossi & Ahmed, 2015)... DBpedia (Bizer et al., 2009)... OpenFlights (OpenFlights)... and PubChemQC (Nakata & Shimazaki, 2017)..."
Dataset Splits | No | "For each of the 10 tasks, we randomly sample 500 small and 500 large graphs to create two distinct subsets, yielding a total of 10,000 graphs. For difficulty calibration, we use task-specific graph scales..." "Additionally, we fine-tuned Llama3-8b and Qwen2-7b on an additional 10,000 GraphArena problems, using ground-truth solution paths as supervision." The paper mentions 10,000 problems for fine-tuning and 10,000 for evaluation, but does not specify a train/validation split for the fine-tuning process.
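The quoted sampling scheme (500 small plus 500 large graphs for each of the 10 tasks, 10,000 in total) can be sketched with the standard library. The task names, graph pool, and `is_small` difficulty predicate below are hypothetical placeholders; the paper's actual task-specific size calibration is not reproduced here.

```python
import random

TASKS = [f"task_{i}" for i in range(10)]  # placeholder names for the 10 GraphArena tasks


def build_benchmark(graph_pool, is_small, per_bucket=500, seed=0):
    """Sample `per_bucket` small and `per_bucket` large graphs per task.

    graph_pool: a list of graph objects (any hashable stand-in works here).
    is_small:   predicate (task, graph) -> bool implementing the
                task-specific difficulty calibration (hypothetical).
    """
    rng = random.Random(seed)
    benchmark = {}
    for task in TASKS:
        small = [g for g in graph_pool if is_small(task, g)]
        large = [g for g in graph_pool if not is_small(task, g)]
        benchmark[task] = {
            "small": rng.sample(small, per_bucket),
            "large": rng.sample(large, per_bucket),
        }
    return benchmark
```

With 500 graphs in each of two buckets across 10 tasks, the benchmark contains exactly 10,000 sampled problems, matching the count the paper reports.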
Hardware Specification | Yes | "Smaller open-source models were run on our local infrastructure equipped with four NVIDIA H800 PCIe 80GB GPUs."
Software Dependencies | No | The paper mentions LLMs such as GPT-4o, Claude-3.5-sonnet, Llama3, Qwen2.5, DeepSeek-V2, Mixtral-8x7b, and Gemma-7b, along with some training parameters, but does not provide version numbers for the software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | "We utilized a low temperature setting of 0.1, imposed no constraints on output length, and kept all other configurations at their defaults for all LLM models. The full-parameter fine-tuning process was conducted with a learning rate of 0.0001, using a cosine learning rate scheduler with a warmup ratio of 0.1, and a batch size of 4."
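The reported fine-tuning schedule (learning rate 0.0001, cosine scheduler, warmup ratio 0.1) can be made concrete with the common linear-warmup-then-cosine-decay formulation. The paper does not specify which scheduler implementation was used, so this is a sketch of the standard variant, not the authors' exact code.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Defaults mirror the reported setup: base_lr=1e-4, warmup_ratio=0.1.
    The decay-to-zero endpoint is an assumption of this sketch.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from (base_lr / warmup_steps) up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For a 1,000-step run this warms up over the first 100 steps, peaks at 1e-4, and decays toward zero by the final step.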