Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?
Authors: Yifan Feng, Chengwu Yang, Xingliang Hou, Shaoyi Du, Shihui Ying, Zongze Wu, Yue Gao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark's effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. |
| Researcher Affiliation | Academia | Yifan Feng1, Chengwu Yang2, Xingliang Hou3, Shaoyi Du2, Shihui Ying4, Zongze Wu5*, Yue Gao1. 1School of Software, BNRist, THUIBCS, BLBCI, Tsinghua University; 2Institute of Artificial Intelligence and Robotics, College of Artificial Intelligence, Xi'an Jiaotong University; 3School of Software, Xi'an Jiaotong University; 4Department of Mathematics, School of Science, Shanghai University; 5College of Mechatronics and Control Engineering, Shenzhen University |
| Pseudocode | No | The paper describes methods and a prompt framework visually and textually (e.g., Figure 3, Figure 4, Section 3.1, 3.2, 3.3) but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The source codes are at https://github.com/iMoonLab/LLM4Hypergraph. |
| Open Datasets | No | The paper mentions using "synthetic and real-world hypergraphs from citation networks and protein structures" and discusses "Coauthorship dataset and the Protein dataset". While these are types of commonly available data, the paper does not provide specific named datasets, direct URLs, DOIs, or bibliographic citations for *accessing* these specific datasets, as required by the criteria for 'Yes'. |
| Dataset Splits | No | The paper describes the composition of its benchmark ("21,500 problems", "1,500 samples" per task type) and categorizes hypergraphs by scale (small, medium, large). However, it does not provide explicit training/validation/test splits in percentages or absolute counts, because the work evaluates existing LLMs on a benchmark rather than training a new model with specific data splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the evaluations, such as GPU models, CPU types, or memory specifications. It only lists the LLMs evaluated (e.g., ERNIE-Lite-8K, GPT-4o). |
| Software Dependencies | No | The paper mentions evaluating specific LLMs (e.g., GPT-4o, LLaMA3-8B) and uses the 'DHG toolkit' for generating synthetic hypergraphs, but it does not provide specific version numbers for any software dependencies. The criteria for 'Yes' require specific version numbers for key software components. |
| Experiment Setup | No | The paper describes the prompting framework (Zero-Shot, Few-Shot, CoT, Hyper-BAG, Hyper-COT), states how in-context examples are provided ("two examples by default" for Few-Shot/CoT), and mentions balancing positive-to-negative ratios for Decision Problems. However, it does not provide concrete hyperparameter values or system-level training settings such as learning rates, batch sizes, or optimizer configurations, which are typically expected in experimental setup details for training models. |
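As a rough illustration of the few-shot prompting setup described above (worked examples prepended to a serialized hypergraph, "two examples by default"), here is a minimal sketch. All names (`hyperedge_list`, `build_prompt`) and the example texts are hypothetical, not taken from the LLM4Hypergraph codebase, and the serialization shown is only one of the seven hypergraph languages the paper mentions.

```python
# Hypothetical sketch of a few-shot prompt builder for hypergraph tasks.
# Not the authors' implementation; serialization format is assumed.

def hyperedge_list(hyperedges):
    """Serialize a hypergraph as a plain-text list of hyperedges."""
    return "\n".join(
        f"Hyperedge {i}: vertices {sorted(e)}" for i, e in enumerate(hyperedges)
    )

def build_prompt(hyperedges, question, examples=()):
    """Assemble a few-shot prompt: worked examples first, then the target instance."""
    parts = []
    for ex_desc, ex_q, ex_a in examples:
        parts.append(f"{ex_desc}\nQ: {ex_q}\nA: {ex_a}")
    parts.append(f"{hyperedge_list(hyperedges)}\nQ: {question}\nA:")
    return "\n\n".join(parts)

# Two in-context examples, mirroring the "two examples by default" setting.
demo = [
    ("Hyperedge 0: vertices [0, 1]", "Are vertices 0 and 1 connected?", "Yes"),
    ("Hyperedge 0: vertices [2, 3]", "Are vertices 0 and 3 connected?", "No"),
]
prompt = build_prompt([{0, 1, 2}, {2, 3}], "Are vertices 0 and 3 connected?", demo)
print(prompt)
```

A zero-shot variant would simply pass `examples=()`; Hyper-BAG/Hyper-COT would additionally inject high-order reasoning instructions into the prompt text.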