Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?
Authors: Yifan Feng, Chengwu Yang, Xingliang Hou, Shaoyi Du, Shihui Ying, Zongze Wu, Yue Gao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark's effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. |
| Researcher Affiliation | Academia | Yifan Feng1, Chengwu Yang2, Xingliang Hou3, Shaoyi Du2, Shihui Ying4, Zongze Wu5*, Yue Gao1. 1School of Software, BNRist, THUIBCS, BLBCI, Tsinghua University; 2Institute of Artificial Intelligence and Robotics, College of Artificial Intelligence, Xi'an Jiaotong University; 3School of Software, Xi'an Jiaotong University; 4Department of Mathematics, School of Science, Shanghai University; 5College of Mechatronics and Control Engineering, Shenzhen University |
| Pseudocode | No | The paper describes methods and a prompt framework visually and textually (e.g., Figure 3, Figure 4, Section 3.1, 3.2, 3.3) but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The source codes are at https://github.com/iMoonLab/LLM4Hypergraph. |
| Open Datasets | No | The paper mentions using "synthetic and real-world hypergraphs from citation networks and protein structures" and discusses "Coauthorship dataset and the Protein dataset". While these are types of commonly available data, the paper does not provide specific named datasets, direct URLs, DOIs, or bibliographic citations for *accessing* these specific datasets, as required by the criteria for 'Yes'. |
| Dataset Splits | No | The paper describes the composition of its benchmark ("21,500 problems", "1,500 samples" per task type) and categorizes hypergraphs by scale (small, medium, large). However, it does not provide explicit training/validation/test splits in percentages or absolute counts, because the work evaluates existing LLMs on a benchmark rather than training a new model with specific data splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the evaluations, such as GPU models, CPU types, or memory specifications. It only lists the LLMs evaluated (e.g., ERNIE-Lite-8K, GPT-4o). |
| Software Dependencies | No | The paper mentions evaluating specific LLMs (e.g., GPT-4o, LLaMA3-8B) and uses the 'DHG toolkit' for generating synthetic hypergraphs, but it does not provide specific version numbers for any software dependencies. The criteria for 'Yes' require specific version numbers for key software components. |
| Experiment Setup | No | The paper describes the prompting framework (Zero-Shot, Few-Shot, CoT, Hyper-BAG, Hyper-COT), states how in-context examples are provided ("two examples by default" for Few-Shot/CoT), and mentions balancing positive-to-negative ratios for Decision Problems. However, it does not provide concrete hyperparameter values or system-level training settings such as learning rates, batch sizes, or optimizer configurations, which are typically expected in experimental setup details for training models. |
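As a rough illustration of the few-shot prompting setup described above (worked examples prepended to a serialized hypergraph, "two examples by default"), here is a minimal sketch. All names (`hyperedge_list`, `build_prompt`) and the example texts are hypothetical, not taken from the LLM4Hypergraph codebase, and the serialization shown is only one of the seven hypergraph languages the paper mentions.

```python
# Hypothetical sketch of a few-shot prompt builder for hypergraph tasks.
# Not the authors' implementation; serialization format is assumed.

def hyperedge_list(hyperedges):
    """Serialize a hypergraph as a plain-text list of hyperedges."""
    return "\n".join(
        f"Hyperedge {i}: vertices {sorted(e)}" for i, e in enumerate(hyperedges)
    )

def build_prompt(hyperedges, question, examples=()):
    """Assemble a few-shot prompt: worked examples first, then the target instance."""
    parts = []
    for ex_desc, ex_q, ex_a in examples:
        parts.append(f"{ex_desc}\nQ: {ex_q}\nA: {ex_a}")
    parts.append(f"{hyperedge_list(hyperedges)}\nQ: {question}\nA:")
    return "\n\n".join(parts)

# Two in-context examples, mirroring the "two examples by default" setting.
demo = [
    ("Hyperedge 0: vertices [0, 1]", "Are vertices 0 and 1 connected?", "Yes"),
    ("Hyperedge 0: vertices [2, 3]", "Are vertices 0 and 3 connected?", "No"),
]
prompt = build_prompt([{0, 1, 2}, {2, 3}], "Are vertices 0 and 3 connected?", demo)
print(prompt)
```

A zero-shot variant would simply pass `examples=()`; Hyper-BAG/Hyper-COT would additionally inject high-order reasoning instructions into the prompt text.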