A Comprehensive Evaluation on Event Reasoning of Large Language Models

Authors: Zhengwei Tao, Zhi Jin, Yifan Zhang, Xiancai Chen, Haiyan Zhao, Jia Li, Bin Liang, Chongyang Tao, Qun Liu, Kam-Fai Wong

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on EV2 to answer these questions. The results provide the following insights into event reasoning: 1) LLMs have event reasoning abilities, but they are far from satisfactory; 2) LLMs' abilities are imbalanced across different relations and reasoning paradigms; 3) LLMs have event schema knowledge, but they are not aligned with humans in how they leverage it. Based on these findings, we investigate guiding LLMs to utilize event schema knowledge. With this guidance, LLMs perform better event reasoning, which sheds light on modeling event knowledge as memory for LLMs to enhance event reasoning. We summarize our contributions as follows: we present the first comprehensive evaluation of event reasoning at both the schema and instance levels of abstraction, across various relations and reasoning paradigms; we construct a benchmark, EV2, which features two levels of evaluation and is comprehensive in relations and reasoning paradigms; and we conduct extensive experiments to probe how LLMs perform event reasoning, from which we draw several insights.
Researcher Affiliation | Collaboration | 1 School of Computer Science, Peking University; 2 MoE Key Lab of High Confidence Software Technologies (PKU), China; 3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 4 MoE Key Lab of High Confidence Software Technologies (Hong Kong), China; 5 Beihang University; 6 Huawei Noah's Ark Lab. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | This process is executed for all nodes to gather components, as described in Algorithm 1 in the Appendix.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using existing LLMs (the GPT series, Mistral-7B, etc.) and refers to prompt details in the Appendix, but it offers no link to the authors' own implementation and does not explicitly state that their code is being released.
Open Datasets | No | In this paper, we comprehensively evaluate event reasoning in both knowledge and abilities. Since there are no existing datasets that are comprehensive in relations and paradigms and that cover both the schema and instance levels, we introduce a benchmark EV2 for the EValuation of EVent reasoning.
Dataset Splits | No | The paper provides counts for each task in the EV2 benchmark (e.g., "S-CEC 492", "I-CEC 491"), but it does not specify how these were split into training, validation, and test sets for the LLM evaluation experiments. It describes the total size of the constructed benchmark but not the partitioning strategy used in the experiments.
Hardware Specification | No | The paper mentions evaluating various LLMs, including GPT-4o, GPT-4, and GPT-3.5 (closed-source models accessed via official APIs) as well as several open-source models (e.g., Mistral-7B, Qwen2-7B). However, it does not provide any details about the hardware (e.g., GPU models, CPU types, memory) the authors used to run these evaluations or to host the open-source models.
Software Dependencies | No | The paper mentions evaluating specific LLMs (e.g., GPT-4o, Mistral-7B), using "all-mpnet-base-v2 for encoding", and using "GPT4 to generate the instance graph Gi." However, it does not list the ancillary software libraries, frameworks, or solvers with specific version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) that would be needed to replicate the experimental setup.
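The paper's reference to "all-mpnet-base-v2 for encoding" implies a standard sentence-embedding step. A minimal sketch of how such embeddings are typically compared, with the model loading shown only as a hypothetical usage of the sentence-transformers library (the paper does not confirm which library or version was used):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage with sentence-transformers (assumption, not from the paper):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-mpnet-base-v2")
# emb = model.encode(["the storm knocked out power", "electricity was restored"])
# score = cosine_similarity(emb[0], emb[1])

if __name__ == "__main__":
    # Toy vectors standing in for real 768-dim embeddings.
    u = np.array([1.0, 0.0, 1.0])
    v = np.array([0.0, 1.0, 0.0])
    print(round(cosine_similarity(u, u), 3))  # identical vectors -> 1.0
    print(round(cosine_similarity(u, v), 3))  # orthogonal vectors -> 0.0
```

Pinning the encoder name and library version in a requirements file is exactly the kind of dependency detail the report finds missing.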
Experiment Setup | No | The paper refers to "prompt details in the Appendix" but does not provide specific experimental setup details such as hyperparameter values (learning rate, batch size), model initialization, or training schedules in the main text. Since the experiments evaluate pre-trained LLMs, these training details may matter less than they would for newly trained models, but concrete decoding settings for evaluation (e.g., temperature, top-p) are also not specified.