A Comprehensive Evaluation on Event Reasoning of Large Language Models
Authors: Zhengwei Tao, Zhi Jin, Yifan Zhang, Xiancai Chen, Haiyan Zhao, Jia Li, Bin Liang, Chongyang Tao, Qun Liu, Kam-Fai Wong
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on EV2 to answer these questions. The results provide the following insights into event reasoning: 1) LLMs have the abilities of event reasoning but are far from satisfactory. 2) LLMs embed imbalanced abilities across different relations and reasoning paradigms. 3) LLMs have event schema knowledge, but they are not aligned with humans in how they leverage it. Based on these findings, we investigate guiding LLMs to utilize event schema knowledge. With this guidance, LLMs perform better event reasoning, which sheds light on modeling event knowledge as memory of LLMs to enhance event reasoning. We summarize our contributions as follows: We are the first to comprehensively evaluate event reasoning at both the schema and instance levels of abstraction, across various relations and reasoning paradigms. We construct a benchmark, EV2, which features two levels of evaluation and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments to probe how LLMs perform event reasoning and conclude several insights. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science, Peking University; 2 MoE Key Lab of High Confidence Software Technologies (PKU), China; 3 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 4 MoE Key Lab of High Confidence Software Technologies (Hong Kong), China; 5 Beihang University; 6 Huawei Noah's Ark Lab. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | This process is executed for all nodes to gather components, as described in Algorithm 1 in the Appendix. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using existing LLMs (GPT series, Mistral-7B, etc.) and refers to prompt details in the Appendix, but does not offer a link to their own implementation or explicitly state that their code is being released. |
| Open Datasets | No | In this paper, we comprehensively evaluate event reasoning in knowledge and abilities. Since there are no existing datasets that are comprehensive in relations and paradigms, and can cover both levels of schema and instance, we introduce a benchmark EV2 for the EValuation of EVent reasoning. |
| Dataset Splits | No | The paper provides counts for each task in the EV2 benchmark (e.g., "S-CEC 492", "I-CEC 491"), but it does not specify how these were split into training, validation, and test sets for the LLM evaluation experiments. It describes the total size of the constructed dataset/benchmark but not the partitioning strategy for experimentation. |
| Hardware Specification | No | The paper mentions evaluating various LLMs, including GPT-4o, GPT-4, GPT-3.5 (closed-source models via official APIs) and several open-source models (e.g., Mistral-7B, Qwen2-7B). However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used by the authors to run these evaluations or host the open-source models. |
| Software Dependencies | No | The paper mentions evaluating specific LLM models (e.g., GPT-4o, Mistral-7B) and using "all-mpnet-base-v2 for encoding" and "GPT-4 to generate the instance graph Gi." However, it does not list general ancillary software libraries, frameworks, or solvers with specific version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) that would be needed to replicate their experimental setup. |
| Experiment Setup | No | The paper mentions "prompt details in the Appendix" but does not provide specific experimental setup details such as hyperparameter values (learning rate, batch size), model initialization, or training schedules in the main text. The experiments involve evaluating pre-trained LLMs, so these details might be less relevant than for training new models, but concrete configuration values for evaluation (e.g., temperature, top-p) are also not specified in the main text. |
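To illustrate the kind of information the "Experiment Setup" row finds missing, here is a minimal Python sketch of a pinned decoding configuration for an LLM evaluation run. All names and values (`EVAL_CONFIG`, `build_request`, the specific temperature and seed) are hypothetical examples of what a reproducible setup would report, not values taken from the paper.

```python
# Hypothetical decoding settings that a reproducible LLM evaluation
# would document; the concrete values here are illustrative only.
EVAL_CONFIG = {
    "temperature": 0.0,   # greedy-style decoding for more deterministic answers
    "top_p": 1.0,         # no nucleus truncation
    "max_tokens": 256,    # cap on generated answer length
    "seed": 42,           # honored only if the serving backend supports seeding
}

def build_request(model: str, prompt: str, config: dict) -> dict:
    """Assemble a chat-style request payload with pinned decoding settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **config,
    }

request = build_request("gpt-4o", "Which event most plausibly follows?", EVAL_CONFIG)
```

Recording such a payload (or its equivalent CLI flags) alongside the benchmark would let others reproduce the evaluation even when model weights themselves are closed.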