EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents

Authors: Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) the event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) the recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information.
Researcher Affiliation | Academia | ¹Laboratory for Big Data and Decision, National University of Defense Technology; ²Department of Computer Science and Technology, Tsinghua University; ³College of Information and Communication, National University of Defense Technology
Pseudocode | No | The paper describes methods like 'Automatic Data Construction' and 'Human Annotation' in paragraph form and uses figures to illustrate processes, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code: https://github.com/Mzzzhu/EventSum
Open Datasets | Yes | We developed EventSum, the first large-scale Chinese multi-document summarization dataset, automatically constructed from Baidu Baike entries for this task study. ... All data utilized in this work are publicly available and freely accessible, with no inclusion of proprietary or restricted data.
Dataset Splits | Yes | Finally, we obtain 5,100 instances and split them into training, validation, and testing sets. In EventSum, each instance corresponds to a dynamic event. ... EventSum (Chinese): 4,015 / 500 / 585 (train/validation/test).
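The quoted split sizes can be cross-checked against the 5,100-instance total the paper reports. A minimal sanity check (the numbers come from the quote above; the code itself is ours, not the authors'):

```python
# Cross-check (not the authors' code): the reported train/validation/test
# sizes should sum to the 5,100 instances quoted from the paper.
splits = {"train": 4015, "validation": 500, "test": 585}

total = sum(splits.values())
print(total)  # 5100
assert total == 5100, "split sizes do not match the reported total"
```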
Hardware Specification | No | The paper mentions evaluating various LLMs and NLI models (e.g., glm-4-9b, chinese-roberta-wwm-ext) but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train these models.
Software Dependencies | No | The paper mentions using specific models like 'paraphrase-multilingual-mpnet-base-v2' from 'sentence-transformers' and 'glm-4-9b' for LLMs, and 'chinese-roberta-wwm-ext' for NLI. However, it does not provide specific version numbers for the 'sentence-transformers' library itself or for other software dependencies such as Python, PyTorch/TensorFlow, or CUDA.
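Because the paper names the libraries but not their versions, a reproduction attempt would need to pin versions itself. A hypothetical pinned install, with version numbers that are illustrative only (they are not stated anywhere in the paper):

```shell
# Hypothetical environment pin for reproduction; the exact versions
# below are placeholders, since the paper does not report any.
pip install "sentence-transformers==2.7.0" "transformers==4.41.0" "torch==2.3.0"
```

Recording such pins (e.g., in a requirements file) is what the "No" result above flags as missing.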
Experiment Setup | No | The paper states that the assessment was conducted under the "zero-shot setting" and lists the LLMs evaluated. However, it does not report specific hyperparameters (e.g., learning rate, batch size, number of epochs) or other system-level settings, as it primarily evaluates existing LLMs rather than training new ones from scratch or fine-tuning them.
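Even without reported hyperparameters, the zero-shot setup can be made concrete as a prompt-construction step. A hypothetical prompt builder for event-centric multi-document summarization; the paper does not publish its prompt wording, so the template below is illustrative only:

```python
# Hypothetical zero-shot prompt for event-centric multi-document
# summarization. The paper does not publish its prompt text, so
# this wording is an assumption for illustration.
def build_prompt(event_title, documents):
    joined = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        f"Summarize the event '{event_title}' based on the following "
        f"news documents. Cover all key sub-events.\n\n{joined}\n\nSummary:"
    )

prompt = build_prompt("Example Event", ["First report.", "Second report."])
print("[Document 2]" in prompt)  # True
```

In a zero-shot evaluation, this single instruction (with no in-context examples) would be sent to each long-context LLM under test.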