EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents
Authors: Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information. |
| Researcher Affiliation | Academia | 1. Laboratory for Big Data and Decision, National University of Defense Technology; 2. Department of Computer Science and Technology, Tsinghua University; 3. College of Information and Communication, National University of Defense Technology |
| Pseudocode | No | The paper describes methods like 'Automatic Data Construction' and 'Human Annotation' in paragraph form and uses figures to illustrate processes, but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code: https://github.com/Mzzzhu/EventSum |
| Open Datasets | Yes | We developed EventSum, the first large-scale Chinese multi-document summarization dataset, automatically constructed from Baidu Baike entries, for the study of this task. ... All data utilized in this work are publicly available and freely accessible, with no inclusion of proprietary or restricted data. |
| Dataset Splits | Yes | Finally, we obtain 5,100 instances and split them into training, validation, and testing sets. In EventSum, each instance corresponds to a dynamic event. ... EventSum (Chinese): 4,015 / 500 / 585 (train/validation/test) |
| Hardware Specification | No | The paper mentions evaluating various LLMs and NLI models (e.g., glm-4-9b, chinese-roberta-wwm-ext) but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train these models. |
| Software Dependencies | No | The paper mentions using specific models like 'paraphrase-multilingual-mpnet-base-v2' from 'sentence-transformers' and 'glm-4-9b' for LLMs, and 'chinese-roberta-wwm-ext' for NLI. However, it does not provide specific version numbers for the 'sentence-transformers' library itself or other software dependencies like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | No | The paper states that the assessment was conducted under the "zero-shot setting" and lists the LLMs evaluated. However, it does not report specific hyperparameters (e.g., learning rate, batch size, number of epochs) or other system-level training settings, since it primarily evaluates existing LLMs rather than training or fine-tuning new models. |
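The "recall metrics" row above refers to the paper's custom measures of whether a generated summary covers the key information of the reference event. The paper implements these with an NLI model (e.g., chinese-roberta-wwm-ext); the sketch below is only a toy stand-in that swaps the entailment check for naive substring matching, so the function name `fact_recall` and the pluggable `match` parameter are illustrative assumptions, not the paper's actual API.

```python
def fact_recall(reference_facts, generated_summary, match=None):
    """Fraction of reference facts covered by the generated summary.

    `match(fact, summary)` decides whether a single fact is covered.
    The default is naive substring matching -- a placeholder for the
    NLI entailment check used by EventSum's actual recall metrics.
    """
    if match is None:
        match = lambda fact, summary: fact in summary
    if not reference_facts:
        # No reference facts: vacuously full recall.
        return 1.0
    covered = sum(1 for fact in reference_facts if match(fact, generated_summary))
    return covered / len(reference_facts)


summary = "The flood struck on Monday; rescue teams evacuated 2,000 residents."
facts = ["rescue teams evacuated 2,000 residents", "the dam collapsed"]
print(fact_recall(facts, summary))  # 1 of 2 facts covered -> 0.5
```

A real implementation would replace the default `match` with an NLI model call that tests whether the summary entails each reference fact, which is why the check is exposed as a parameter here.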