FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

Authors: Fajri Koto, Timothy Baldwin, Jey Han Lau

JAIR 2022

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental. We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
Researcher Affiliation: Academia. Fajri Koto (EMAIL), Timothy Baldwin (EMAIL), Jey Han Lau (EMAIL), School of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.
Pseudocode: No. The paper describes methods such as ROUGE, QAGS, BERTScore, STS-Score, and the NSP score using mathematical formulas and descriptive text (e.g., in Sections 3 and 4), but does not present any formal pseudocode or algorithm blocks.
Open Source Code: Yes. Data and code used in this paper can be accessed at https://github.com/fajri91/ffci.
Open Datasets: Yes. To summarize, our contributions are: (1) we release an annotated dataset for evaluating focus, coverage, and inter-sentential coherence; ... Data and code used in this paper can be accessed at https://github.com/fajri91/ffci. The paper also uses well-known public datasets: CNN/Daily Mail (Hermann et al., 2015), XSUM (Narayan et al., 2018b), and the faithfulness dataset of Maynez et al. (2020).
Dataset Splits: Yes. First, we partition our data into training, development, and test splits in an 80:10:10 ratio, and fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and a maximum of 20 epochs. The test sets of CNN/Daily Mail and XSUM are also used for evaluating focus and coverage.
Hardware Specification: No. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at The University of Melbourne, which was established with the assistance of LIEF Grant LE170100200. This statement describes a general computing facility but gives no specifics such as exact GPU/CPU models or memory amounts.
Software Dependencies: No. The paper mentions several software components and libraries, including ROUGE, METEOR, SacreBLEU (Post, 2018), Hugging Face, sentence-transformers (Reimers & Gurevych, 2019b), and spaCy, but does not provide version numbers for any of them.
Experiment Setup: Yes. In preliminary experiments, we compared n ∈ {1, 2, 3} and found that n = 2 works best for ROUGE, and n = 3 works best for STS-Score and the pre-trained language model scores. For inter-sentential coherence, we fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and maximum epochs = 20. We simply use the [CLS] encoding as the input to an MLP layer. During training, we use early stopping (patience = 5) based on the development set performance.
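To make the n-gram setting under Experiment Setup concrete, here is a minimal ROUGE-n F1 sketch in pure Python. This is an illustrative reimplementation, not the paper's code; the function names are assumptions, and production work would use an established ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=2):
    """ROUGE-n F1: harmonic mean of n-gram precision and recall.

    Overlap is clipped per n-gram via Counter intersection, as in
    standard ROUGE. Whitespace tokenization keeps the sketch simple.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With n = 2 (the value the paper found best for ROUGE), an identical candidate and reference score 1.0, and partially overlapping sentences score proportionally lower.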
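The 80:10:10 partition mentioned under Dataset Splits can be sketched as follows. The helper name and the fixed shuffle seed are illustrative assumptions, not details from the paper.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle and partition examples into train/dev/test at 80:10:10."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder, so no example is dropped
    return train, dev, test
```

Putting the remainder into the test split guarantees the three parts cover the full dataset even when its size is not divisible by ten.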