FFCI: A Framework for Interpretable Automatic Evaluation of Summarization
Authors: Fajri Koto, Timothy Baldwin, Jey Han Lau
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings. |
| Researcher Affiliation | Academia | Fajri Koto EMAIL Timothy Baldwin EMAIL Jey Han Lau EMAIL School of Computing and Information Systems The University of Melbourne Victoria 3010, Australia |
| Pseudocode | No | The paper describes methods like ROUGE, QAGS, BERTScore, STS-Score, and NSP score using mathematical formulas and descriptive text (e.g., in Sections 3 and 4), but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code used in this paper can be accessed at https://github.com/fajri91/ffci. |
| Open Datasets | Yes | To summarize, our contributions are: (1) we release an annotated dataset for evaluating focus, coverage, and inter-sentential coherence; ... Data and code used in this paper can be accessed at https://github.com/fajri91/ffci. We also utilize well-known public datasets such as CNN/Daily Mail (Hermann et al., 2015), XSUM (Narayan et al., 2018b), and the faithfulness dataset from Maynez et al. (2020). |
| Dataset Splits | Yes | First, we partition our data into training, development, and test splits based on a ratio of 80:10:10, respectively, and fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and maximum epochs = 20. We also used the test sets of CNN/Daily Mail and XSUM for evaluating focus and coverage. |
| Hardware Specification | No | This research was undertaken using the LIEF HPC-GPGPU Facility hosted at The University of Melbourne. This facility was established with the assistance of LIEF Grant LE170100200. This statement describes a general computing facility but lacks specific details such as exact GPU/CPU models or memory amounts. |
| Software Dependencies | No | The paper mentions several software components and libraries like ROUGE, METEOR, SacreBLEU (Post, 2018), Hugging Face, sentence-transformers (Reimers & Gurevych, 2019b), and spaCy. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | In preliminary experiments, we compared n ∈ {1, 2, 3} and found that n = 2 works best for ROUGE, and n = 3 works best for STS-Score and the pre-trained language model scores. For inter-sentential coherence, we fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and maximum epochs = 20. We simply use the [CLS] encoding as the input to an MLP layer. During training, we use early stopping (patience = 5) based on the development set performance. |
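The quoted setup combines an 80:10:10 data partition with early stopping (patience = 5) on development-set performance. A minimal sketch of those two pieces is below; it is not the authors' released code, the function names are illustrative, and `evaluate_epoch` stands in for one epoch of fine-tuning bert-base-uncased plus a dev evaluation:

```python
import random

def split_80_10_10(examples, seed=1):
    """Shuffle and partition examples into train/dev/test at 80:10:10."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_dev = int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

def train_with_early_stopping(evaluate_epoch, max_epochs=20, patience=5):
    """Run up to max_epochs; stop once the dev score fails to improve
    for `patience` consecutive epochs (the values quoted in the table)."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score

# Example: a hypothetical dev-score curve that plateaus after epoch 6.
scores = [0.5, 0.6, 0.65, 0.7, 0.72, 0.73, 0.74] + [0.74] * 13
best_epoch, best = train_with_early_stopping(lambda e: scores[e])
print(best_epoch, best)  # → 6 0.74 (training halts after epoch 11)
```

In practice the per-epoch evaluation would fine-tune the [CLS]-encoding-plus-MLP classifier described in the paper; the stopping logic itself is independent of the model.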