FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

Authors: Fajri Koto, Timothy Baldwin, Jey Han Lau

JAIR 2022

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental. We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
Researcher Affiliation: Academia. Fajri Koto (EMAIL), Timothy Baldwin (EMAIL), Jey Han Lau (EMAIL), School of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.
Pseudocode: No. The paper describes methods such as ROUGE, QAGS, BERTScore, STS-Score, and the NSP score using mathematical formulas and descriptive text (e.g., in Sections 3 and 4), but does not present any formal pseudocode or algorithm blocks.
Open Source Code: Yes. Data and code used in this paper can be accessed at https://github.com/fajri91/ffci.
Open Datasets: Yes. To summarize, our contributions are: (1) we release an annotated dataset for evaluating focus, coverage, and inter-sentential coherence; ... Data and code used in this paper can be accessed at https://github.com/fajri91/ffci. The paper also uses well-known public datasets: CNN/Daily Mail (Hermann et al., 2015), XSUM (Narayan et al., 2018b), and the faithfulness dataset of Maynez et al. (2020).
Dataset Splits: Yes. First, we partition our data into training, development, and test splits in an 80:10:10 ratio, and fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and a maximum of 20 epochs. The test sets of CNN/Daily Mail and XSUM are also used for evaluating focus and coverage.
Hardware Specification: No. This research was undertaken using the LIEF HPC-GPGPU Facility hosted at The University of Melbourne, which was established with the assistance of LIEF Grant LE170100200. This statement describes a general computing facility but gives no specifics such as exact GPU/CPU models or memory amounts.
Software Dependencies: No. The paper mentions several software components and libraries, including ROUGE, METEOR, SacreBLEU (Post, 2018), Hugging Face, sentence-transformers (Reimers & Gurevych, 2019b), and spaCy, but does not provide version numbers for any of them.
Experiment Setup: Yes. In preliminary experiments, we compared n ∈ {1, 2, 3} and found that n = 2 works best for ROUGE, and n = 3 works best for STS-Score and the pre-trained language model scores. For inter-sentential coherence, we fine-tune bert-base-uncased with learning rate = 5e-5, batch size = 40, and maximum epochs = 20. We simply use the [CLS] encoding as the input to an MLP layer. During training, we use early stopping (patience = 5) based on the development set performance.
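To make the n-gram setting under Experiment Setup concrete, here is a minimal ROUGE-n F1 sketch in pure Python. This is an illustrative reimplementation, not the paper's code; the function names are assumptions, and production work would use an established ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=2):
    """ROUGE-n F1: harmonic mean of n-gram precision and recall.

    Overlap is clipped per n-gram via Counter intersection, as in
    standard ROUGE. Whitespace tokenization keeps the sketch simple.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With n = 2 (the value the paper found best for ROUGE), an identical candidate and reference score 1.0, and partially overlapping sentences score proportionally lower.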
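The 80:10:10 partition mentioned under Dataset Splits can be sketched as follows. The helper name and the fixed shuffle seed are illustrative assumptions, not details from the paper.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle and partition examples into train/dev/test at 80:10:10."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder, so no example is dropped
    return train, dev, test
```

Putting the remainder into the test split guarantees the three parts cover the full dataset even when its size is not divisible by ten.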