PerSEval: Assessing Personalization in Text Summarizers
Authors: Sourish Dasgupta, Ankush Chander, Tanmoy Chakraborty, Parth Borad, Isha Motiyani
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that (i) PerSEval is reliable w.r.t. human-judgment correlation (Pearson's r = 0.73; Spearman's ρ = 0.62; Kendall's τ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need of any aggregated ranking. |
| Researcher Affiliation | Academia | (1) Dhirubhai Ambani Institute of Information & Communication Technology, India; (2) Indian Institute of Technology Delhi, India. Corresponding authors: EMAIL, EMAIL |
| Pseudocode | No | The paper provides mathematical formulations and definitions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/KDM-LAB/Perseval-TMLR |
| Open Datasets | Yes | Microsoft PENS Dataset (News Domain). Our study, as in (Vansh et al., 2023), assesses models using test data from the PENS dataset provided by Microsoft Research (Ao et al., 2021). OpenAI CNN/Daily Mail Dataset (News Domain). To understand the applicability of PerSEval on mainstream gold-standard news datasets, we design an indirect evaluation methodology with the OpenAI CNN/DM dataset (validation and test) released by Stiennon et al. (2020). OpenAI TL;DR (Reddit) Dataset (Open Domain). To understand the broader applicability of PerSEval, we also appropriated the OpenAI TL;DR dataset (Stiennon et al., 2020). This dataset is a collection of 123,169 Reddit posts adopted from the dataset by Völske et al. (2017). |
| Dataset Splits | Yes | Our study, as in (Vansh et al., 2023), assesses models using test data from the PENS dataset provided by Microsoft Research (Ao et al., 2021). OpenAI CNN/Daily Mail Dataset (News Domain). To understand the applicability of PerSEval on mainstream gold-standard news datasets, we design an indirect evaluation methodology with the OpenAI CNN/DM dataset (validation and test) released by Stiennon et al. (2020). A subset of the validation dataset comprises 1038 posts that were fed into 13 policies to generate 7713 summaries. |
| Hardware Specification | Yes | System specifications: Machine architecture: x86_64; CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz; CPU Cores: 16; Thread(s) per core: 2. |
| Software Dependencies | No | The paper mentions software components such as ROUGE, BLEU, METEOR, BERTScore, Jensen-Shannon Distance, and InfoLM, as well as BERT Base (uncased; 110M params) as a pre-trained Masked Language Model. However, it does not provide specific version numbers for any of these software libraries or tools. |
| Experiment Setup | Yes | PerSEval hyper-parameters: α = 3, β = 1.7 (optimal β; see 3), and γ = 4. An 11-point hyper-parameter ablation study shows that the optimal correlation is at β = 1.7. |
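The reliability row above cites three correlation statistics between PerSEval rankings and human judgments (Pearson's r, Spearman's ρ, Kendall's τ). A minimal pure-Python sketch of how these statistics are computed is given below; the score lists are hypothetical placeholders, not the paper's actual data.

```python
# Illustrative computation of the three correlation statistics reported
# for human-judgment agreement. All input scores here are made up.
from math import sqrt

def pearson(x, y):
    # Pearson's r: linear correlation of raw scores.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # 1-based rank of each value (assumes no ties, for simplicity).
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman's rho: Pearson's r applied to the rank-transformed scores.
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    # Kendall's tau: (concordant - discordant) pairs over all pairs.
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

human  = [0.91, 0.74, 0.68, 0.55, 0.49, 0.32]  # hypothetical human judgments
metric = [0.88, 0.80, 0.61, 0.59, 0.41, 0.38]  # hypothetical metric scores
print(round(pearson(human, metric), 2),
      round(spearman(human, metric), 2),
      round(kendall(human, metric), 2))
```

Since the two hypothetical score lists order the six systems identically, the rank-based statistics (ρ and τ) come out at 1.0 while Pearson's r stays slightly below; on the paper's real data the three values differ more (0.73 / 0.62 / 0.42) because rank agreement and linear agreement capture different aspects of reliability.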