reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Hybrid Intelligence Method for Argument Mining

Authors: Michiel van der Meer, Enrico Liscio, Catholijn M. Jonker, Aske Plaat, Piek Vossen, Pradeep K. Murukannaiah

JAIR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate HyEnA on three citizen feedback corpora. We find that, on the one hand, HyEnA achieves higher coverage and precision than a state-of-the-art automated method when compared to a common set of diverse opinions, justifying the need for human insight. On the other hand, HyEnA requires less human effort and does not compromise quality compared to (fully manual) expert analysis, demonstrating the benefit of combining human and artificial intelligence.
Researcher Affiliation	Academia	Michiel van der Meer: Leiden Institute for Advanced Computer Science (LIACS), Leiden University; Enrico Liscio and Catholijn M. Jonker: Interactive Intelligence (II), Delft University of Technology; Aske Plaat: Leiden Institute for Advanced Computer Science (LIACS), Leiden University; Piek Vossen: Computational Linguistics & Text Mining Lab (CLTL), Vrije Universiteit Amsterdam; Pradeep K. Murukannaiah: Interactive Intelligence (II), Delft University of Technology
Pseudocode	No	The paper describes methods and processes in detail, often with figures (e.g., Figure 2: Overview of the HyEnA method) and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code	Yes	We also provide our code, annotation guidelines, and experimental details in the supplementary materials (van der Meer et al., 2024a).
Open Datasets	Yes	Our opinion corpora are composed of citizens feedback on COVID-19 relaxation measures, a contemporary topic. The feedback was gathered in April and May 2020 using the Participatory Value Evaluation (PVE) method (Mouter et al., 2021). ... Since we use data from a publicly run citizen feedback experiment, we observe that some options attracted more pro comments than others.
Dataset Splits	No	In the first phase of HyEnA, human annotators extract individual key argument lists by analyzing the opinion corpus. ... In each corpus, five annotators annotated 51 opinions each, for a total of 255 opinions per corpus. Of the 51 opinions, the first is selected randomly, and the following 50 are selected by FFT. This number of opinions was empirically selected to make the annotation feasible within a maximum of one hour.
Hardware Specification	No	The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running its experiments or models. It mentions using various models (S-BERT, BERTopic, ChatGPT, Llama) but not the underlying hardware.
Software Dependencies	No	The paper mentions using S-BERT (Reimers & Gurevych, 2019), Huggingface Model Hub, BERTopic (Grootendorst, 2022), Microsoft Azure Translation service, ChatGPT (Ouyang et al., 2022), and Llama (Touvron et al., 2023). However, it does not specify explicit version numbers for these software components or any other libraries used for the implementation.
Experiment Setup	Yes	In the first phase of HyEnA, each annotator extracts a key arguments list from an opinion corpus. In each corpus, five annotators annotated 51 opinions each, for a total of 255 opinions per corpus. Of the 51 opinions, the first is selected randomly, and the following 50 are selected by FFT. ... We instantiate the S-BERT model MS using the Huggingface Model Hub. ... We train a BERTopic model on each opinion corpus, generating 59, 56, and 72 topics for the young, immune, and reopen corpora, respectively. ... We experiment with two well-known graph clustering algorithms: (1) Louvain clustering (Blondel et al., 2008) uses network modularity to identify groups of vertices based on a resolution parameter r. (2) Self-tuning spectral clustering (Zelnik-Manor & Perona, 2004) uses dimensionality reduction in combination with k-means to obtain clusters, where k is the desired number of clusters. We select the parameters of these algorithms to minimize the error metric E shown in Eq. 3. ... Prompt 1 (ChatGPT): "Consider the context of the COVID-19 pandemic and the following arguments: Argument 1 ... Argument k. Write a key argument that summarizes the above arguments, and make it short and concise." Prompt 2 (Llama): "Consider the context of the COVID-19 pandemic and the following arguments: Argument 1 ... Argument k. A short and concise key argument that summarizes the above arguments is:"
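
The Experiment Setup row states that each annotator's first opinion is chosen at random and the remaining 50 are chosen by FFT. The following is a minimal sketch of that selection step, assuming FFT denotes farthest-first traversal over opinion embeddings. The function name `farthest_first_traversal`, the Euclidean distance, and the toy 2-D vectors are illustrative assumptions, not details taken from the paper or its released code.

```python
import math
import random


def farthest_first_traversal(points, n_select, seed=0):
    """Pick n_select indices: one at random, the rest by farthest-first traversal.

    `points` is a list of equal-length numeric vectors (hypothetical stand-ins
    for opinion embeddings). Euclidean distance is an assumption.
    """
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")

    rng = random.Random(seed)
    first = rng.randrange(len(points))  # the first item is chosen randomly
    selected = [first]

    # min_dist[i] = distance from point i to its nearest already-selected point
    min_dist = [math.dist(p, points[first]) for p in points]
    min_dist[first] = -1.0  # mark as selected so it is never picked again

    while len(selected) < n_select:
        # The next pick is the point farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
        min_dist[nxt] = -1.0

    return selected


if __name__ == "__main__":
    # Toy data: 255 random 2-D vectors, mirroring the corpus size in the table.
    rng = random.Random(42)
    toy = [(rng.random(), rng.random()) for _ in range(255)]
    picks = farthest_first_traversal(toy, n_select=51, seed=1)
    print(len(picks), len(set(picks)))
```

With `n_select=51` this reproduces the "1 random + 50 by FFT" structure described in the table; how the 255 opinions are divided among the five annotators is not specified in the quoted text and is not modeled here.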