Argumentative Large Language Models for Explainable and Contestable Claim Verification

Authors: Gabriel Freedman, Adam Dejl, Deniz Gorur, Xiang Yin, Antonio Rago, Francesca Toni

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ArgLLMs' performance experimentally, in comparison with state-of-the-art techniques, in the context of the decision-making task of claim verification. We evaluate ArgLLMs' claim verification abilities by comparing four variants thereof with three baselines (two based on direct prompting, plus the chain-of-thought approach (Wei et al. 2022)) on three novel claim verification datasets adapted from existing datasets (TruthfulQA (Lin, Hilton, and Evans 2021), StrategyQA (Geva et al. 2021) and MedQA (Jin et al. 2020)). The evaluation shows that ArgLLMs deliver performance comparable to the baselines, with the added benefit of being faithfully explainable.
Researcher Affiliation | Academia | Department of Computing, Imperial College London, UK
Pseudocode | No | The paper describes the method using diagrams and textual explanations; for example, Figure 2 outlines the pipeline for ArgLLMs. However, there are no explicit sections or figures labelled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code and datasets: https://github.com/CLArg-group/argumentative-llms
Open Datasets | Yes | We focus on three claim verification datasets adapted from existing Q/A datasets: TruthfulClaim (adapted from TruthfulQA (Lin, Hilton, and Evans 2021)), StrategyClaim (adapted from StrategyQA (Geva et al. 2021)) and MedClaim (adapted from MedQA (Jin et al. 2020)).
Dataset Splits | Yes | For our experiments, we select 700 claims from TruthfulClaim and StrategyClaim (200 for the prompt selection experiments, as discussed later, and 500 for the main experiments), and 500 claims from the MedClaim dataset for the main experiments. All the datasets we use for our main experiments are balanced (i.e. 250 True and 250 False labels).
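The balanced split described above (250 True and 250 False labels per main-experiment dataset) can be sketched as follows. This is an illustrative reconstruction, not code from the ArgLLMs repository; the function and parameter names are assumptions.

```python
import random

def balanced_sample(claims, n_per_label=250, seed=0):
    """Select an equal number of True- and False-labelled claims.

    `claims` is a list of (text, label) pairs, with `label` a bool.
    Names and the seeding scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_pool = [c for c in claims if c[1]]
    false_pool = [c for c in claims if not c[1]]
    # Draw n_per_label claims from each pool, then shuffle their order.
    sample = rng.sample(true_pool, n_per_label) + rng.sample(false_pool, n_per_label)
    rng.shuffle(sample)
    return sample
```

With n_per_label=250 this yields a 500-claim evaluation set with exactly balanced labels, matching the splits the paper reports.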
Hardware Specification Yes All our experiments are executed with two RTX 4090 24GB GPUs on an Intel(R) Xeon(R) w5-2455X.
Software Dependencies | Yes | We use seven main models: Mistral (Mistral-7B-Instruct-v0.2) (Jiang et al. 2023), Mixtral (Mixtral-8x7B-Instruct-v0.1) (Jiang et al. 2024), Gemma (gemma-7b-it) (Mesnard et al. 2024), Gemma 2 (gemma-2-9b-it) (Riviere et al. 2024), Llama 3 (Meta-Llama-3-8B-Instruct) (Dubey et al. 2024), GPT-3.5-turbo (gpt-3.5-turbo-0125) (Brown et al. 2020) and GPT-4o mini (gpt-4o-mini) (OpenAI 2024). ... we quantise them to 4 bits (Dettmers et al. 2023)
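The 4-bit quantisation cited above (Dettmers et al. 2023) is commonly applied via the bitsandbytes integration in Hugging Face transformers. The configuration below is a sketch under that assumption; the paper does not specify its exact loading code, and the quantisation type and compute dtype shown are illustrative defaults.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantisation in the style of Dettmers et al. 2023 (QLoRA);
# the precise settings used in the paper are not stated.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Mistral-7B-Instruct-v0.2 is one of the seven models listed in the row above.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```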
Experiment Setup | Yes | For all the models, we use parameters: temperature 0.7, max new tokens for arguments 128, max new tokens for baselines 768, top-p 0.95 and repetition penalty 1.0. ... if the input claim's final strength is greater than 0.5 it is classified as true, and otherwise as false. ... For θ, we consider two options, Depth=1 and Depth=2: BAFs with Depth=1 are composed of the claim along with two generated arguments (a supporter and an attacker); in BAFs with Depth=2, we recursively generate a supporter and an attacker for each of the arguments in Depth=1, giving seven arguments in total.
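The final-strength classification described above can be sketched in a few lines, assuming a DF-QuAD-style gradual semantics for aggregating supporter and attacker strengths in the BAF (the choice of semantics here is an assumption for illustration, not a claim about the paper's exact configuration).

```python
def dfquad_aggregate(scores):
    """Combine child strengths: F(v1, ..., vn) = 1 - prod(1 - vi)."""
    prod = 1.0
    for v in scores:
        prod *= 1.0 - v
    return 1.0 - prod

def dfquad_strength(base, attackers=(), supporters=()):
    """DF-QuAD final strength of one argument.

    `base` is the intrinsic (e.g. LLM-estimated) score in [0, 1];
    `attackers`/`supporters` hold the final strengths of child arguments.
    """
    va = dfquad_aggregate(attackers)
    vs = dfquad_aggregate(supporters)
    if va >= vs:
        return base - base * (va - vs)
    return base + (1.0 - base) * (vs - va)

# Depth=1 BAF: the claim plus one generated supporter and one attacker.
# The numeric scores here are made up for illustration.
claim_base, supporter, attacker = 0.6, 0.8, 0.3
final = dfquad_strength(claim_base, attackers=[attacker], supporters=[supporter])
verdict = final > 0.5  # classified as true iff final strength exceeds 0.5
```

For Depth=2, `dfquad_strength` would first be applied to each of the claim's two children (each with its own supporter and attacker), and the resulting strengths fed into the claim-level call, covering all seven arguments.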