Argumentative Reasoning in ASPIC+ under Incomplete Information
Authors: Daphne Odekerken, Tuomo Lehtonen, Johannes P. Wallner, Matti Järvisalo
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions consist of a theoretical analysis of the complexity of deciding stability and relevance as well as first exact algorithms for reasoning about stability and relevance in incomplete ASPIC+ theories. ... Furthermore, we provide an open-source implementation of the algorithms, and show empirically that the implementation exhibits promising scalability on both real-world and synthetic data. |
| Researcher Affiliation | Academia | Authors Contact Information: Daphne Odekerken, orcid: 0000-0003-0285-0706, EMAIL, Department of Information and Computing Sciences, Utrecht University, and National Police Lab AI, Netherlands Police, Utrecht, The Netherlands; Tuomo Lehtonen, orcid: 0000-0001-6117-4854, EMAIL, Department of Computer Science, University of Helsinki, and Department of Computer Science, Aalto University, Helsinki, Finland; Johannes P. Wallner, orcid: 0000-0002-3051-1966, EMAIL, Institute of Software Engineering and Artificial Intelligence, Graz University of Technology, Graz, Austria; Matti Järvisalo, orcid: 0000-0003-2572-063X, EMAIL, Department of Computer Science, University of Helsinki, Helsinki, Finland. |
| Pseudocode | Yes | Our algorithmic approach to deciding whether a given queryable is 𝑗-relevant for a given literal is presented as Algorithm 1. ... Algorithm 2. ... We present two separate ASP encodings for deciding the justification status of literals: one (𝜋≤-just) taking rule preferences into account, the other (𝜋just) assuming that ≤ = ∅. ... Listing 1 Module 𝜋common ... Listing 2 Module Δ𝑗𝑢𝑠𝑡 ... Listing 3 Module Δ≤-𝑗𝑢𝑠𝑡 |
| Open Source Code | Yes | Furthermore, we provide an open-source implementation of the algorithms, and show empirically that the implementation exhibits promising scalability on both real-world and synthetic data. ... The implementation is available in open source at https://bitbucket.org/coreo-group/raspic2. |
| Open Datasets | No | For real-world benchmarks, we generated instances for the stability and relevance problems based on the argumentation system AS = (L, ¯, R, n) and set of queryables Q used in an inquiry system for the intake of online trade fraud at the Netherlands Police [38]. ... To further study the scalability of our implementations, we also consider synthetic data. For this, we generated argumentation theories and queryable sets that are parametrised by the size of the language |L| and rule set size |R|. Explanation: The paper uses instances generated from a real-world inquiry system as well as generated synthetic data, but it does not provide concrete access information (link, DOI, repository, or clear statement of public availability) for either the real-world or synthetic datasets. |
| Dataset Splits | No | To generate stability instances, we obtained knowledge bases by randomly sampling 25 consistent subsets of each size between 1 and 14 from Q, as well as the empty knowledge base. Similarly, instances for relevance were created for each combination of stability instances and a queryable in Q, randomly sampled from the set of queryables that are not axioms and whose contradictory is not an axiom. Explanation: The paper describes how instances were generated for benchmarks (random sampling of knowledge bases and selection of topic/queryable), but it does not specify traditional training/test/validation splits for a dataset. |
| Hardware Specification | Yes | All experiments were run on 2.50 GHz Intel Xeon Gold 6248 machines under a per-instance time limit of 600 seconds and memory limit of 32 GB. |
| Software Dependencies | Yes | We use Clingo [25, 23, 24] (version 5.5.1) as the ASP solver and its incremental (multi-shot) features [24] for implementing the CEGAR algorithms for relevance. |
| Experiment Setup | Yes | All experiments were run on 2.50 GHz Intel Xeon Gold 6248 machines under a per-instance time limit of 600 seconds and memory limit of 32 GB. ... For the language size (|L|), we generated instances for the stability instances with |L| ∈ {50, 100, 150, 200, 250, 500, 1000, 2500, 5000} and for the relevance instances with |L| ∈ {50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150}. The number of rules was chosen to be |R| ∈ {(1/2)|L|, |L|, (3/2)|L|}. The body size of rules was chosen to be between 1 and 5, with one third of the rules having one antecedent, another third having two antecedents, and the remaining third split equally to have three, four, or five antecedents. The literal layer distribution was selected by having (2/3)|L| literals with layer 0, one tenth of the literals each for layers 1, 2, and 3, and the remaining ones with layer 4. The ratio between queryables and literals (|Q|/|L|) is 0.5. The ratio between axioms and queryables (|K|/|Q|) is 0.5. |
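The synthetic-instance parameters quoted in the Experiment Setup row can be illustrated with a small generator sketch. This is a reconstruction under stated assumptions, not the authors' actual generator: the function name `generate_theory`, the uniform sampling of rule heads and bodies, and the way layers are assigned are all illustrative choices; only the numeric ratios (body-size distribution, layer distribution, |Q|/|L| = 0.5, |K|/|Q| = 0.5) come from the quoted text.

```python
import random

def generate_theory(n_literals, n_rules, seed=0):
    """Sketch of a synthetic ASPIC+ instance generator following the
    parameter choices quoted above (illustrative, not the paper's code)."""
    rng = random.Random(seed)
    literals = [f"l{i}" for i in range(n_literals)]

    # Layer distribution: 2/3 of literals at layer 0, one tenth each
    # at layers 1, 2 and 3, and the remainder at layer 4.
    shuffled = literals[:]
    rng.shuffle(shuffled)
    counts = [2 * n_literals // 3] + [n_literals // 10] * 3
    layers, idx = {}, 0
    for layer, count in enumerate(counts):
        for lit in shuffled[idx:idx + count]:
            layers[lit] = layer
        idx += count
    for lit in shuffled[idx:]:
        layers[lit] = 4

    # Body sizes: one third of rules have 1 antecedent, one third have 2,
    # the remaining third is split equally over sizes 3, 4 and 5.
    def body_size():
        r = rng.random()
        if r < 1 / 3:
            return 1
        if r < 2 / 3:
            return 2
        return rng.choice([3, 4, 5])

    # Rules as (body, head) pairs; heads and bodies sampled uniformly here,
    # which is an assumption of this sketch.
    rules = [(rng.sample(literals, body_size()), rng.choice(literals))
             for _ in range(n_rules)]

    # Ratios from the quoted setup: |Q|/|L| = 0.5 and |K|/|Q| = 0.5.
    queryables = rng.sample(literals, n_literals // 2)
    axioms = rng.sample(queryables, len(queryables) // 2)
    return {"literals": literals, "layers": layers, "rules": rules,
            "queryables": queryables, "axioms": axioms}
```

For example, `generate_theory(100, 150)` yields a theory with 100 literals, 150 rules with body sizes between 1 and 5, 50 queryables, and 25 axioms.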