Are Large Language Models Fluent in Declarative Process Mining?
Authors: Valeria Fionda, Antonio Ielo, Francesco Ricca
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental evaluation is conducted on a synthetic benchmark dataset. The dataset aims to test the translation capabilities of LLMs under controlled complexity. ... We evaluate the performance of several state-of-the-art LLMs for translating between natural language and Declare specifications: Gemma2, Gemma2-27B, LLaMA3.1-8B, LLaMA3.1-70B, LLaMA3.2, LLaMA3.3, Mistral Nemo, Qwen2-72B, GPT4-Turbo and GPT4o. ... In this section, we analyze the translation capabilities of the considered LLMs via the specific metrics discussed in Section 4 (i.e., Constraint-based Similarity, Soundness, Completeness, Semantic Equivalence, and Trace-based Similarity) on the synthetic dataset discussed in the previous section. |
| Researcher Affiliation | Academia | Valeria Fionda¹, Antonio Ielo¹, Francesco Ricca¹ — ¹Department of Mathematics and Computer Science, University of Calabria, Italy. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed framework and methods using descriptive text and mathematical formulas (e.g., LTLp grammar, metric definitions) but does not include any clearly labeled pseudocode or algorithm blocks. The procedural steps are explained in prose. |
| Open Source Code | Yes | All data, prompts and code to reproduce the experiment is available in supplementary material. |
| Open Datasets | Yes | The experimental evaluation is conducted on a synthetic benchmark dataset. ... All data, prompts and code to reproduce the experiment is available in supplementary material. |
| Dataset Splits | Yes | For each pair of (n, m), we generated 15 satisfiable models (i.e., models admitting at least one satisfying execution trace) by randomly picking n Declare constraints and instantiating them with activities selected uniformly at random from the m available ones. We considered the following combinations of n and m: (5,3), (5,5), (5,8), (10,3), (10,5), (10,8), (15,5), (15,8), (15,10), (20,5), (20,8), (20,10). |
| Hardware Specification | No | The paper states that GPT-family LLMs were interacted with via OpenAI APIs and other LLMs were run locally using the ollama project, but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for these local runs or the API calls. Footnote 1: "All LLMs, except for the GPT-family, were run locally using the ollama project." |
| Software Dependencies | No | The paper mentions several software components like "Open AI APIs", "ollama project", "LTLf solver aaltaf [Li et al., 2020]", "Answer Set Programming (ASP)", and a "system for answer set model counting [Eiter et al., 2024]". However, it does not provide specific version numbers for any of these tools or libraries, which is a requirement for a reproducible description of ancillary software. |
| Experiment Setup | Yes | To inform the LLM regarding the syntax and semantics of Declare, we provide a fixed prompt to the LLM with a detailed list of Declare constraints, their associated semantics, and examples illustrating their application. ... For instance, the translation from Declare to natural language emphasizes precise, unambiguous English descriptions of constraints... Trace-based Similarity was computed using counterexamples of length up to 10. |
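The benchmark-generation procedure quoted under "Dataset Splits" (sample n Declare constraints, instantiate them with activities drawn uniformly at random from m activities, for each listed (n, m) pair) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the template list is a representative subset of standard Declare templates, and the satisfiability filter the paper applies (keeping only models admitting at least one satisfying trace, checked via an LTLf solver) is omitted here.

```python
import random

# A representative subset of standard Declare constraint templates;
# the paper's actual template pool may differ.
TEMPLATES = [
    "Existence", "Absence", "Response", "Precedence",
    "Succession", "ChainResponse", "NotCoExistence",
]

def generate_model(n, m, rng=random):
    """Sample n Declare constraints over m activities, chosen
    uniformly at random (satisfiability check omitted)."""
    activities = [chr(ord("a") + i) for i in range(m)]
    model = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        if template in ("Existence", "Absence"):
            # Unary templates take a single activity.
            model.append((template, rng.choice(activities)))
        else:
            # Binary templates take two distinct activities.
            a, b = rng.sample(activities, 2)
            model.append((template, a, b))
    return model

def generate_dataset(models_per_pair=15, rng=random):
    """Build the benchmark: 15 models per (n, m) combination listed in the paper."""
    pairs = [(5, 3), (5, 5), (5, 8), (10, 3), (10, 5), (10, 8),
             (15, 5), (15, 8), (15, 10), (20, 5), (20, 8), (20, 10)]
    return {pair: [generate_model(*pair, rng) for _ in range(models_per_pair)]
            for pair in pairs}
```

Note that without the satisfiability filter, a sampled model may be contradictory (e.g. pairing `Existence(a)` with `Absence(a)`); the paper discards such models before use.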
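The fixed prompt described under "Experiment Setup" must convey Declare constraint semantics, which are LTLf formulas evaluated over finite traces. A minimal checker for two common templates, purely illustrative and not taken from the paper, makes those semantics concrete:

```python
def holds_response(trace, a, b):
    """Response(a, b): every occurrence of a is eventually followed by b."""
    for i, act in enumerate(trace):
        if act == a and b not in trace[i + 1:]:
            return False
    return True

def holds_precedence(trace, a, b):
    """Precedence(a, b): b may occur only if a occurred before it."""
    for i, act in enumerate(trace):
        if act == b and a not in trace[:i]:
            return False
    return True
```

For example, the trace `["a", "c", "b"]` satisfies `Response(a, b)`, while `["a", "c"]` violates it. Checks of this kind, bounded to traces of length up to 10, underlie the Trace-based Similarity metric mentioned above.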