Are Large Language Models Fluent in Declarative Process Mining?
Authors: Valeria Fionda, Antonio Ielo, Francesco Ricca
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental evaluation is conducted on a synthetic benchmark dataset. The dataset aims to test the translation capabilities of LLMs under controlled complexity. ... We evaluate the performance of several state-of-the-art LLMs for translating between natural language and Declare specifications: Gemma2, Gemma2-27B, LLaMA3.1-8B, LLaMA3.1-70B, LLaMA3.2, LLaMA3.3, Mistral Nemo, Qwen2-72B, GPT4-Turbo and GPT4o. ... In this section, we analyze the translation capabilities of the considered LLMs via the specific metrics discussed in Section 4 (i.e., Constraint-based Similarity, Soundness, Completeness, Semantic Equivalence, and Trace-based Similarity) on the synthetic dataset discussed in the previous section. |
| Researcher Affiliation | Academia | Valeria Fionda¹, Antonio Ielo¹, Francesco Ricca¹ — ¹Department of Mathematics and Computer Science, University of Calabria, Italy. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed framework and methods using descriptive text and mathematical formulas (e.g., LTLp grammar, metric definitions) but does not include any clearly labeled pseudocode or algorithm blocks. The procedural steps are explained in prose. |
| Open Source Code | Yes | All data, prompts and code to reproduce the experiment is available in supplementary material. |
| Open Datasets | Yes | The experimental evaluation is conducted on a synthetic benchmark dataset. ... All data, prompts and code to reproduce the experiment is available in supplementary material. |
| Dataset Splits | Yes | For each pair of (n, m), we generated 15 satisfiable models (i.e., models admitting at least one satisfying execution trace) by randomly picking n Declare constraints and instantiating them with activities selected uniformly at random from the m available ones. We considered the following combinations of n and m: (5,3), (5,5), (5,8), (10,3), (10,5), (10,8), (15,5), (15,8), (15,10), (20,5), (20,8), (20,10). |
| Hardware Specification | No | The paper states that GPT-family LLMs were interacted with via OpenAI APIs and other LLMs were run locally using the ollama project, but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for these local runs or the API calls. Footnote 1: "All LLMs, except for the GPT-family, were run locally using the ollama project." |
| Software Dependencies | No | The paper mentions several software components like "Open AI APIs", "ollama project", "LTLf solver aaltaf [Li et al., 2020]", "Answer Set Programming (ASP)", and a "system for answer set model counting [Eiter et al., 2024]". However, it does not provide specific version numbers for any of these tools or libraries, which is a requirement for a reproducible description of ancillary software. |
| Experiment Setup | Yes | To inform the LLM regarding the syntax and semantics of Declare, we provide a fixed prompt to the LLM with a detailed list of Declare constraints, their associated semantics, and examples illustrating their application. ... For instance, the translation from Declare to natural language emphasizes precise, unambiguous English descriptions of constraints... Trace-based Similarity was computed using counterexamples of length up to 10. |
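The benchmark-generation procedure quoted under "Dataset Splits" (sample n Declare constraints, instantiate them with activities drawn uniformly at random from m activities, for each listed (n, m) pair) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the template list is a representative subset of standard Declare templates, and the satisfiability filter the paper applies (keeping only models admitting at least one satisfying trace, checked via an LTLf solver) is omitted here.

```python
import random

# A representative subset of standard Declare constraint templates;
# the paper's actual template pool may differ.
TEMPLATES = [
    "Existence", "Absence", "Response", "Precedence",
    "Succession", "ChainResponse", "NotCoExistence",
]

def generate_model(n, m, rng=random):
    """Sample n Declare constraints over m activities, chosen
    uniformly at random (satisfiability check omitted)."""
    activities = [chr(ord("a") + i) for i in range(m)]
    model = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        if template in ("Existence", "Absence"):
            # Unary templates take a single activity.
            model.append((template, rng.choice(activities)))
        else:
            # Binary templates take two distinct activities.
            a, b = rng.sample(activities, 2)
            model.append((template, a, b))
    return model

def generate_dataset(models_per_pair=15, rng=random):
    """Build the benchmark: 15 models per (n, m) combination listed in the paper."""
    pairs = [(5, 3), (5, 5), (5, 8), (10, 3), (10, 5), (10, 8),
             (15, 5), (15, 8), (15, 10), (20, 5), (20, 8), (20, 10)]
    return {pair: [generate_model(*pair, rng) for _ in range(models_per_pair)]
            for pair in pairs}
```

Note that without the satisfiability filter, a sampled model may be contradictory (e.g. pairing `Existence(a)` with `Absence(a)`); the paper discards such models before use.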
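The fixed prompt described under "Experiment Setup" must convey Declare constraint semantics, which are LTLf formulas evaluated over finite traces. A minimal checker for two common templates, purely illustrative and not taken from the paper, makes those semantics concrete:

```python
def holds_response(trace, a, b):
    """Response(a, b): every occurrence of a is eventually followed by b."""
    for i, act in enumerate(trace):
        if act == a and b not in trace[i + 1:]:
            return False
    return True

def holds_precedence(trace, a, b):
    """Precedence(a, b): b may occur only if a occurred before it."""
    for i, act in enumerate(trace):
        if act == b and a not in trace[:i]:
            return False
    return True
```

For example, the trace `["a", "c", "b"]` satisfies `Response(a, b)`, while `["a", "c"]` violates it. Checks of this kind, bounded to traces of length up to 10, underlie the Trace-based Similarity metric mentioned above.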