Inductive Learning of Logical Theories with LLMs: An Expressivity-graded Analysis

Authors: João Pedro Gandarela de Souza, Danilo Carvalho, André Freitas

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results indicate that the largest LLMs can achieve competitive results against a state-of-the-art (SOTA) Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a harder obstacle for LLMs than theory complexity. The paper presents a systematic methodology for evaluating the inductive learning properties of LLMs in the context of logic theory induction.
Researcher Affiliation | Academia | 1 Idiap Research Institute; 2 National Biomarker Centre, CRUK-MI, University of Manchester; 3 Department of Computer Science, University of Manchester. {firstname.lastname}@idiap.ch, {firstname.lastname}@manchester.ac.uk
Pseudocode | Yes | Algorithm 1: Iterative LM theory refinement.
Open Source Code | No | The paper states: "Prompt templates used were included in the supplementary material1." and mentions "A reusable and extensible framework for extending and assessing the inductive capabilities of LLMs." However, it does not provide an explicit statement or a direct link to the source code for the methodology described in the paper.
Open Datasets | No | The paper states: "In order to generate datasets for rigorous analysis, this study employed the RuDaS tool (Cornelio and Thost 2021) to systematically vary parameters such as noise, open-world degree, and missing data." It cites the tool used to generate synthetic datasets but does not provide direct access information (link, DOI, repository) for the specific datasets used in the experiments.
Dataset Splits | No | The paper mentions: "The mean values reported are based on the results obtained from the train set and evaluated on the test set." This implies train/test splits, but exact percentages, sample counts, and the split methodology are not specified.
Hardware Specification | Yes | Experiments with Popper, Llama3-8B-Instruct, Gemma-7B-It, and Mixtral-8x7B-Instruct-v0.1 were conducted on a computer with an Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz, 188 GB RAM, and 2x NVIDIA RTX A6000 (48 GB VRAM) GPUs.
Software Dependencies | Yes | CUDA 12.3, PyTorch 2.2.2, and Transformers 4.41.2.
Experiment Setup | Yes | (1) employing Popper, with NuWLS (Chu, Cai, and Luo 2023) and WMaxCDCL, varying its time-limit parameter from 10 to 800 seconds; (2) applying the proposed iterative LM theory refinement method (Section "Proposed Approach"), with parameters Max_iter = 4 and MT_thresh = 1.0.
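The iterative refinement scheme named in the Pseudocode and Experiment Setup rows (Algorithm 1, with Max_iter = 4 and MT_thresh = 1.0) can be sketched as a generate-evaluate-refine loop. The function names `induce_theory`, `generate`, and `evaluate` below are hypothetical stand-ins, not the paper's actual interfaces: `generate` represents an LLM call producing a candidate theory, and `evaluate` scores it against the training examples.

```python
def induce_theory(examples, generate, evaluate, max_iter=4, mt_thresh=1.0):
    """Sketch of an iterative LM theory refinement loop.

    Repeatedly asks `generate` (a stand-in for an LLM prompt) for a
    candidate theory, scores it with `evaluate`, and feeds the previous
    attempt back as refinement context until the score reaches
    `mt_thresh` or `max_iter` iterations are exhausted.
    """
    best_theory, best_score = None, float("-inf")
    feedback = None
    for _ in range(max_iter):
        theory = generate(examples, feedback)   # LLM call (assumed interface)
        score = evaluate(theory, examples)      # e.g. fraction of examples covered
        if score > best_score:
            best_theory, best_score = theory, score
        if score >= mt_thresh:                  # MT_thresh: stop once good enough
            break
        feedback = theory                       # refine against the last attempt
    return best_theory, best_score


# Toy usage: dummy stand-ins whose score "improves" each iteration.
scores = iter([0.4, 0.7, 1.0])
theory, score = induce_theory(
    examples=[],
    generate=lambda ex, fb: "candidate theory",
    evaluate=lambda th, ex: next(scores),
)
```

With MT_thresh = 1.0 as in the paper's setup, the loop only terminates early on a perfect score; otherwise it returns the best candidate seen within Max_iter attempts.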