Logically Consistent Language Models via Neuro-Symbolic Integration
Authors: Diego Calanzone, Stefano Teso, Antonio Vergari
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show how, given incomplete factual knowledge (e.g., by providing only a limited number of known facts), the LLM can learn truth beliefs for new facts while keeping logical consistency w.r.t. prior knowledge. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically. Code available at https://github.com/ddidacus/loco-llm. In our experiments, with a single offline training session, LLMs trained with our objective outperform models relying on external solvers, and are more factual and logically consistent in low-data regimes when compared to standard supervised fine-tuning over KBs of facts. |
| Researcher Affiliation | Academia | Diego Calanzone, DISI, University of Trento, EMAIL; Stefano Teso, CIMeC & DISI, University of Trento, EMAIL; Antonio Vergari, School of Informatics, University of Edinburgh, EMAIL |
| Pseudocode | No | The paper includes a diagram in Figure 1 titled "Pipeline of our Logically Consistent (LoCo) LLMs" which illustrates the process, but it does not contain a structured pseudocode block or an algorithm formally labeled as such. |
| Open Source Code | Yes | Code available at https://github.com/ddidacus/loco-llm. |
| Open Datasets | Yes | We train LOCO-LMS on BeliefBank (Kassner et al., 2021). We use the three splits as in Mitchell et al. (Mitchell et al., 2022): a calibration set of 1,072 annotated facts about 7 entities of the form (subject, property, true/false) used for training, a silver set of 12,636 facts about 85 entities used for evaluation, and a set of 2,224 valid abstract logical implications. [...] For this purpose, the ConceptNet dataset (Speer et al., 2018b) is a rich source of knowledge about entity properties and relationships. [...] We evaluate LOCO-LMS on the EntailmentBank (Dalvi et al., 2022) test split, as proposed by Kassner et al. (2023) to reason on entailment trees. |
| Dataset Splits | Yes | We use the three splits as in Mitchell et al. (Mitchell et al., 2022): a calibration set of 1,072 annotated facts about 7 entities of the form (subject, property, true/false) used for training, a silver set of 12,636 facts about 85 entities used for evaluation, and a set of 2,224 valid abstract logical implications. [...] We use 90% and 10% of T1 facts for training and validation, respectively; T2 facts for testing. |
| Hardware Specification | Yes | We fine-tune our models for 3 epochs with a learning rate fixed to γ = 3·10⁻⁴, batch size 4 with gradient accumulation (64/16 steps), on one NVIDIA A30 24GB GPU. [...] we fine-tune our models for 5 epochs keeping the learning rate fixed to γ = 3·10⁻⁴, batch size 64, on 1 NVIDIA A100-40GB GPU. |
| Software Dependencies | No | The paper mentions several tools and models like "Macaw-Large (Tafjord & Clark, 2021)", "LLaMA-2 (Touvron et al., 2023)", "AdamW (Loshchilov & Hutter, 2016)", "LoRA (Hu et al., 2021)", and "PySDD (pys, 2017)", but it does not specify concrete version numbers for any of these software components or libraries, which is required for a reproducible description. |
| Experiment Setup | Yes | We fine-tune our models for 3 epochs with a learning rate fixed to γ = 3·10⁻⁴, batch size 4 with gradient accumulation (64/16 steps), on one NVIDIA A30 24GB GPU. We use AdamW (Loshchilov & Hutter, 2016) as optimizer with a default weight decay λ = 10⁻². [...] We limit the generation to 4 tokens following the input. We adopt a similar set of hyperparameters to LoRA: we fine-tune our models for 5 epochs keeping the learning rate fixed to γ = 3·10⁻⁴, batch size 64, on 1 NVIDIA A100-40GB GPU. We use AdamW (Loshchilov & Hutter, 2016) as optimizer with a default weight decay λ = 10⁻². [...] greedy sampling strategy, temperature t = 1.0 and dropout disabled. |
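The hyperparameters quoted in the table can be collected into a single configuration object. The sketch below is not the authors' code; it only restates the reported values (3 or 5 epochs, learning rate 3·10⁻⁴, AdamW weight decay 10⁻²) and checks one arithmetic reading of "batch size 4 with gradient accumulation (64/16 steps)": an effective batch of 64 reached via 16 accumulation steps. The class and field names are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): the fine-tuning hyperparameters
# reported in the reproducibility table, gathered into one config, with a
# check that micro-batch size x accumulation steps gives the effective
# batch size. The "(64/16 steps)" interpretation is an assumption.
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    epochs: int
    learning_rate: float
    micro_batch_size: int
    grad_accum_steps: int
    weight_decay: float = 1e-2  # AdamW default reported in the paper

    @property
    def effective_batch_size(self) -> int:
        # Gradients are accumulated over several micro-batches before
        # each optimizer step, so the effective batch is their product.
        return self.micro_batch_size * self.grad_accum_steps


# BeliefBank run: 3 epochs, lr 3e-4, micro-batch 4, 16 accumulation steps.
belief_bank = FinetuneConfig(epochs=3, learning_rate=3e-4,
                             micro_batch_size=4, grad_accum_steps=16)

# EntailmentBank run: 5 epochs, lr 3e-4, batch 64, no accumulation.
entailment_bank = FinetuneConfig(epochs=5, learning_rate=3e-4,
                                 micro_batch_size=64, grad_accum_steps=1)

print(belief_bank.effective_batch_size)  # 64
```

Under this reading, both runs use the same effective batch size of 64; only the per-step memory footprint differs between the A30 and A100 setups.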