Logical Consistency of Large Language Models in Fact-Checking

Authors: Bishwamittra Ghosh, Sarah Hasan, Naheed Anjum Arafat, Arijit Khan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries from knowledge graphs (KGs). Our contributions are threefold. Benchmark: We introduce three logical fact-checking datasets... Assessment: We propose consistency measures of LLMs... and demonstrate that existing LLMs lack logical consistency... Improvement: We employ supervised fine-tuning to improve the logical consistency of LLMs..." (Abstract); "We conduct experiments to evaluate the logical consistency of LLMs and the impact of supervised fine-tuning on improving consistency." (Section 5, Experimental Results)
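The propositional setup described in the abstract can be illustrated with a minimal sketch (the fact names and helper functions here are illustrative assumptions, not the authors' implementation): given ground-truth labels of simple facts retrieved from a KG, a compound query built from negation, conjunction, and disjunction has a compositionally determined answer, and a logically consistent LLM's verdicts must agree with those compositions.

```python
# Minimal sketch (illustrative, not the paper's code): ground-truth
# evaluation of propositional queries over simple-fact labels, serving
# as the reference against which LLM answers are checked for consistency.

def neg(p: bool) -> bool:
    """Negation of a simple fact's truth value."""
    return not p

def conj(p: bool, q: bool) -> bool:
    """Conjunction of two fact truth values."""
    return p and q

def disj(p: bool, q: bool) -> bool:
    """Disjunction of two fact truth values."""
    return p or q

# Hypothetical simple facts with ground-truth labels from a KG.
facts = {
    "bornIn(Einstein, Ulm)": True,
    "capitalOf(Paris, Germany)": False,
}

p = facts["bornIn(Einstein, Ulm)"]
q = facts["capitalOf(Paris, Germany)"]

# Ground-truth answers for compound queries; a logically consistent
# LLM must return these same verdicts.
answers = {
    "NOT q": neg(q),            # True
    "p AND NOT q": conj(p, neg(q)),  # True
    "NOT p OR q": disj(neg(p), q),   # False
}
```

A consistency measure then compares an LLM's answer to each compound query against the compositional answer derived from its own answers to the constituent simple facts.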
Researcher Affiliation | Academia | Bishwamittra Ghosh (Max Planck Institute for Software Systems, Germany), Sarah Hasan (Aalborg University, Denmark), Naheed Anjum Arafat (Independent Researcher, USA), Arijit Khan (Aalborg University, Denmark)
Pseudocode | Yes | Algorithm 1: LLM fact-checking with KG contexts (for simple facts); Algorithm 2: Supervised fine-tuning for logical consistency
Open Source Code | Yes | "We have made our source code and benchmarks available": https://github.com/bishwamittra/llm_logical_consistency
Open Datasets | Yes | "We consider three KG benchmarks: Freebase (FB15K), NELL, and a large-scale dataset from OGB: Wiki KG90Mv2 (Wiki) (Hu et al., 2021b). We obtain FB15K and NELL from the codebase of Query2Box (Ren et al., 2020)."; "We experiment with a widely known real-world fact-checking benchmark containing textual facts: FEVER (Fact Extraction and VERification) (Thorne et al., 2018)."
Dataset Splits | Yes | "In fine-tuning, we split each fact type into training, evaluation, and test sets having 1K, 5K, and 5K samples, respectively." (Table 5)
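The reported split sizes can be reproduced with a simple partition per fact type. The 1K/5K/5K sizes come from the paper (Table 5); the seeded shuffling scheme below is an assumption for illustration, not the authors' sampling procedure.

```python
import random

def split_fact_type(samples, n_train=1000, n_eval=5000, n_test=5000, seed=0):
    """Partition one fact type's samples into train/eval/test sets.

    Sizes (1K/5K/5K) follow the paper; shuffling with a fixed seed is
    an illustrative assumption.
    """
    assert len(samples) >= n_train + n_eval + n_test, "not enough samples"
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    eval_set = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:n_train + n_eval + n_test]
    return train, eval_set, test

# Usage on a dummy pool of 11,000 sample IDs.
train, eval_set, test = split_fact_type(range(11000))
```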
Hardware Specification | Yes | "Fine-tuning is conducted on a cluster with two NVIDIA A40 45 GB GPUs having Intel(R) Xeon(R) Gold 5317 CPU @ 3.00 GHz, 48 cores, and 1007 GB RAM. Inference is conducted on a cluster with two Tesla V100-PCIE 32 GB GPUs having Intel(R) Xeon(R) Gold 6134M CPU @ 3.20 GHz, 32 cores, and 755 GB RAM."
Software Dependencies | No | "All experiments are conducted in Python version 3.8.0. We adopted a parameter-efficient fine-tuning (PEFT) method based on QLoRA (Quantized Low Rank Adaptation) (Dettmers et al., 2024)... We adopt the vLLM library (Kwon et al., 2023)."
Experiment Setup | Yes | "In Llama2-7B and 13B, the hyperparameter choices of QLoRA are the following: learning rate 2 × 10^-5, weight decay 0.001, warm-up ratio 0.03, batch size 8, r = 16, alpha = 16, and dropout 0.1. In Gemma-2B, we consider learning rate 2 × 10^-6 and keep other hyper-parameters similar to Llama. For high-throughput and memory-efficient inference, we adopt the vLLM library (Kwon et al., 2023), and set temperature to 0 for a deterministic output. We limit the context length in the LLMQuery by selecting relevant triplets of around 1000 tokens." (Section F.1); "We fine-tune 20 epochs and save the intermediate models." (Section 5.1)
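The reported hyperparameters can be gathered into a config sketch for reference. The values below are the ones stated in Section F.1; the plain-dict layout is an illustrative assumption, not the authors' configuration format, and the actual runs use QLoRA via PEFT and vLLM, which are not reproduced here.

```python
# Fine-tuning hyperparameters as reported for Llama2-7B/13B (Section F.1).
# Dict layout is illustrative, not the authors' config format.
QLORA_LLAMA = {
    "learning_rate": 2e-5,
    "weight_decay": 0.001,
    "warmup_ratio": 0.03,
    "batch_size": 8,
    "lora_r": 16,        # LoRA rank
    "lora_alpha": 16,    # LoRA scaling factor
    "lora_dropout": 0.1,
}

# Gemma-2B uses a smaller learning rate; other values match Llama.
QLORA_GEMMA = {**QLORA_LLAMA, "learning_rate": 2e-6}

# Inference settings: deterministic decoding and a bounded KG context.
INFERENCE = {
    "temperature": 0.0,                    # deterministic output via vLLM
    "context_triplet_token_budget": 1000,  # approx. tokens of KG triplets
    "fine_tuning_epochs": 20,              # intermediate models are saved
}
```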