Logical Consistency of Large Language Models in Fact-Checking
Authors: Bishwamittra Ghosh, Sarah Hasan, Naheed Anjum Arafat, Arijit Khan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries from knowledge graphs (KGs). Our contributions are threefold. Benchmark: We introduce three logical fact-checking datasets... Assessment: We propose consistency measures of LLMs... and demonstrate that existing LLMs lack logical consistency... Improvement: We employ supervised fine-tuning to improve the logical consistency of LLMs... (Abstract) We conduct experiments to evaluate the logical consistency of LLMs and the impact of supervised fine-tuning on improving consistency. (Section 5, Experimental Results) |
| Researcher Affiliation | Academia | Bishwamittra Ghosh (Max Planck Institute for Software Systems, Germany), Sarah Hasan (Aalborg University, Denmark), Naheed Anjum Arafat (Independent Researcher, USA), Arijit Khan (Aalborg University, Denmark) |
| Pseudocode | Yes | Algorithm 1 LLM fact-checking with KG contexts (for simple facts); Algorithm 2 Supervised fine-tuning for logical consistency |
| Open Source Code | Yes | We have made our source code and benchmarks available1. 1https://github.com/bishwamittra/llm_logical_consistency |
| Open Datasets | Yes | We consider three KG benchmarks: Freebase (FB15K), NELL, and a large-scale dataset from OGB: Wiki KG90Mv2 (Wiki) (Hu et al., 2021b). We obtain FB15K and NELL from the codebase of Query2Box (Ren et al., 2020). We experiment with a widely known real-world fact-checking benchmark containing textual facts: FEVER (Fact Extraction and VERification) (Thorne et al., 2018). |
| Dataset Splits | Yes | In fine-tuning, we split each fact type into training, evaluation, and test sets having 1K, 5K, and 5K samples, respectively (Table 5). |
| Hardware Specification | Yes | Fine-tuning is conducted on a cluster with two NVIDIA A40 45 GB GPUs having Intel(R) Xeon(R) Gold 5317 CPU @ 3.00 GHz, 48 core and 1007 GB RAM. Inference is conducted on a cluster with two Tesla V100-PCIE 32 GB GPUs having Intel(R) Xeon(R) Gold 6134M CPU @ 3.20 GHz, 32 core, and 755 GB RAM. |
| Software Dependencies | No | All experiments are conducted in Python version 3.8.0. We adopted a parameter-efficient fine-tuning (PEFT) method based on QLoRA (Quantized Low Rank Adaptation) (Dettmers et al., 2024)... We adopt the vLLM library (Kwon et al., 2023). |
| Experiment Setup | Yes | In Llama2-7B and 13B, the hyperparameter choices of QLoRA are the following: learning rate 2×10⁻⁵, weight decay 0.001, warm-up ratio 0.03, batch size 8, r = 16, α = 16, and dropout 0.1. In Gemma-2B, we consider learning rate 2×10⁻⁶ and keep other hyper-parameters similar to Llama. For high-throughput and memory-efficient inference, we adopt the vLLM library (Kwon et al., 2023), and set temperature to 0 for a deterministic output. We limit the context length in the LLMQuery by selecting relevant triplets of around 1000 tokens. (Section F.1) We fine-tune 20 epochs and save the intermediate models. (Section 5.1) |
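The retrieval-augmented fact-checking loop described above (Algorithm 1: select relevant KG triplets up to a token budget, prompt the LLM, map its reply to a verdict) can be illustrated with a minimal sketch. All function names are hypothetical and the token count is approximated by whitespace splitting; this is not the paper's implementation.

```python
def retrieve_context(triplets, entities, max_tokens=1000):
    """Keep KG triplets that mention a query entity, stopping once the
    serialized context reaches roughly max_tokens (whitespace tokens)."""
    picked, used = [], 0
    for head, rel, tail in triplets:
        if head in entities or tail in entities:
            line = f"({head}, {rel}, {tail})"
            used += len(line.split())
            if used > max_tokens:
                break
            picked.append(line)
    return "\n".join(picked)

def fact_check(fact, entities, triplets, llm, max_tokens=1000):
    """Retrieval-augmented fact-checking: prompt the LLM with KG context
    and map its free-text reply to a boolean verdict."""
    context = retrieve_context(triplets, entities, max_tokens)
    prompt = f"Context:\n{context}\n\nAnswer True or False: {fact}"
    return llm(prompt).strip().lower().startswith("true")
```

Here `llm` is any callable from prompt string to reply string (e.g., a vLLM generation call with temperature 0, matching the deterministic setting quoted above).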
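The consistency measures the paper proposes check whether an LLM's verdicts on compound queries agree with propositional logic applied to its verdicts on the constituent facts. A minimal sketch of that idea (function names are hypothetical, not the paper's exact metrics):

```python
def negation_consistent(ans_q: bool, ans_not_q: bool) -> bool:
    """Consistent on negation if the verdict on NOT q complements
    the verdict on q."""
    return ans_not_q == (not ans_q)

def conjunction_consistent(ans_p: bool, ans_q: bool, ans_p_and_q: bool) -> bool:
    """Consistent on conjunction if the verdict on (p AND q) equals
    the conjunction of the individual verdicts."""
    return ans_p_and_q == (ans_p and ans_q)

def disjunction_consistent(ans_p: bool, ans_q: bool, ans_p_or_q: bool) -> bool:
    """Consistent on disjunction if the verdict on (p OR q) equals
    the disjunction of the individual verdicts."""
    return ans_p_or_q == (ans_p or ans_q)

def consistency_rate(checks):
    """Fraction of query instances on which the model is consistent."""
    return sum(checks) / len(checks)
```

Note that these measures compare the model against itself, so a model can be perfectly consistent while being factually wrong on every query.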
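The reported QLoRA fine-tuning setup could be expressed as a `peft`/`transformers` configuration fragment along these lines. The target modules, output path, and the assumption that the second "16" denotes `lora_alpha` are guesses not confirmed by the excerpt.

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                 # LoRA rank, as reported
    lora_alpha=16,        # assumed: second value reported alongside r
    lora_dropout=0.1,     # reported dropout
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qlora-llama2-7b",  # hypothetical path
    learning_rate=2e-5,            # Llama2-7B/13B; 2e-6 for Gemma-2B
    weight_decay=0.001,
    warmup_ratio=0.03,
    per_device_train_batch_size=8,
    num_train_epochs=20,           # "We fine-tune 20 epochs"
)
```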