Certified Deductive Reasoning with Language Models
Authors: Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah Goodman
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on the ProntoQA, ProofWriter, and Syllogism Validity datasets, LogicGuide significantly improves the performance of GPT-3, GPT-3.5 Turbo, and LLaMA (accuracy gains up to 35%), while drastically reducing content effects: the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. |
| Researcher Affiliation | Academia | Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. Goodman, Stanford University, EMAIL |
| Pseudocode | No | The paper describes algorithms like Constrained Semantic Decoding (CSD) but does not present them in a structured pseudocode or algorithm block within the document. It refers to prior work for CSD details. |
| Open Source Code | Yes | (Our code and data are available at https://github.com/gpoesia/certified-reasoning). |
| Open Datasets | Yes | We first use two recent natural language reasoning datasets: ProntoQA (Saparov & He, 2022) and ProofWriter (Tafjord et al., 2021). Both datasets contain reasoning problems with (1) a list of assumptions (e.g., "Every dog is a mammal", or "Sam is a dog"), and (2) a proposition that can be reasoned about from the assumptions (e.g., "Sam is a mammal?"). We use the problems from ProofWriter where the answer can be proved (i.e., ignoring the closed-world assumption and unknown problems, where fully justifying the answer requires meta-logical reasoning). ... We use two tasks to investigate this hypothesis. First, we contrast the results in the different ProntoQA ontologies. ... Second, we leverage the Syllogism Validity dataset (Dasgupta et al., 2022). ... Hence we created DeontiQA: a set of 60 new reasoning problems inspired by Deontic Logic (Von Wright, 1951). ... We detail the creation of DeontiQA in the Appendix, and make the dataset available along with our code. ... To that end, we consider ReClor (Yu et al., 2020), a dataset of logical reasoning problems taken from standardized exams (e.g., LSAT and GMAT, 4-way multiple choice), as well as the 6 tasks in LegalBench (Guha et al., 2023) related to Diversity Jurisdiction (binary choice: given facts about plaintiffs, defendants, and claims, determine whether the criteria for diversity jurisdiction are met). |
| Dataset Splits | Yes | We run 2 STaR iterations with LLaMA 13B on ProntoQA, where we attempt 200 random problems equally split between 1 and 5 hops, fine-tune on successful solutions, and evaluate on unseen problems. ... For bootstrapping, we use a random sample of 120 correct solutions from a mixture of ProofWriter and ProntoQA problems with 3+ hops, where the original model either used LogicGuide or not. |
| Hardware Specification | Yes | For LLaMA 13B, we ran and fine-tuned the model on an NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using 'OpenAI models', the 'OpenAI API', 'CSD', 'Peano', and the 'Adam8bit optimizer', but does not specify exact version numbers for these software components or for any other programming languages or libraries. |
| Experiment Setup | Yes | We fine-tuned for 1 epoch (i.e., seeing each example exactly once) with a batch size of 2 and a learning rate of 2e-5. We used the Adam8bit optimizer with default parameters, reset in each iteration of STaR. |
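The reported setup (1 epoch, batch size 2, learning rate 2e-5, Adam8bit reset per STaR iteration) can be sketched as a small configuration. This is a minimal sketch, not the authors' released code: the `bitsandbytes` dependency and the `make_optimizer` helper are illustrative assumptions.

```python
# Hedged sketch of the fine-tuning configuration quoted above.
# Assumption: "Adam8bit" refers to the 8-bit Adam implementation in
# bitsandbytes; the paper does not name the library or version.

FINETUNE_CONFIG = {
    "epochs": 1,              # each certified example is seen exactly once
    "batch_size": 2,
    "learning_rate": 2e-5,
    "optimizer": "Adam8bit",  # default parameters, per the paper
}

def make_optimizer(model_parameters, config=FINETUNE_CONFIG):
    """Build a fresh optimizer at the start of each STaR iteration,
    so optimizer state does not carry over between iterations
    (hypothetical helper, not from the paper's code)."""
    import bitsandbytes as bnb  # assumed dependency
    return bnb.optim.Adam8bit(model_parameters, lr=config["learning_rate"])
```

Resetting the optimizer each iteration matches the quoted setup: the model is fine-tuned anew on each round's certified self-generated solutions rather than resuming accumulated Adam moments.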