Certified Deductive Reasoning with Language Models
Authors: Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah Goodman
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on the ProntoQA, ProofWriter, and Syllogism Validity datasets, LogicGuide significantly improves the performance of GPT-3, GPT-3.5 Turbo, and LLaMA (accuracy gains up to 35%), while drastically reducing content effects: the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. |
| Researcher Affiliation | Academia | Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. Goodman, Stanford University, EMAIL |
| Pseudocode | No | The paper describes algorithms like Constrained Semantic Decoding (CSD) but does not present them in a structured pseudocode or algorithm block within the document. It refers to prior work for CSD details. |
| Open Source Code | Yes | (Our code and data are available at https://github.com/gpoesia/certified-reasoning). |
| Open Datasets | Yes | We first use two recent natural language reasoning datasets: ProntoQA (Saparov & He, 2022) and ProofWriter (Tafjord et al., 2021). Both datasets contain reasoning problems with (1) a list of assumptions (e.g., "Every dog is a mammal", or "Sam is a dog"), and (2) a proposition that can be reasoned about from the assumptions (e.g., "Sam is a mammal?"). We use the problems from ProofWriter where the answer can be proved (i.e., ignoring the closed-world assumption and unknown problems, where fully justifying the answer requires meta-logical reasoning). ... We use two tasks to investigate this hypothesis. First, we contrast the results in the different ProntoQA ontologies. ... Second, we leverage the Syllogism Validity dataset (Dasgupta et al., 2022). ... Hence we created DeontiQA: a set of 60 new reasoning problems inspired by Deontic Logic (Von Wright, 1951). ... We detail the creation of DeontiQA in the Appendix, and make the dataset available along with our code. ... To that end, we consider ReClor (Yu et al., 2020), a dataset of logical reasoning problems taken from standardized exams (e.g., LSAT and GMAT, 4-way multiple choice), as well as the 6 tasks in LegalBench (Guha et al., 2023) related to Diversity Jurisdiction (binary choice: given facts about plaintiffs, defendants, and claims, determine whether the criteria for diversity jurisdiction are met). |
| Dataset Splits | Yes | We run 2 STaR iterations with LLaMA 13B on ProntoQA, where we attempt 200 random problems equally split between 1 and 5 hops, fine-tune on successful solutions, and evaluate on unseen problems. ... For bootstrapping, we use a random sample of 120 correct solutions from a mixture of ProofWriter and ProntoQA problems with 3+ hops, where the original model either used LogicGuide or not. |
| Hardware Specification | Yes | For LLaMA 13B, we ran and fine-tuned the model on an NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using 'OpenAI models', the 'OpenAI API', 'CSD', 'Peano', and the 'Adam8bit optimizer', but does not specify exact version numbers for these software components or for any other programming languages or libraries. |
| Experiment Setup | Yes | We fine-tuned for 1 epoch (i.e., seeing each example exactly once) with a batch size of 2 and a learning rate of 2e-5. We used the Adam8bit optimizer with default parameters, reset in each iteration of STaR. |
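The reported setup (1 epoch, batch size 2, learning rate 2e-5, Adam8bit reset per STaR iteration) can be sketched as a small configuration. This is a minimal sketch, not the authors' released code: the `bitsandbytes` dependency and the `make_optimizer` helper are illustrative assumptions.

```python
# Hedged sketch of the fine-tuning configuration quoted above.
# Assumption: "Adam8bit" refers to the 8-bit Adam implementation in
# bitsandbytes; the paper does not name the library or version.

FINETUNE_CONFIG = {
    "epochs": 1,              # each certified example is seen exactly once
    "batch_size": 2,
    "learning_rate": 2e-5,
    "optimizer": "Adam8bit",  # default parameters, per the paper
}

def make_optimizer(model_parameters, config=FINETUNE_CONFIG):
    """Build a fresh optimizer at the start of each STaR iteration,
    so optimizer state does not carry over between iterations
    (hypothetical helper, not from the paper's code)."""
    import bitsandbytes as bnb  # assumed dependency
    return bnb.optim.Adam8bit(model_parameters, lr=config["learning_rate"])
```

Resetting the optimizer each iteration matches the quoted setup: the model is fine-tuned anew on each round's certified self-generated solutions rather than resuming accumulated Adam moments.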