Learning Symbolic Rules for Reasoning in Quasi-Natural Language

Authors: Kaiyu Yang, Jia Deng

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark our method on 3 tasks: learning compositional instructions, logical reasoning, and morphological analysis. For compositional instructions, our method not only achieves 100% accuracy on MiniSCAN (Lake et al., 2019) and SCAN (Lake & Baroni, 2018), but also recovers the ground truth rules. For logical reasoning, it achieves state-of-the-art performance on RuleTaker (Clark et al., 2020), including the noisy data paraphrased by crowd workers. For morphological analysis, it learns morphological rules from real-world linguistic data and is competitive with neural seq2seq models in some languages.
Researcher Affiliation | Academia | Kaiyu Yang (EMAIL), Department of Computer Science, Princeton University; Jia Deng (EMAIL), Department of Computer Science, Princeton University
Pseudocode | Yes | Algorithm 1: MetaInduce. Input: Training data D_train = {(A_i, g_i)}_{i=1}^n; A_i is the assumptions; g_i is the goal. Output: Model M consisting of a set of rules.
Open Source Code | Yes | The code is available at https://github.com/princeton-vl/MetaQNL.jl.
Open Datasets | Yes | We instantiate MetaQNL/MetaInduce on three tasks: learning compositional instructions on MiniSCAN (Lake et al., 2019)/SCAN (Lake & Baroni, 2018), logical reasoning on RuleTaker (Clark et al., 2020), and morphological analysis on SIGMORPHON 2018 (Cotterell et al., 2018).
Dataset Splits | Yes | For SCAN, we train only on the 400 shortest examples and test on four different splits: simple, length, addprim_jump, and addprim_turn_left. ... For each language, they sample a training set of 1K examples and three test sets of 100 examples each (FUT, PST, and OTHER).
Hardware Specification | Yes | On machines with 0 GPUs, 32GB RAM, and 4 CPUs, we run MetaInduce for 5 epochs on 10K training examples, which takes about 20 hours. ... Our experiments take 30 minutes to run on a laptop.
Software Dependencies | No | We use backward chaining as the prover and Z3 (De Moura & Bjørner, 2008) as the MAX-SAT solver. ... The soft matching network is implemented by finetuning a T5 model (Raffel et al., 2020). ... using the AdamW optimizer (Loshchilov & Hutter, 2019).
Experiment Setup | Yes | We use forward chaining as the prover and a depth limit of 7. The hyperparameters λ+ and λ− are tuned on validation data. ... We finetune the model with a learning rate of 10^-4 and a batch size of 32 using the AdamW optimizer (Loshchilov & Hutter, 2019).
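The MetaInduce pseudocode quoted above takes training pairs of (assumptions, goal) and returns a set of rules. The loop below is an illustrative skeleton only, not the authors' algorithm: the `propose_rules` helper is hypothetical, and the paper's actual candidate proposal and rule pruning are far more involved.

```python
# Illustrative MetaInduce-style skeleton (hypothetical helper logic).
# A rule is (frozenset of premises, conclusion).

def propose_rules(example):
    """Hypothetical proposal step: one rule per example, mapping its
    assumptions directly to its goal (no abstraction/generalization)."""
    assumptions, goal = example
    return {(frozenset(assumptions), goal)}

def meta_induce(train_data, epochs=1):
    """Return a set of rules induced from (assumptions, goal) pairs."""
    rules = set()
    for _ in range(epochs):
        for example in train_data:
            rules |= propose_rules(example)
    # The real algorithm would prune `rules` here (e.g. via MAX-SAT),
    # trading off proof coverage against model size.
    return rules

train = [(("run twice",), "RUN RUN"), (("jump",), "JUMP")]
print(meta_induce(train))
```

The skeleton keeps the same interface as the pseudocode (data in, rule set out), which is the part the quoted algorithm header pins down.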
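The software-dependencies row mentions Z3 as a MAX-SAT solver for rule selection. To show what kind of problem that solver answers, here is a toy brute-force weighted MAX-SAT over rule subsets; the objective (examples proved minus a size penalty) is a stand-in assumption, not the paper's actual encoding, and real use would call a solver like Z3 instead of enumerating subsets.

```python
from itertools import product

def proves(rule_subset, needed_rules):
    # Toy assumption: an example is "proved" iff every rule its
    # proof needs was kept in the subset.
    return needed_rules <= rule_subset

def select_rules(all_rules, examples, lam=0.5):
    """Brute-force weighted MAX-SAT: maximize (#examples proved)
    minus lam * (#rules kept)."""
    best, best_score = frozenset(), float("-inf")
    for bits in product([0, 1], repeat=len(all_rules)):
        subset = frozenset(r for r, b in zip(all_rules, bits) if b)
        score = sum(proves(subset, ex) for ex in examples) - lam * len(subset)
        if score > best_score:
            best, best_score = subset, score
    return best

rules = ["r1", "r2", "r3"]
examples = [{"r1"}, {"r1", "r2"}]   # rules each example's proof needs
print(select_rules(rules, examples))  # r3 is never needed, so it is dropped
```

Enumerating 2^n subsets is only viable for toy inputs, which is exactly why an off-the-shelf MAX-SAT solver is used in practice.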
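The experiment-setup row specifies forward chaining with a depth limit of 7. A minimal propositional version of that prover configuration can be sketched as follows; note the real MetaQNL prover works over quasi-natural-language sentences with variables, which this toy version omits.

```python
# Depth-limited forward chaining over propositional rules.
# Rules are (frozenset of premise facts, conclusion fact).

def forward_chain(facts, rules, max_depth=7):
    """Repeatedly fire rules whose premises are all known,
    for at most `max_depth` rounds; return all derived facts."""
    known = set(facts)
    for _ in range(max_depth):
        derived = {concl for prems, concl in rules
                   if prems <= known and concl not in known}
        if not derived:  # fixpoint reached before the depth limit
            break
        known |= derived
    return known

rules = [(frozenset({"a"}), "b"), (frozenset({"b"}), "c")]
print(sorted(forward_chain({"a"}, rules)))  # ['a', 'b', 'c']
```

With `max_depth=1` the same query would stop at `{'a', 'b'}`, which is the effect the depth limit of 7 has on longer proof chains.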