Combining Induction and Transduction for Abstract Reasoning
Authors: Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer Dunn, Hao Tang, Wei-Long Zheng, Yewen Pu, Kevin Ellis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study this question on ARC by training neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). We train on synthetically generated variations of Python programs that solve ARC training tasks. We find inductive and transductive models solve different kinds of test problems, despite having the same training problems and sharing the same neural architecture: Inductive program synthesis excels at precise computations, and at composing multiple concepts, while transduction succeeds on fuzzier perceptual concepts. Ensembling them approaches human-level performance on ARC. |
| Researcher Affiliation | Collaboration | 1 Cornell, 2 Shanghai Jiao Tong University, 4 Autodesk |
| Pseudocode | No | The paper describes methods and processes in natural language and provides Python code examples in the appendix, but it does not contain explicit pseudocode or algorithm blocks with structured, non-code-like steps. |
| Open Source Code | Yes | Our code, data, and model weights are freely available at https://github.com/xu3kev/BARC. |
| Open Datasets | Yes | Testing these neural methods requires a large dataset of function-learning problems, which is challenging to generate because not only must we make novel functions, but also good inputs to those functions. ... To address this challenge, we first generate a deterministic Python function for f, and then a probabilistic program for sampling inputs to f, finally executing those programs to produce input-outputs. ... Our code, data, and model weights are freely available at https://github.com/xu3kev/BARC. |
| Dataset Splits | Yes | We report performance on the 400-problem public validation split of ARC, which is harder than the training split. |
| Hardware Specification | Yes | per device batch size: 8; device: 8x A100; epochs: 3; weight decay: 0; learning rate scheduler type: cosine |
| Software Dependencies | Yes | Therefore the induction model must generate Python code, so we initialize our models with Llama3.1-8B-instruct (Dubey et al., 2024) because it was pretrained on source code. Our preliminary experiments suggested Llama3.1-8B-instruct was better than Mistral-7B-v0.3, Qwen2-7B-Instruct, and deepseek-coder-6.7b-instruct ... Unless otherwise mentioned, we create synthetic datasets with GPT-4o-mini and ada-002. |
| Experiment Setup | Yes | Fine-tuning hyperparameters: training type: lora finetune; lora rank: 64; lora alpha: 64; learning rate: 2e-4; gradient accumulation steps: 2; per device batch size: 8; device: 8x A100; epochs: 3; weight decay: 0; learning rate scheduler type: cosine ... Inference hyperparameters: temperature: 0.8 (1.0 for the full-data fine-tuned model); beam width: (1) engineer results: 40, (2) 100k data scale: 20, (3) all other experiment results: 3 |
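The ensembling idea quoted in the Research Type row (induced Python programs checked against the training pairs, with transduction as a fallback) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released implementation: the candidate-program interface, the all-training-pairs acceptance rule, and the toy grid encoding are assumptions for the sketch.

```python
def solve_with_ensemble(train_pairs, test_input, induced_programs, transduction_guess):
    """Return the output of the first induced program that reproduces
    every training input-output pair; otherwise fall back to the
    transduction model's direct prediction."""
    for program in induced_programs:
        try:
            if all(program(x) == y for x, y in train_pairs):
                return program(test_input)
        except Exception:
            continue  # a candidate program may crash on some inputs
    return transduction_guess

# Toy example: the latent function doubles every cell value.
train_pairs = [([1, 2], [2, 4]), ([3], [6])]
candidates = [
    lambda g: [v + 1 for v in g],  # inconsistent hypothesis, rejected
    lambda g: [v * 2 for v in g],  # consistent hypothesis, accepted
]
result = solve_with_ensemble(train_pairs, [5, 7], candidates, transduction_guess=[0])
# -> [10, 14]
```

The key design point, per the abstract, is that induction wins on precise, compositional computations (a program either fits all training pairs or it does not), while transduction covers fuzzier perceptual tasks where no candidate program survives the check.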