A Divide, Align and Conquer Strategy For Program Synthesis
Authors: Jonas Witt, Sebastijan Dumančić, Tias Guns, Claus-Christian Carbon
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate BEN, our implementation of DA&C using standard top-down enumerative synthesis, across two domains: abstract visual reasoning, used as a running example throughout the paper (Section 5.2), and string transformation tasks (Section 5.1). Table 4: Results on the training part of the Abstraction and Reasoning Corpus (ARC). Figure 8: Performance on the real-world string transformations data set. 5.2.2 Ablation Analysis |
| Researcher Affiliation | Academia | Jonas Witt EMAIL University of Bamberg, Markusplatz 3, 96047 Bamberg, Germany Sebastijan Dumančić EMAIL TU Delft, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands Tias Guns EMAIL KU Leuven, Celestijnenlaan 200a, 3001 Leuven, Belgium Claus-Christian Carbon EMAIL University of Bamberg, Markusplatz 3, 96047 Bamberg, Germany |
| Pseudocode | Yes | Algorithm 1 Overview Divide-Align & Conquer Algorithm 2 Rank Correspondences Algorithm 3 Constrained DNF learner |
| Open Source Code | Yes | Our code and an overview of the tasks solved by BEN are available on GitHub. |
| Open Datasets | Yes | We experiment with a publicly available data set of 130 real-world string transformation tasks from Cropper et al. (2020). Materials. We use the training part of the Abstraction and Reasoning Corpus (ARC) (Chollet, 2020) (Apache 2.0 license), which consists of 400 data sets that each contain 2–10 examples comprising an input and an output, where the outputs are generated by an unspecified program that we wish to synthesize. |
| Dataset Splits | Yes | Its predictive accuracy on tasks with three training examples is not significantly different from its predictive accuracy on tasks with nine training examples (Q1). The reason for this is that BEN only performs synthesis on component tuples rather than sets of training examples, so the complexity of a single synthesis loop is independent of the total number of training examples in a task. BEN solves over 80% of test examples given 2-3 training examples per task. |
| Hardware Specification | Yes | All experiments are run on a desktop with a single M1 Max CPU and 64GB of RAM. |
| Software Dependencies | No | The paper mentions a "Python implementation" for BEN and "C++" for the Kaggle winner, but it does not specify versions for these languages or name any specific libraries or solvers with version numbers. |
| Experiment Setup | Yes | The #1 Kaggle agent and BEN were both run at a search depth of four on all tasks. During the actual Kaggle competition, the #1 Kaggle agent ran as an ensemble at different search depths to optimize scheduling. We report more challenging results at a search depth of four. The ARGA agent does not expect a search depth as input; we use a time budget of 2 min. All agents were executed with a time budget of 60s per task. |