System-1.x: Learning to Balance Fast and Slow Planning with Language Models

Authors: Swarnadeep Saha, Archiki Prasad, Justin Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments with two diverse planning tasks, Maze Navigation and Blocksworld, show that our System-1.x Planner outperforms a System-1 Planner, a System-2 Planner trained to approximate A* search, and also a symbolic planner (A* search), given a state exploration budget." (A budgeted-A* sketch follows the table.)
Researcher Affiliation | Academia | Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal (UNC Chapel Hill)
Pseudocode | Yes | "Algorithm 1: Training Data Generation for System-1.x Controller" (one possible reading of this algorithm is sketched after the table).
Open Source Code | Yes | Code available at https://github.com/swarnaHub/System-1.x
Open Datasets | Yes | "We are making our code and data available in the supplementary material to enable replication of our findings." "We randomly generate a balanced dataset of 4K planning problems (split into 3200/400/400 samples) with 5x5 mazes, 40% of the cells containing obstacles, and optimal plan lengths between 1 and 8. Following the data creation algorithm in Bohnet et al. (2024), we generate problems consisting of 4-7 blocks (without repetition)." (A maze-generation sketch follows the table.)
Dataset Splits | Yes | "We randomly generate a balanced dataset of 4K planning problems (split into 3200/400/400 samples) with 5x5 mazes..." "From there, we create a train/validation/test split of 3000/250/200 samples where the train and the validation split consist of samples with plan lengths 1-6 and the test split consists of samples with plan lengths 7-10."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for its experiments.
Software Dependencies | Yes | "We choose Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the base LLM and fine-tune all our components with LoRA (Hu et al., 2021) with a rank of 8 for a maximum of 3 epochs and a batch size of 4, resulting in three adapters for System-1, System-2, and the controller." (A LoRA configuration sketch follows the table.)
Experiment Setup | Yes | "We choose Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the base LLM and fine-tune all our components with LoRA (Hu et al., 2021) with a rank of 8 for a maximum of 3 epochs and a batch size of 4, resulting in three adapters for System-1, System-2, and the controller."
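For context on the baselines in the Research Type row, the sketch below shows what an A* search with a state exploration budget could look like on a grid maze. This is a minimal, hypothetical Python sketch, not the paper's implementation: the grid encoding (0 = free, 1 = obstacle), the Manhattan heuristic, and the budget semantics are illustrative assumptions.

```python
# Hypothetical budgeted A* on a grid maze (0 = free cell, 1 = obstacle).
# The budget caps how many states may be popped and expanded.
import heapq

def astar_with_budget(grid, start, goal, budget):
    """Return (path, states_explored); path is None if the budget runs
    out or the goal is unreachable."""
    def h(cell):  # Manhattan distance, admissible on 4-connected grids
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # (f, g, cell, path)
    seen, explored = set(), 0
    while frontier and explored < budget:
        _, g, cell, path = heapq.heappop(frontier)
        if cell in seen:
            continue
        seen.add(cell)
        explored += 1
        if cell == goal:
            return path, explored
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heapq.heappush(
                    frontier,
                    (g + 1 + h((nr, nc)), g + 1, (nr, nc), path + [(nr, nc)]))
    return None, explored
```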
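Algorithm 1 in the paper generates training data for the controller; the released code is the authoritative reference for its exact procedure. The sketch below is only one plausible, assumption-laden reading: each sub-goal's difficulty is scored by how many states the budgeted A* above explores, and the hardest fraction x (the hybridization factor) is delegated to System-2. The function name and the waypoint-based decomposition are hypothetical.

```python
# Hypothetical sketch of controller training-data generation: score each
# sub-goal by A* exploration cost, then delegate the hardest fraction x
# to System-2 and the rest to System-1. Reuses astar_with_budget above.
def make_controller_example(grid, waypoints, x, budget):
    subgoals = list(zip(waypoints, waypoints[1:]))
    difficulty = [astar_with_budget(grid, s, g, budget)[1]
                  for s, g in subgoals]
    n_hard = round(x * len(subgoals))  # hybridization factor x in [0, 1]
    hard = set(sorted(range(len(subgoals)),
                      key=lambda i: -difficulty[i])[:n_hard])
    return [(sg, "system-2" if i in hard else "system-1")
            for i, sg in enumerate(subgoals)]
```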
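The maze dataset described in the Open Datasets row (5x5 grids, 40% obstacle cells, optimal plan lengths 1-8, 3200/400/400 split) could be generated along the following lines. This is a hedged sketch under stated assumptions: rejection sampling is assumed, and the balancing by plan length mentioned in the paper is omitted for brevity.

```python
# Hypothetical maze-problem sampler: 5x5 grid, ~40% obstacles, keep only
# problems whose optimal plan length is 1-8 (length balancing omitted).
import random

def sample_maze_problem(size=5, obstacle_frac=0.4, max_len=8):
    while True:
        grid = [[1 if random.random() < obstacle_frac else 0
                 for _ in range(size)] for _ in range(size)]
        free = [(r, c) for r in range(size) for c in range(size)
                if grid[r][c] == 0]
        if len(free) < 2:
            continue
        start, goal = random.sample(free, 2)
        # A 5x5 grid has at most 25 states, so budget=size*size is full A*.
        path, _ = astar_with_budget(grid, start, goal, budget=size * size)
        if path is not None and 1 <= len(path) - 1 <= max_len:
            return grid, start, goal, path

problems = [sample_maze_problem() for _ in range(4000)]
train, val, test = problems[:3200], problems[3200:3600], problems[3600:]
```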
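Finally, the fine-tuning setup in the last two rows (Mistral-7B-Instruct-v0.2, LoRA rank 8, batch size 4, up to 3 epochs, one adapter each for System-1, System-2, and the controller) maps onto Hugging Face peft roughly as follows. The target modules and lora_alpha below are assumptions; the paper states only the rank, epochs, and batch size.

```python
# Minimal LoRA setup sketch with Hugging Face transformers + peft.
# Stated in the paper: base model, rank 8, batch size 4, <= 3 epochs.
# Assumed here: lora_alpha and which projection matrices to adapt.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                   # rank stated in the paper
    lora_alpha=16,                         # assumption, not stated
    target_modules=["q_proj", "v_proj"],   # assumption, not stated
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Train with a standard causal-LM loop (e.g., transformers.Trainer) for
# up to 3 epochs at a per-device batch size of 4; repeat per component
# to obtain the System-1, System-2, and controller adapters.
```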