CraftFactory: A Conditioned Control Policy Benchmark for Compositional Generalization

Authors: Jinbing Hou, Youpeng Zhao, Jian Zhao

AAAI 2025

Reproducibility assessment — each entry below lists the variable, the result, and the LLM's response:
Research Type: Experimental
"To address this gap, we propose CraftFactory, a benchmark designed for evaluating compositional generalization in an interactive control environment. This benchmark introduces a new challenge for testing compositional generalization in a more realistic and comprehensive manner. By leveraging CraftFactory, we aim to promote the development of more advanced compositional generalization methods, thereby contributing to the broader field of general AI. We conducted experiments using our method alongside three popular compositional generalization approaches. The results (see Table 2) indicate that all four methods, including ours, have significant room for improvement."
Researcher Affiliation: Industry
Polixir Technologies, Nanjing, China (EMAIL, EMAIL, EMAIL)
Pseudocode: No
The paper describes methodologies and processes through textual descriptions and mathematical formulations (e.g., Equations 1-8) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes
Code: https://github.com/Aubing-H/craftfactory
Open Datasets: Yes
CraftFactory builds upon the MineRL workbench crafting scenario (Guss et al. 2019), providing a vision-based, interactive, and open-ended environment for AI research.
Dataset Splits: No
The paper states: "For training, approximately 100 trajectories were selected for each task. For testing, we introduced one or two novel test cases." However, it does not provide exact percentages or sample counts for the training, validation, and test sets, nor does it refer to predefined splits in a way that would allow for reproducible data partitioning.
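To illustrate what reproducible data partitioning would look like, here is a minimal, purely hypothetical sketch of a seeded trajectory split; the fractions and seed are assumptions, since the paper specifies neither:

```python
import random

def split_trajectories(trajectories, train_frac=0.9, seed=42):
    """Deterministically partition trajectories into train/test sets.

    The train fraction and seed here are hypothetical; the paper only
    states that ~100 trajectories per task were used for training.
    """
    rng = random.Random(seed)      # fixed seed -> same split every run
    shuffled = list(trajectories)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_trajectories(range(100))
print(len(train), len(test))  # 90 10
```

Publishing such a seed and split function (or the resulting index lists) is all that would be needed to make the partitioning reproducible.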
Hardware Specification: No
The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies: No
The paper mentions using a "VPT (Video Pre-Train) backbone" (Baker et al. 2022) and a "FiLM (Feature-wise Linear Modulation) conditioned layer" (Perez et al. 2018; Cai et al. 2023a), which are architectural components or methods. However, it does not specify any software libraries with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be needed to replicate the experiments.
Experiment Setup: No
The paper mentions that sequences are padded to a uniform length of 10 and that embeddings are transformed into a 512-dimensional embedding. It also states: "For training, approximately 100 trajectories were selected for each task." However, crucial hyperparameters such as the learning rate, batch size, optimizer, number of training epochs, and training schedule are not provided.
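The two setup details the paper does disclose, padding to a uniform length of 10 and a 512-dimensional embedding, can be sketched as follows. This is a minimal illustration using a random lookup table; the vocabulary size, pad token, and table initialization are hypothetical, since the paper does not specify them:

```python
import random

MAX_LEN = 10     # sequences padded to a uniform length of 10
EMBED_DIM = 512  # embeddings transformed into a 512-length embedding

def pad_sequence(tokens, pad_token=0, max_len=MAX_LEN):
    """Right-pad (or truncate) a token sequence to a fixed length."""
    return (list(tokens) + [pad_token] * max_len)[:max_len]

# Hypothetical embedding table: one 512-dim vector per token id.
random.seed(0)
VOCAB_SIZE = 100  # assumed; not stated in the paper
table = [[random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up a 512-dim vector for each (padded) token id."""
    return [table[t] for t in token_ids]

padded = pad_sequence([5, 8, 2])
vectors = embed(padded)
print(len(padded), len(vectors), len(vectors[0]))  # 10 10 512
```

Without the missing hyperparameters (learning rate, batch size, optimizer, epochs), these shapes are the only part of the training pipeline a reader could reconstruct from the paper alone.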