Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Our research reveals several key findings: Firstly, we find that MLLMs, initially performing near random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. |
| Researcher Affiliation | Collaboration | ESAT-PSI, KU Leuven; University of Science and Technology of China; Memory Tensor; Samsung R&D Institute China; Correspondence to: EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology describes rule-based reward systems and RL algorithms but not in a structured pseudocode format. |
| Open Source Code | Yes | The code is available at: https://github.com/zifuwanggg/Jigsaw-R1. |
| Open Datasets | Yes | COCO (Lin et al., 2014). This dataset serves as the foundation for training and evaluating jigsaw puzzles. We exclusively use the images and randomly generate the ground truth permutations. CV-Bench (Tong et al., 2024a). This benchmark repurposes standard vision datasets such as COCO with a multimodal context... MMVP (Tong et al., 2024b). Similar to CV-Bench, MMVP adapts classic vision datasets like ImageNet (Deng et al., 2009)... SAT (Ray et al., 2024). This synthetic dataset features indoor scenes... Super-CLEVR (Li et al., 2023). This is another synthetic dataset containing various vehicle models... |
| Dataset Splits | Yes | COCO (Lin et al., 2014)... For training, we employ the train2014 split, and for testing, we randomly select 1,000 images from the test2014 split. SAT (Ray et al., 2024)... For testing, we randomly sample 500 questions per task, yielding a total of 2,000 test questions. The remaining 96,924 questions constitute the training set. |
| Hardware Specification | Yes | We measure these training costs on a cluster of eight 64GB AMD MI250X GPUs. |
| Software Dependencies | No | The paper mentions using GRPO (Shao et al., 2024) as the reinforcement learning algorithm, but does not specify its version number or any other software dependencies with their respective versions. |
| Experiment Setup | Yes | We use GRPO (Shao et al., 2024) as the reinforcement learning algorithm. The GRPO iteration µ = 1, the KL coefficient β = 0.04, and the clipping value ε = 0.2. Given that thinking is substantially more computationally expensive, we perform 1,000 training steps for it, compared to 2,000 steps for non-thinking. In each training step, 64 unique prompts are processed, with each prompt being sampled 8 times to calculate the advantages. The sampling temperature is set to 1, and top-k sampling is used with k = 50. The learning rate initiates at 1e-6 and linearly decays to 0. As for SFT... Both configurations are trained for 1,000 steps, with a batch size of 512. |
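The GRPO setup quoted above (64 prompts per step, 8 samples per prompt to compute advantages) centers on a group-relative advantage: each sample's reward is normalized against the mean and standard deviation of its own prompt's group. The sketch below illustrates that computation under the paper's rule-based 0/1 rewards; the function name, the example reward values, and the `eps` stabilizer are illustrative assumptions, not the authors' code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO (Shao et al., 2024):
    normalize the G sampled rewards of one prompt by the group's
    mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One training step processes 64 unique prompts; each prompt is
# sampled G = 8 times. Rule-based rewards are 0 or 1 per sample.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = grpo_advantages(group_rewards)
```

Correct answers within a group receive positive advantages and incorrect ones negative, and the advantages of each group sum to (approximately) zero by construction.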