Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Our research reveals several key findings: Firstly, we find that MLLMs, initially performing near random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. |
| Researcher Affiliation | Collaboration | ESAT-PSI, KU Leuven; University of Science and Technology of China; Memory Tensor; Samsung R&D Institute China; Correspondence to: EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology describes rule-based reward systems and RL algorithms but not in a structured pseudocode format. |
| Open Source Code | Yes | The code is available at: https://github.com/zifuwanggg/Jigsaw-R1. |
| Open Datasets | Yes | COCO (Lin et al., 2014). This dataset serves as the foundation for training and evaluating jigsaw puzzles. We exclusively use the images and randomly generate the ground truth permutations. CV-Bench (Tong et al., 2024a). This benchmark repurposes standard vision datasets such as COCO with a multimodal context... MMVP (Tong et al., 2024b). Similar to CV-Bench, MMVP adapts classic vision datasets like ImageNet (Deng et al., 2009)... SAT (Ray et al., 2024). This synthetic dataset features indoor scenes... Super-CLEVR (Li et al., 2023). This is another synthetic dataset containing various vehicle models... |
| Dataset Splits | Yes | COCO (Lin et al., 2014)... For training, we employ the train2014 split, and for testing, we randomly select 1,000 images from the test2014 split. SAT (Ray et al., 2024)... For testing, we randomly sample 500 questions per task, yielding a total of 2,000 test questions. The remaining 96,924 questions constitute the training set. |
| Hardware Specification | Yes | We measure these training costs on a cluster of eight 64GB AMD MI250X GPUs. |
| Software Dependencies | No | The paper mentions using GRPO (Shao et al., 2024) as the reinforcement learning algorithm, but does not specify its version number or any other software dependencies with their respective versions. |
| Experiment Setup | Yes | We use GRPO (Shao et al., 2024) as the reinforcement learning algorithm. The GRPO iteration µ = 1, the KL coefficient β = 0.04, and the clipping value ε = 0.2. Given that thinking is substantially more computationally expensive, we perform 1,000 training steps for it, compared to 2,000 steps for non-thinking. In each training step, 64 unique prompts are processed, with each prompt being sampled 8 times to calculate the advantages. The sampling temperature is set to 1, and top-k sampling is used with k = 50. The learning rate initiates at 1e-6 and linearly decays to 0. As for SFT... Both configurations are trained for 1,000 steps, with a batch size of 512. |
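The GRPO setup quoted above (64 prompts per step, 8 samples per prompt to compute advantages) centers on a group-relative advantage: each sample's reward is normalized against the mean and standard deviation of its own prompt's group. The sketch below illustrates that computation under the paper's rule-based 0/1 rewards; the function name, the example reward values, and the `eps` stabilizer are illustrative assumptions, not the authors' code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO (Shao et al., 2024):
    normalize the G sampled rewards of one prompt by the group's
    mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One training step processes 64 unique prompts; each prompt is
# sampled G = 8 times. Rule-based rewards are 0 or 1 per sample.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = grpo_advantages(group_rewards)
```

Correct answers within a group receive positive advantages and incorrect ones negative, and the advantages of each group sum to (approximately) zero by construction.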