Spatial Reasoning with Denoising Models
Authors: Christopher Wewer, Bartlomiej Pogodzinski, Bernt Schiele, Jan Eric Lenssen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. ... To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. ... We evaluate SRMs for reasoning on three new benchmark datasets that we introduce in Sec. 4.1. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Informatics, Saarland Informatics Campus, Germany. Correspondence to: Christopher Wewer <EMAIL>, Bart Pogodzinski <EMAIL>, Jan Eric Lenssen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Recursive Sampling of a sum constrained Vector |
| Open Source Code | Yes | Our project website provides additional videos, code, and the benchmark datasets. ... Our framework, code, and benchmarks are available on our project website for further investigation and development. |
| Open Datasets | Yes | We introduce three different datasets to quantify reasoning capabilities. They are aimed at different aspects to be tested. The MNIST Sudoku dataset captures complex (NP-hard) dependencies that need to be understood. The Even Pixels dataset is an easier task that can be solved in a greedy fashion. Finally, we introduce the Counting Polygons / Stars FFHQ dataset, which moves closer to real-world images. ... Our project website provides additional videos, code, and the benchmark datasets. |
| Dataset Splits | No | For testing, we use a held-out dataset split of valid Sudokus and apply random masking of cells with the number of masked ones randomly sampled from the intervals [1, 27], [28, 54], and [55, 81], resulting in three levels of difficulty easy, medium, and hard, respectively. As metrics, we consider accuracy as well as the sum of L1-distances of row-, column-, and block-wise digit histograms to the all ones vector (zero if correct), averaged over all test examples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions general architectural choices like "2D UNets" and "Diffusion Transformers". |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions methods like "rectified flows (Liu et al., 2023)" but not the underlying software environment or library versions used for implementation. |
| Experiment Setup | Yes | Table 5: Hyperparameters used for all experiments. (MNIST Sudoku, Even Pixels, Counting Polygons/Stars FFHQ) Channels, Depth, Channel multipliers, Head channels, Attention resolution, Parameters, Effective batch size, Iterations, Learning Rate. |
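The paper's Algorithm 1 is cited above only by its title, so its exact procedure is not reproduced here. As a hedged illustration of the general idea of recursively sampling a sum-constrained vector, the sketch below (function name, uniform per-entry sampling, and integer bounds are our own assumptions, not the paper's method) draws each entry from the range that keeps the remaining entries feasible, then recurses on the residual sum:

```python
import random

def sample_sum_constrained(n, total, low=0, high=None):
    """Recursively sample an integer vector of length n with entries in
    [low, high] that sums exactly to `total`.

    Hypothetical sketch: each entry is drawn uniformly from its feasible
    range, so the distribution over vectors is not uniform in general.
    """
    if high is None:
        high = total
    if n == 1:
        # Base case: the last entry must absorb the remaining sum.
        assert low <= total <= high, "infeasible constraint"
        return [total]
    # Feasible range for the first entry so that the remaining n-1
    # entries can still reach the residual sum within [low, high].
    lo = max(low, total - (n - 1) * high)
    hi = min(high, total - (n - 1) * low)
    x = random.randint(lo, hi)
    return [x] + sample_sum_constrained(n - 1, total - x, low, high)

v = sample_sum_constrained(5, 12, low=0, high=9)
```

Any vector returned this way satisfies the sum constraint by construction, which is what makes the recursion terminate correctly.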
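The MNIST Sudoku metric quoted in the Dataset Splits row is described concretely enough to sketch: the sum of L1 distances of row-, column-, and block-wise digit histograms from the all-ones vector, which is zero exactly when the grid is a valid Sudoku fill. A minimal implementation (the function name and NumPy usage are our own; the paper does not specify its code) could be:

```python
import numpy as np

def sudoku_histogram_l1(grid):
    """Sum of L1 distances between the digit histograms of every row,
    column, and 3x3 block and the all-ones vector.

    A valid 9x9 Sudoku contains each digit 1-9 exactly once per unit,
    so every histogram equals the all-ones vector and the score is 0.
    """
    grid = np.asarray(grid)

    def hist_err(cells):
        # Histogram over digits 1..9; index 0 (unfilled) is dropped.
        counts = np.bincount(cells.ravel(), minlength=10)[1:10]
        return int(np.abs(counts - 1).sum())

    err = 0
    for i in range(9):
        err += hist_err(grid[i, :])            # row i
        err += hist_err(grid[:, i])            # column i
        r, c = 3 * (i // 3), 3 * (i % 3)
        err += hist_err(grid[r:r + 3, c:c + 3])  # block i
    return err

# A valid Sudoku built from cyclic row shifts scores exactly 0.
valid = (np.add.outer(np.array([0, 3, 6, 1, 4, 7, 2, 5, 8]),
                      np.arange(9)) % 9) + 1
```

Per the paper, this score is averaged over all test examples to complement plain accuracy, since it also quantifies how close an invalid completion is to being consistent.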