Compositional Scene Understanding through Inverse Generative Modeling
Authors: Yanbo Wang, Justin Dauwels, Yilun Du
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the scene understanding capabilities of our proposed approach across three different tasks. First, we consider a local factor perception task in Section 4.1, where the objective is to infer the center coordinates of objects. We next perform a global factor perception task to predict facial attributes from human faces in Section 4.2. Finally, we demonstrate how our approach can be adapted to pretrained models for zero-shot multi-object perception without any additional training. |
| Researcher Affiliation | Academia | ¹TU Delft, ²Harvard University. Correspondence to: Yanbo Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Training Algorithm ... Algorithm 2 Discrete Concept Inference Algorithm ... Algorithm 3 Continuous Concept Inference Algorithm ... Algorithm 4 Concept Number Inference Algorithm ... Algorithm 5 Gradient-based Discrete Concept Inference Algorithm ... Algorithm 6 Gradient-based Zero-Shot Perception Algorithm |
| Open Source Code | Yes | Code and visualizations are at https://energybased-model.github.io/compositional-inference. |
| Open Datasets | Yes | We evaluate our approach on the CLEVR dataset (Johnson et al., 2017)... We evaluate our approach on the CelebA dataset (Liu et al., 2015)... Additionally, we conducted additional evaluations on CLEVRTex (Karazija et al., 2021) |
| Dataset Splits | Yes | The training set consists of images containing 3-5 objects. To evaluate the generalization ability of our approach on out-of-distribution data, we consider two settings: (1) images from the CLEVR dataset containing 6-8 objects; (2) images from the CLEVRTex dataset containing 6-8 objects. ... The training set consists of only female faces labeled with these attributes, while the out-of-distribution test set comprises solely male faces. ... we manually collected a small real-world dataset consisting of 71 random realistic images from the Internet, each containing two animals from the set {dog, cat, rabbit}. Specifically, this dataset consists of 20 images containing a cat and a dog, 22 images containing a cat and a rabbit, and 29 images containing a dog and a rabbit. |
| Hardware Specification | Yes | We conducted a comparison of runtime performance between our method and baseline models on an NVIDIA H100 GPU. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers used for its implementation. |
| Experiment Setup | Yes | We train a conditional latent diffusion model with a latent space of 4 channels and resolution 8×8, which uses a pretrained VAE to encode input images into the latent space. The latent space image is scaled by a factor of 0.18215. The denoising network adopts the U-Net architecture (Ronneberger et al., 2015), as commonly used in diffusion models, taking the latent space image as input along with label conditioning and outputting noise predictions. Specifically, the input to the denoising network is of size 8×8, and the cross-attention dimension is 2 for object discovery (the object coordinates dimension is 2) and 6 for facial feature prediction (the one-hot encoding of facial attributes is of dimension 6). We use 1000 diffusion steps and a linear beta schedule for training. For other hyperparameters, we use a batch size of 128 and a learning rate of 2e-5. |
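The Experiment Setup entry can be made concrete with a minimal sketch of the forward-noising side of such a latent diffusion setup: 1000 steps, a linear beta schedule, a 4-channel 8×8 latent, and the standard 0.18215 VAE scaling factor. This is an illustrative NumPy sketch only, not the authors' implementation; the beta endpoints (1e-4, 0.02) and the random latent standing in for a VAE encoding are assumptions the paper does not specify.

```python
import numpy as np

# Diffusion schedule as quoted in the table: 1000 steps, linear betas.
# ASSUMPTION: the endpoints (1e-4, 0.02) are a common default, not stated in the paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)  # \bar{alpha}_t, monotonically decreasing

def forward_noise(z0, t, rng):
    """Sample z_t ~ q(z_t | z_0) for the forward diffusion process."""
    eps = rng.standard_normal(z0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
# Stand-in for a VAE-encoded image: 4 channels at 8x8 resolution,
# scaled by the 0.18215 factor mentioned in the setup.
z0 = 0.18215 * rng.standard_normal((4, 8, 8))
zt, eps = forward_noise(z0, 500, rng)
# A conditional U-Net would then be trained to predict eps from (zt, t, condition).
```

The denoising U-Net itself (with cross-attention dimension 2 or 6 for the label conditioning) is omitted; this sketch only shows the schedule and noising step that the training target is built from.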