Compositional Scene Understanding through Inverse Generative Modeling
Authors: Yanbo Wang, Justin Dauwels, Yilun Du
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the scene understanding capabilities of our proposed approach across three different tasks. First, we consider a local factor perception task in Section 4.1, where the objective is to infer the center coordinates of objects. We next perform a global factor perception task to predict facial attributes from human faces in Section 4.2. Finally, we demonstrate how our approach can be adapted to pretrained models for zero-shot multi-object perception without any additional training. |
| Researcher Affiliation | Academia | ¹TU Delft, ²Harvard University. Correspondence to: Yanbo Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Training Algorithm ... Algorithm 2 Discrete Concept Inference Algorithm ... Algorithm 3 Continuous Concept Inference Algorithm ... Algorithm 4 Concept Number Inference Algorithm ... Algorithm 5 Gradient-based Discrete Concept Inference Algorithm ... Algorithm 6 Gradient-based Zero-Shot Perception Algorithm |
| Open Source Code | Yes | Code and visualizations are at https://energybased-model.github.io/compositional-inference. |
| Open Datasets | Yes | We evaluate our approach on the CLEVR dataset (Johnson et al., 2017)... We evaluate our approach on the CelebA dataset (Liu et al., 2015)... Additionally, we conducted additional evaluations on CLEVRTex (Karazija et al., 2021) |
| Dataset Splits | Yes | The training set consists of images containing 3-5 objects. To evaluate the generalization ability of our approach on out-of-distribution data, we consider two settings: (1) images from the CLEVR dataset containing 6-8 objects; (2) images from the CLEVRTex dataset containing 6-8 objects. ... The training set consists of only female faces labeled with these attributes, while the out-of-distribution test set comprises solely male faces. ... we manually collected a small real-world dataset consisting of 71 random realistic images from the Internet, each containing two animals from the set {dog, cat, rabbit}. Specifically, this dataset consists of 20 images containing a cat and a dog, 22 images containing a cat and a rabbit, and 29 images containing a dog and a rabbit. |
| Hardware Specification | Yes | We conducted a comparison of runtime performance between our method and baseline models on an NVIDIA H100 GPU. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers used for its implementation. |
| Experiment Setup | Yes | We train a conditional latent diffusion model with a latent space of 4 channels and resolution 8×8, which uses a pretrained VAE to encode input images into the latent space. The latent space image is scaled by a factor of 0.18215. The denoising network adopts the U-Net architecture (Ronneberger et al., 2015), as commonly used in diffusion models, taking the latent space image as input along with label conditioning and outputting noise predictions. Specifically, the input to the denoising network is of size 8×8, and the cross-attention dimension is 2 for object discovery (the object coordinates dimension is 2) and 6 for facial feature prediction (the one-hot encoding of facial attributes is of dimension 6). We use 1000 diffusion steps and a linear beta schedule for training. For other hyperparameters, we use a batch size of 128 and a learning rate of 2e-5. |
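The Experiment Setup entry can be made concrete with a minimal sketch of the forward-noising side of such a latent diffusion setup: 1000 steps, a linear beta schedule, a 4-channel 8×8 latent, and the standard 0.18215 VAE scaling factor. This is an illustrative NumPy sketch only, not the authors' implementation; the beta endpoints (1e-4, 0.02) and the random latent standing in for a VAE encoding are assumptions the paper does not specify.

```python
import numpy as np

# Diffusion schedule as quoted in the table: 1000 steps, linear betas.
# ASSUMPTION: the endpoints (1e-4, 0.02) are a common default, not stated in the paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)  # \bar{alpha}_t, monotonically decreasing

def forward_noise(z0, t, rng):
    """Sample z_t ~ q(z_t | z_0) for the forward diffusion process."""
    eps = rng.standard_normal(z0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
# Stand-in for a VAE-encoded image: 4 channels at 8x8 resolution,
# scaled by the 0.18215 factor mentioned in the setup.
z0 = 0.18215 * rng.standard_normal((4, 8, 8))
zt, eps = forward_noise(z0, 500, rng)
# A conditional U-Net would then be trained to predict eps from (zt, t, condition).
```

The denoising U-Net itself (with cross-attention dimension 2 or 6 for the label conditioning) is omitted; this sketch only shows the schedule and noising step that the training target is built from.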