Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models

Authors: Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model outperforms existing methods in grounded scene graph generation on the Visual Genome and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: (1) achieving superior results in a range of grounded scene graph completion tasks, and (2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.
Researcher Affiliation | Academia | Bicheng Xu (EMAIL), University of British Columbia; Vector Institute for AI. Qi Yan (EMAIL), University of British Columbia; Vector Institute for AI. Renjie Liao (EMAIL), University of British Columbia; Vector Institute for AI; Canada CIFAR AI Chair. Lele Wang (EMAIL), University of British Columbia. Leonid Sigal (EMAIL), University of British Columbia; Vector Institute for AI; Canada CIFAR AI Chair.
Pseudocode | Yes | Algorithm 1: DiffuseSG Training Process. Algorithm 2: DiffuseSG Sampler.
Open Source Code | Yes | Code is available at https://github.com/ubc-vision/DiffuseSG.
Open Datasets | Yes | We conduct all experiments on the Visual Genome (Krishna et al., 2017) and COCO-Stuff (Caesar et al., 2018) datasets.
Dataset Splits | Yes | This pre-processed dataset contains 57,723 training and 5,000 validation grounded scene graphs with 150 object and 50 relation categories. ... resulting in 118,262 training and 4,999 validation grounded scene graphs.
Hardware Specification | No | The paper acknowledges general providers of computational resources (the Province of Ontario, the Government of Canada through CIFAR, the Digital Research Alliance of Canada, companies sponsoring the Vector Institute, Advanced Research Computing at the University of British Columbia, a John R. Evans Leaders Fund CFI grant, and Compute Canada), but it does not specify the exact hardware (e.g., specific GPU or CPU models) used to run the experiments.
Software Dependencies | No | The paper mentions using 'Stable Diffusion V1.5' as a base model and the 'Adam optimizer' for training, but it does not specify version numbers for the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries critical for replication.
Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 0.0002. The EMA coefficients used for evaluation are 0.9999 and 0.999 on the Visual Genome and COCO-Stuff datasets, respectively. We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and weight decay 0.01; a constant learning rate of 0.00001 is used to train the models. Both models are trained for 200 epochs with a batch size of 120.
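The EMA coefficients reported above govern how slowly the evaluation weights track the training weights. A minimal, dependency-free sketch of that update rule (this is illustrative only, not the authors' code; the class name and scalar-parameter interface are assumptions):

```python
class EMA:
    """Exponential moving average of parameters, as used for evaluation
    weights in diffusion-model training (reported decays: 0.9999 on
    Visual Genome, 0.999 on COCO-Stuff)."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = list(params)  # EMA copy, initialized to current values

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * param
        self.shadow = [self.decay * s + (1.0 - self.decay) * p
                       for s, p in zip(self.shadow, params)]

# Example: with decay 0.999, the EMA needs thousands of steps to
# converge toward a parameter held at a fixed value.
ema = EMA([0.0], decay=0.999)
for _ in range(10000):
    ema.update([1.0])
print(ema.shadow[0])  # close to 1.0 after ~10k updates
```

A higher decay (0.9999 vs. 0.999) averages over a longer effective window, which is why the choice can differ per dataset depending on training length and noise.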