Scene Graph-Grounded Image Generation

Authors: Fuyun Wang, Tong Zhang, Yuanzhi Wang, Xiaoya Zhang, Xin Liu, Zhen Cui

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive quantitative and qualitative experiments demonstrate that the proposed SGG-IG method generates perceptually appealing images with distinct textures and coherent scene relationships, achieving superior performance. Experiments are conducted on two standard benchmarks, Visual Genome (Krishna et al. 2017) and COCO-Stuff (Caesar, Uijlings, and Ferrari 2018). Two standard evaluation metrics, Inception Score (IS) (Szegedy et al. 2016) and Fréchet Inception Distance (FID) (Heusel et al. 2017), are used to measure the quality of scene graph-to-image generation. As shown in Tab. 1, SGG-IG achieves state-of-the-art performance in both FID and IS compared to other baseline methods at the 128×128 and 256×256 resolution settings.
Researcher Affiliation | Collaboration | 1 Nanjing University of Science and Technology, China; 2 Nanjing Seeta Cloud Technology, China. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code will be available at our site." (https://github.com/fuyunwang/SGG-IG)
Open Datasets | Yes | Following previous works (Johnson, Gupta, and Fei-Fei 2018; Li et al. 2019; Ashual and Wolf 2019) on scene graph-to-image generation, we conduct our experiments on two standard benchmarks, Visual Genome (Krishna et al. 2017) and COCO-Stuff (Caesar, Uijlings, and Ferrari 2018).
Dataset Splits | Yes | Following prior studies (Cheng et al. 2023; Zheng et al. 2023; Yang et al. 2023), we partition the data into training, validation, and test sets with proportions of 80%, 10%, and 10%, respectively. Finally, small and uncommon objects were removed, yielding 62,565 images in the VG dataset for training, 5,062 for validation, and 5,096 for testing. COCO-Stuff collects 164K images from COCO 2017, which contains 80 object categories and 91 stuff categories, along with object bounding boxes and pixel-level segmentation masks. Following (Yang et al. 2022; Zheng et al. 2023), we first partition the COCO 2017 images into training/validation/test subsets of 40,000/5,000/5,000 images, respectively. We then disregard objects occupying less than 2% of the image area and keep images containing 3 to 8 objects, yielding 24,972 images for training, 1,024 for validation, and 2,048 for testing.
Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 4090 GPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer and stable-diffusion-v1-4 as initialization weights but does not specify version numbers for general software dependencies such as Python or PyTorch.
Experiment Setup | Yes | In the pre-training stage, a standard multi-layer graph convolutional network is employed to implement the relation embedding module, where the node objects and relation edges of the input scene graph are processed into 512-dimensional vectors. For pre-training with mask self-supervision, we set the random mask ratio to 0.3, and for pre-training with the spatial constraint, we use the CLIP image encoder to encode the images. In the fine-tuning stage, we adopt stable-diffusion-v1-4 (Rombach et al. 2022) as the initialization weights of SGG-IG, and the sampling process uses DDIM (Song, Meng, and Ermon 2020) sampling with 100 steps. In both the pre-training and fine-tuning stages, we employ the Adam optimizer (Kingma and Ba 2014) with learning rates of 5e-4 and 1e-6, respectively. We set the batch size to 2, and perform 700,000 iterations in the pre-training stage and 30,000 iterations in the fine-tuning stage.
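The two reported metrics have closed-form definitions over classifier outputs. Below is a minimal NumPy sketch of those definitions, not the paper's evaluation pipeline (which computes them from Inception-v3 activations); function names are illustrative.

```python
import numpy as np

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet Inception Distance between two Gaussians fitted to features:
    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    The trace of the matrix square root is computed from the eigenvalues of
    the PSD product S_r @ S_g, which are real and non-negative."""
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def inception_score(probs, eps=1e-12):
    """Inception Score: exp of the mean KL divergence between per-image class
    posteriors p(y|x) (rows of `probs`, shape (N, C)) and the marginal p(y)."""
    marginal = probs.mean(axis=0)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Identical feature distributions give FID = 0, and a classifier that is uniformly uncertain on every image gives IS = 1, the metric's lower bound.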
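The dataset preparation described in the Dataset Splits row (an 80/10/10 partition, plus dropping objects under 2% of the image area and keeping images with 3 to 8 remaining objects) can be sketched as follows. Both helpers are hypothetical: the paper follows the splits of prior works rather than re-randomizing, so this only illustrates the stated proportions and filters.

```python
import random

def split_80_10_10(ids, seed=0):
    # Shuffle deterministically, then cut at 80% / 90% of the list.
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def keep_image(obj_areas, img_area, min_frac=0.02, lo=3, hi=8):
    # Disregard objects occupying less than `min_frac` of the image,
    # then keep the image only if 3 to 8 objects remain.
    objs = [a for a in obj_areas if a / img_area >= min_frac]
    return lo <= len(objs) <= hi
```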
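The hyperparameters quoted in the Experiment Setup row can be collected into a single illustrative configuration. The key names below are assumptions for readability and are not taken from the released code; only the values come from the paper.

```python
# Hypothetical config mirroring the reported hyperparameters (values from the
# paper; key names are illustrative, not from the SGG-IG repository).
PRETRAIN = dict(
    embed_dim=512,        # GCN node/edge embedding dimensionality
    mask_ratio=0.3,       # random mask ratio for mask self-supervision
    image_encoder="CLIP", # encodes images for the spatial constraint
    optimizer="adam",
    lr=5e-4,
    batch_size=2,
    iterations=700_000,
)
FINETUNE = dict(
    init_weights="stable-diffusion-v1-4",  # Rombach et al. 2022
    sampler="ddim",                        # Song, Meng, and Ermon 2020
    ddim_steps=100,
    optimizer="adam",
    lr=1e-6,
    batch_size=2,
    iterations=30_000,
)
```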