SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
Authors: Xu Zhang, Jin Yuan, Hanwang Zhang, Guojin Zhong, Yongsheng Zang, Jiacheng Lin, Zhiyong Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in Seg Captioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input. |
| Researcher Affiliation | Academia | ¹Hunan University, China; ²Nanyang Technological University, Singapore |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | MSCOCO (Lin et al. 2014) comprises 123,287 images, each annotated with five captions. ... Flickr30k Entities (Plummer et al. 2015) is an extension of Flickr30k (Young et al. 2014), consisting of 31,000 images with five captions for each one. |
| Dataset Splits | Yes | We adhere to the Karpathy split (Karpathy and Fei-Fei 2015) to allocate 113,287 images for training, 5,000 for validation, and 5,000 for testing. ... We follow the split suggested by Karpathy (Cornia et al. 2019), designating 29,000 images for training, and 1,000 for validation and testing, respectively. |
| Hardware Specification | Yes | The SGDiff network is optimized using the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.05 on two A6000 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify any software frameworks or libraries with version numbers. |
| Experiment Setup | Yes | The SGDiff network is optimized using the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.05 on two A6000 GPUs. The optimization process spans 60 epochs with a batch size of 16. |
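The reported setup (Adam, learning rate 0.0001, weight decay 0.05) can be illustrated with a single Adam update step. This is a minimal sketch, not the authors' implementation: it assumes standard Adam with weight decay applied as an L2 penalty on the gradient, and the momentum coefficients and epsilon are common defaults not stated in the paper.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, weight_decay=0.05,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter.

    lr and weight_decay match the values reported for SGDiff;
    beta1, beta2, and eps are assumed defaults.
    """
    grad = grad + weight_decay * param       # L2 weight decay folded into the gradient
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First optimization step (t=1) on a scalar parameter:
p, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments cancel, so the parameter moves by almost exactly the learning rate in the direction of the (decayed) gradient.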