Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization for Scene Graph Generation
Authors: Guoqing Zhang, Shichao Kan, Fanghui Zhang, Wanru Xu, Yue Zhang, Yigang Cen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on datasets such as VG (Krishna et al., 2017), GQA (Hudson & Manning, 2019a) and Open Images V6 (Kuznetsova et al., 2020), achieving excellent performance, which demonstrates that our method effectively performs feature reconstruction and mitigates the biased predictions caused by long-tail distribution. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China 2School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China 3Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China 4School of Computer Science and Technology, Central South University, Hunan, China 5School of Artificial Intelligence, Henan University, Henan, China 6College of Computer and Information Engineering, Henan Normal University, Henan, China. Correspondence to: Wanru Xu <EMAIL>, Yigang Cen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Feature Reconstruction Training Process Based on Diffusion — 1: T, n {T: total number of iteration steps, n: randomly selected step sizes} 2: m = ω(Cp), v = ω(Cp) {Init feature distribution} 3: G ← m + v · N(0, 1) {Init noise input} 4: Gp ← G 5: for t in random(T, n) do 6: Et ← Embedding(t) {Initialize time embedding} 7: N ← Attn(Gp, Et, Et) {Conditional diffusion} 8: N ← Attn(N, Cp, Cp) {Conditional diffusion} 9: N ← Attn(N, Tp, Tp) {Conditional diffusion} 10: N ← ω(N) · φ(ω(Cp)) + ω(Cp) {Noise prediction} 11: Gp ← (1/√γt) · (G − ((1 − γt)/√(1 − γ̄t)) · N) {Single-step denoising} 12: end for 13: if is training then 14: loss ← MSELoss(Gp, Tp) 15: return Gp, loss 16: end if 17: return Gp |
| Open Source Code | Yes | We have uploaded the code to GitHub: https://github.com/gavin-gqzhang/NoDIS. |
| Open Datasets | Yes | We use the Visual Genome (VG) (Krishna et al., 2017) and GQA (Hudson & Manning, 2019a) datasets for model training and evaluation. Additionally, we employed the Open Images (Kuznetsova et al., 2020) dataset to further evaluate the generalization capability of our method. |
| Dataset Splits | Yes | Both VG and GQA datasets are split using the same method: 70% of the samples are used for training, 30% for testing, with 5,000 samples selected from the training set for validation. Additionally, we employed the Open Images (Kuznetsova et al., 2020) dataset... we used 126,368 images for training, 1,813 for validation, and 5,322 for testing. |
| Hardware Specification | Yes | All experiments are conducted using four NVIDIA 3090 GPUs, each with 24GB of memory. |
| Software Dependencies | No | We use a pre-trained Faster RCNN (Tang et al., 2020; Ren et al., 2015) for object detection, with the detector frozen during all three tasks. The paper mentions a software component (Faster RCNN) but does not provide specific version numbers for it or any other software libraries/frameworks. |
| Experiment Setup | Yes | The training process is divided into two phases. First, the basic scene graph generation model (Zellers et al., 2018; Vaswani et al., 2017; Tang et al., 2019) provides coarse-grained contextual information, which is used for pretraining the Noise-Guided Predicate Representation Extraction module. ... During training, the learning rate is set to 0.001. In the pre-training phase, the batch size is set to 8, and the number of iterations is 60,000. In the feature enhancement phase, the batch size is set to 8, and the number of iterations is 40,000. |
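The quoted Algorithm 1 follows a familiar conditional-diffusion pattern: initialize a noisy feature from the context statistics, predict the noise with stacked cross-attention over the time embedding, context features Cp, and target features Tp, then apply a DDPM-style single-step denoising update. The sketch below illustrates that loop with numpy. It is not the authors' implementation: the projections `omega`/`phi`, the sinusoidal time embedding, and the geometric noise schedule `gamma` are all stand-in assumptions for the learned components in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def attn(q, k, v):
    # single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def time_embedding(t, dim):
    # sinusoidal step embedding, shape (1, dim); dim assumed even
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])[None, :]

def denoise(C_p, T_p, T=50, n=10, gamma=0.98):
    """Sketch of Algorithm 1 (noise-guided feature reconstruction).
    C_p: coarse context features (L, d); T_p: target predicate features (L, d).
    omega/phi stand in for the paper's learned projections."""
    omega = lambda x: x          # hypothetical projection (identity here)
    phi = lambda x: np.tanh(x)   # hypothetical gating nonlinearity
    L, d = C_p.shape
    m, v = omega(C_p), omega(C_p)                  # init feature distribution
    G = m + v * rng.standard_normal(C_p.shape)     # init noise input
    G_p = G
    gammas = gamma ** np.arange(1, T + 1)          # assumed noise schedule
    gbar = np.cumprod(gammas)
    # n randomly selected steps, visited from large t to small t
    for t in sorted(rng.choice(T, size=n, replace=False), reverse=True):
        E_t = np.repeat(time_embedding(t, d), L, axis=0)
        N = attn(G_p, E_t, E_t)                    # condition on time step
        N = attn(N, C_p, C_p)                      # condition on context
        N = attn(N, T_p, T_p)                      # condition on targets
        N = omega(N) * phi(omega(C_p)) + omega(C_p)    # noise prediction
        # DDPM-style single-step denoising update
        G_p = (G - (1 - gammas[t]) / np.sqrt(1 - gbar[t]) * N) / np.sqrt(gammas[t])
    loss = np.mean((G_p - T_p) ** 2)               # MSE training loss
    return G_p, loss
```

Running `denoise(C_p, T_p)` on two (L, d) feature arrays returns the reconstructed features and the scalar MSE loss used during the pretraining phase described above.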