Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
Authors: Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets. We conduct comprehensive experiments to evaluate the effectiveness of our motion-aware contrastive framework. We first describe the experiment settings, covering the evaluation datasets, evaluation metrics, baseline methods, and implementation details. Next, we present quantitative results of our method, then provide ablation study and careful analysis to explore properties of our motion-aware contrastive framework. Eventually, we conduct qualitative analysis to concretely examine its behavior. |
| Researcher Affiliation | Academia | (1) Institute of Data Science (IDS), National University of Singapore, Singapore; (2) Nanyang Technological University (NTU), Singapore; (3) Tongji University, China. All listed institutions are academic universities. |
| Pseudocode | Yes | Algorithm 1: Computing the optimal transport distance |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository for the methodology described. |
| Open Datasets | Yes | We assess the effectiveness of our method on natural and 4D video inputs. The corresponding dataset to each input type is as follows: Open-domain Panoptic video scene graph generation (Open PVSG) (Yang et al. 2023): Open PVSG consists of scene graphs and associated segmentation masks with respect to subject and object nodes in the scene graph. Panoptic scene graph generation for 4D (PSG4D) (Yang et al. 2024): The PSG4D dataset is divided into two groups, i.e. PSG4D-GTA and PSG4D-HOI. |
| Dataset Splits | No | The paper mentions training, fine-tuning, and validation processes, but does not provide specific percentages or counts for training, validation, and test splits for the datasets used. It refers to established datasets like Open PVSG and PSG4D, which likely have standard splits, but these are not explicitly detailed in the paper. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using optimizers such as AdamW and Adam, and models such as Mask2Former, Video K-Net, the UniTrack tracker, ResNet-101, and DKNet. However, it does not specify version numbers for these software components or for any programming languages/libraries such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For fair comparison, we evaluate our contrastive framework with both IPS+T and VPS as the segmentation module for panoptic video scene graph generation. In the former case, we leverage the UniTrack tracker (Wang et al. 2021) combined with the Mask2Former model (Cheng et al. 2022), which is initialized from the best-performing COCO-pretrained weights and fine-tuned for 8 epochs using the AdamW optimizer with a batch size of 32, a learning rate of 0.0001, weight decay of 0.05, and gradient clipping with a max L2 norm of 0.01. In the latter case, we utilize Video K-Net (Li et al. 2022), also initialized from COCO-pretrained weights and fine-tuned with the same strategy as IPS+T. In the relation classification step, we conduct fine-tuning with a batch size of 32, employing the Adam optimizer with a learning rate of 0.001. Based on validation, we adopt a threshold γ = 9.0 and a margin α = 10.0. We set the maximum number of iterations Niter to 1,000. |
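The pseudocode entry above ("Algorithm 1: Computing the optimal transport distance"), together with the reported iteration cap Niter = 1,000, suggests an iterative solver. As a hedged illustration only (the paper's exact formulation is not quoted here), a standard entropy-regularized Sinkhorn solver for an optimal transport distance looks like the following; the regularization strength `eps` is a hypothetical parameter, not a value from the paper:

```python
import numpy as np

def sinkhorn_ot_distance(a, b, C, eps=0.1, n_iter=1000):
    """Entropy-regularized optimal transport distance via Sinkhorn iterations.

    a, b   : source/target marginals (1-D arrays, each summing to 1)
    C      : cost matrix of shape (len(a), len(b))
    eps    : entropic regularization strength (illustrative default)
    n_iter : number of Sinkhorn iterations (the paper caps this at 1,000)
    """
    K = np.exp(-C / eps)               # Gibbs kernel derived from the cost
    u = np.ones_like(a, dtype=float)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # scale columns to match marginal b
        u = a / (K @ v)                # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]    # resulting transport plan
    return float(np.sum(P * C))        # transport cost <P, C>
```

With matching marginals and a zero-diagonal cost matrix, the computed distance approaches zero; moving all mass across a unit-cost edge yields a distance near the transported mass times that cost.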
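The fine-tuning recipe in the setup row (AdamW, batch size 32, learning rate 1e-4, weight decay 0.05, gradient clipping at max L2 norm 0.01, 8 epochs) can be sketched as a PyTorch training-loop skeleton. This is a minimal reconstruction from the stated hyperparameters, not the authors' code; `model`, `loader`, and `loss_fn` are hypothetical placeholders:

```python
import torch
from torch.nn.utils import clip_grad_norm_

def finetune(model, loader, loss_fn, epochs=8, device="cpu"):
    # Hyperparameters as reported for the IPS+T / VPS segmentation fine-tuning.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    model.to(device).train()
    for _ in range(epochs):
        for inputs, targets in loader:  # batch size 32 is set on the DataLoader
            optimizer.zero_grad()
            loss = loss_fn(model(inputs.to(device)), targets.to(device))
            loss.backward()
            # Clip gradients to the reported max L2 norm before the step.
            clip_grad_norm_(model.parameters(), max_norm=0.01)
            optimizer.step()
    return model
```

The relation classification step described in the row would follow the same skeleton with plain Adam and a learning rate of 0.001.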