Weakly Supervised Video Scene Graph Generation via Natural Language Supervision
Authors: Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through our extensive experiments on the Action Genome dataset, we demonstrate the superiority of NL-VSGG over the simple adoption of 1) a pre-trained ImgSGG model, and 2) the WS-ImgSGG pipeline to VidSGG. |
| Researcher Affiliation | Academia | 1 KAIST, 2 ETRI, 3 Korea University (author emails redacted) |
| Pseudocode | No | The paper describes its methodology in sections such as "Temporality-aware Caption Segmentation (TCS)" and "Action Duration Variability-aware Caption-Frame Alignment (ADV)" with textual descriptions and a framework diagram (Figure 3), but it does not include a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | Our code is available at https://github.com/rlqja1107/NL-VSGG. |
| Open Datasets | Yes | To train a VidSGG model without ground-truth localized video scene graphs, we use three video caption datasets: the Action Genome (Ji et al., 2020) (AG) caption, the MSVD (Chen & Dolan, 2011) caption, and the ActivityNet caption (Krishna et al., 2017a) datasets. Furthermore, to validate the generalization on other datasets, we train and evaluate our proposed framework using the VidHOI (Chiou et al., 2021) dataset |
| Dataset Splits | Yes | To evaluate our proposed NL-VSGG framework, we use the AG dataset containing ground-truth localized video scene graphs with 36 object classes (i.e., Ce) and 25 action classes (i.e., Ca), whose categories are divided into three types, i.e., 3 attention classes, 6 spatial classes, and 16 contacting classes. Following previous studies (Chen et al., 2023; Cong et al., 2021), we use 1,747 videos with 54,429 frames. For the VidHOI (Chiou et al., 2021) dataset ... the training and test sets contain 6,366 and 756 videos along with 193,911 and 22,976 frames, respectively. |
| Hardware Specification | Yes | For the experimental environment, we implemented NL-VSGG on both an NVIDIA GeForce A6000 48GB and an Intel Gaudi-v2. |
| Software Dependencies | Yes | In the TCS module, we use gpt-3.5-turbo in ChatGPT (OpenAI, 2023) for an LLM. In the ADV module, DAC (Doveh et al., 2024) is adopted for a vision-language model. Please refer to Appendix D regarding the experiment with an open-source smaller language model (Jiang et al., 2023) and another vision-language model (Wang et al., 2022c). |
| Experiment Setup | Yes | Additionally, β used to determine the number of clusters K is set to 4, and α used in the pseudo-labeling strategy is set to 15%. |
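The table above quotes α = 15% as the pseudo-labeling threshold but does not reproduce the selection rule itself. A common reading of such a threshold is "keep only the top-α fraction of predicate predictions by confidence as pseudo-labels"; the sketch below illustrates that interpretation only. The function name, the flat score list, and the ranking rule are assumptions for illustration, not the authors' code:

```python
def select_pseudo_labels(scores, alpha=0.15):
    """Return indices of the top-alpha fraction of predictions by confidence.

    Hypothetical sketch of a top-alpha% pseudo-labeling filter:
    `scores` is a flat list of per-prediction confidence values, and the
    alpha=0.15 default mirrors the 15% setting quoted in the table.
    """
    # Number of predictions to keep (at least one, so the set is never empty).
    n_keep = max(1, int(len(scores) * alpha))
    # Rank prediction indices from most to least confident.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]
```

Under this reading, with four predictions scored [0.9, 0.1, 0.5, 0.8] and alpha=0.5, the filter keeps the two most confident ones (indices 0 and 3); whether the paper applies the cutoff per frame, per video, or globally is not stated in the quoted excerpt.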