Weakly Supervised Video Scene Graph Generation via Natural Language Supervision
Authors: Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through our extensive experiments on the Action Genome dataset, we demonstrate the superiority of NL-VSGG over the simple adoption of 1) a pre-trained ImgSGG model, and 2) the WS-ImgSGG pipeline to VidSGG. |
| Researcher Affiliation | Academia | 1 KAIST, 2 ETRI, 3 Korea University (author emails redacted) |
| Pseudocode | No | The paper describes its methodology in sections such as "Temporality-aware Caption Segmentation (TCS)" and "Action Duration Variability-aware Caption-Frame Alignment (ADV)" with textual descriptions and a framework diagram (Figure 3), but it does not include a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | Our code is available at https://github.com/rlqja1107/NL-VSGG. |
| Open Datasets | Yes | To train a VidSGG model without ground-truth localized video scene graphs, we use three video caption datasets: the Action Genome (Ji et al., 2020) (AG) caption, the MSVD (Chen & Dolan, 2011) caption, and the ActivityNet caption (Krishna et al., 2017a) datasets. Furthermore, to validate the generalization on other datasets, we train and evaluate our proposed framework using the VidHOI (Chiou et al., 2021) dataset |
| Dataset Splits | Yes | To evaluate our proposed NL-VSGG framework, we use the AG dataset containing ground-truth localized video scene graphs with 36 object classes (i.e., Ce) and 25 action classes (i.e., Ca), whose categories are divided into three types, i.e., 3 attention classes, 6 spatial classes, and 16 contacting classes. Following previous studies (Chen et al., 2023; Cong et al., 2021), we use 1,747 videos with 54,429 frames. For the VidHOI (Chiou et al., 2021) dataset ... the training and test sets contain 6,366 and 756 videos along with 193,911 and 22,976 frames, respectively. |
| Hardware Specification | Yes | For the experimental environment, we implemented NL-VSGG on both an NVIDIA GeForce A6000 48GB and an Intel Gaudi-v2. |
| Software Dependencies | Yes | In the TCS module, we use gpt-3.5-turbo in ChatGPT (OpenAI, 2023) for an LLM. In the ADV module, DAC (Doveh et al., 2024) is adopted for a vision-language model. Please refer to Appendix D regarding the experiment with an open-source smaller language model (Jiang et al., 2023) and another vision-language model (Wang et al., 2022c). |
| Experiment Setup | Yes | Additionally, β used to determine the number of clusters K is set to 4, and α used in the pseudo-labeling strategy is set to 15%. |
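The table above quotes α = 15% as the pseudo-labeling threshold but does not reproduce the selection rule itself. A common reading of such a threshold is "keep only the top-α fraction of predicate predictions by confidence as pseudo-labels"; the sketch below illustrates that interpretation only. The function name, the flat score list, and the ranking rule are assumptions for illustration, not the authors' code:

```python
def select_pseudo_labels(scores, alpha=0.15):
    """Return indices of the top-alpha fraction of predictions by confidence.

    Hypothetical sketch of a top-alpha% pseudo-labeling filter:
    `scores` is a flat list of per-prediction confidence values, and the
    alpha=0.15 default mirrors the 15% setting quoted in the table.
    """
    # Number of predictions to keep (at least one, so the set is never empty).
    n_keep = max(1, int(len(scores) * alpha))
    # Rank prediction indices from most to least confident.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]
```

Under this reading, with four predictions scored [0.9, 0.1, 0.5, 0.8] and alpha=0.5, the filter keeps the two most confident ones (indices 0 and 3); whether the paper applies the cutoff per frame, per video, or globally is not stated in the quoted excerpt.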