Towards context and domain-aware algorithms for scene analysis
Authors: Ibrahim Serouis, Florence Sèdes
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents an innovative approach to scene analysis in video content, which not only incorporates contextual data but also emphasizes the most significant features during training. Additionally, we introduce a methodology for integrating domain knowledge into our framework. We evaluate our proposed methodology using two comprehensive datasets, demonstrating promising results compared to a baseline study using one of the datasets. These findings underscore the importance of integrating contextual data into multimodal video analysis, while also recognizing the challenges associated with their utilization. |
| Researcher Affiliation | Academia | Ibrahim Mohamed Serouis, CNRS, IRIT, Université de Toulouse, Toulouse, France; Florence Sèdes, CNRS, IRIT, Université de Toulouse, Toulouse, France |
| Pseudocode | Yes | Algorithm 1 Training flags generation Algorithm 2 Computing the representations of nodes |
| Open Source Code | Yes | Code samples are available at: https://anonymous.4open.science/r/TMLR-2025-5CE1 |
| Open Datasets | Yes | We evaluate our approach on the ObyGaze12 dataset (Tores et al. (2024)), which comprises over 1600 scenes of varying lengths annotated with respect to 4 objectification tags. We employ the MovieGraphs human-centric situations dataset (Vicol et al. (2018)) for evaluation, which offers graphs with 101 annotated interaction classes. |
| Dataset Splits | Yes | For the interaction classification task, we reproduced the same data split as in the baseline study, LIREC (Kukleva et al. (2020)), leading to roughly 12000 training samples and 4000 validation samples. For the objectification detection task, 60% of the dataset was allocated for model training, while the remaining 40% was reserved for validation of the results. |
| Hardware Specification | No | Initially, we attempted integrating frame features extracted using X-Clip (Ni et al. (2022)) for computing the representation of the clip directly into model training, by introducing a "frame" node. However, this method resulted in a significant increase in loss for the objectification detection task, from 0.8 to an average of 10, with only a marginal 0.1% improvement in accuracy. Moreover, it substantially elevated memory consumption during graph embedding and update phases, nearing the limits of our computing unit. |
| Software Dependencies | No | The graphs obtained here are then serialized and stored in .tfrecord files, which serve as the data source for training the TensorFlow (Abadi et al. (2015)) models, with layer support provided by TensorFlow-GNN (Ferludin et al. (2022)). Subsequently, leveraging a Transformers (Wolf et al. (2020)) BERT embedding model for sequence classification, we extract the hidden states of all the lines. Further insights into each architecture are elaborated in Appendix A. |
| Experiment Setup | Yes | In all considered architectures, key parameters remain consistent, including the dimensionality of message exchanges between nodes and their neighbors across graph update layers, set at 64. Similarly, the number of message-passing iterations is uniform at 1, as is the message pooling method, employing summation across neighboring nodes. Our models are trained using sparse categorical cross-entropy as the loss function (Mao et al. (2023)). The batch size for interaction classification was adapted to our baseline study, LIREC (Kukleva et al. (2020)). Both models were trained using an Adam optimizer for 200 epochs; however, training was automatically halted using early stopping (Yao et al. (2007)) once the monitored value (accuracy or binary accuracy) on the validation set exhibited no further improvement for 40 consecutive epochs. For a comprehensive overview of the model implementations, please refer to Table 1, which summarizes the pertinent details. Table 1: Implementation details for both tasks. (a) Interaction classification: Optimizer Adam; Learning rate 3×10⁻⁵; Batch size 64; Early stopping on top-5 accuracy; Epochs 200; Dropout rate 0.1. (b) Objectification detection: Optimizer Adam; Learning rate cosine decay(10⁻⁴, 4800 steps); Batch size 32; Early stopping on validation accuracy; Epochs 200; Dropout rate 0.2. |
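The experiment setup above fixes a single message-passing iteration with sum pooling over neighboring nodes. As a minimal illustration of that aggregation scheme, the pure-Python sketch below performs one round of sum-pooled message passing over a toy directed graph; it deliberately omits the learned per-node transformations that a real TensorFlow-GNN update layer would apply, and the function name and data layout are illustrative assumptions, not the authors' code.

```python
def message_passing_step(node_feats, edges):
    """One round of sum-pooled message passing: each node receives the
    elementwise sum of its in-neighbors' feature vectors, which is then
    added to its own state (a plain residual combine, for illustration).

    node_feats: dict mapping node id -> list of floats (all same length)
    edges: list of (src, dst) pairs; messages flow src -> dst
    """
    # Accumulate incoming messages per node, starting from zero vectors.
    agg = {n: [0.0] * len(f) for n, f in node_feats.items()}
    for src, dst in edges:
        agg[dst] = [a + x for a, x in zip(agg[dst], node_feats[src])]
    # Combine each node's own state with its pooled messages.
    return {n: [h + m for h, m in zip(node_feats[n], agg[n])]
            for n in node_feats}
```

With `node_feats = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [0.0, 0.0]}` and edges `[("a", "c"), ("b", "c")]`, node `c` ends up with `[4.0, 6.0]` while `a` and `b`, having no in-edges, keep their original features.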
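The early-stopping rule reported in the setup (halt once the monitored validation metric shows no improvement for 40 consecutive epochs, out of at most 200) can be sketched as a small helper. This is a hypothetical re-implementation of the standard patience rule, not the authors' training loop; `should_stop` and its arguments are names chosen here for illustration.

```python
def should_stop(metric_history, patience=40):
    """Return True once the best value of the monitored metric
    (higher is better, e.g. validation accuracy) is `patience` or
    more epochs in the past, mirroring Keras-style early stopping."""
    if not metric_history:
        return False
    # Index of the first occurrence of the best value so far.
    best_epoch = max(range(len(metric_history)),
                     key=lambda i: metric_history[i])
    return len(metric_history) - 1 - best_epoch >= patience
```

For example, a run whose validation accuracy peaks at epoch 0 and never improves again would be halted after 40 further epochs, well before the 200-epoch cap.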