Image-to-video Adaptation with Outlier Modeling and Robust Self-learning

Authors: Junbao Zhuo, Shuhui Wang, Zhenghan Chen, Li Shen, Qingming Huang, Huimin Ma

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 3 benchmarks validate the effectiveness of the method: "We conduct experiments on 3 image-to-video action recognition benchmarks."
Researcher Affiliation | Collaboration | 1University of Science and Technology Beijing, Beijing, China; 2Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, China; 3Microsoft (China) Co., Ltd., Beijing, China; 4Sun Yat-Sen University, Guangzhou, China; 5University of Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but it does not contain a clearly labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | The codes are available at https://github.com/junbaoZHUO/OMRSI2V.
Open Datasets | Yes | We conduct experiments on 3 image-to-video benchmarks, namely Stanford40→UCF101 (S→U), BU101→UCF101 (B→U) and EADs→HMDB51 (E→H), for evaluation. Specifically, in S→U, the source image domain is Stanford40 (Yao et al. 2011) and the target video domain is UCF101 (Soomro, Zamir, and Shah 2012). For B→U, the source image domain is replaced by the BU101 dataset (Ma et al. 2017) and all 101 classes of both BU101 and UCF101 are used for evaluation, since the categories of BU101 are exactly the same as those of UCF101. For the E→H benchmark, the source image domain is EADs (Chen et al. 2021), which combines Stanford40 and the HII dataset (Tanisik, Zalluhoglu, and Ikizler-Cinbis 2016), and the target video domain is HMDB51 (Kuehne et al. 2011).
Dataset Splits | No | The paper specifies which common categories are used from the various datasets (e.g., "12 common categories across Stanford40 and UCF101", "total 101 classes of both BU101 and UCF101"), but it does not provide explicit training, validation, or test split percentages, sample counts, or references to predefined splits for reproducibility of the data partitioning. It mentions "We randomly choose 32 frames over target video for training and uniformly extract 32 frames for inference," which pertains to frame sampling, not dataset splits.
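The frame-sampling scheme quoted above (32 random frames per target video during training, 32 uniformly spaced frames at inference) can be sketched as follows. This is a hypothetical helper, not code from the paper's repository:

```python
import random

def sample_frame_indices(num_frames, k=32, training=True):
    """Illustrative sketch of the paper's frame sampling:
    random frames (in temporal order) for training,
    uniformly spaced frames for inference."""
    if training:
        if num_frames >= k:
            # sample k distinct frames, then restore temporal order
            picks = random.sample(range(num_frames), k)
        else:
            # short clips: sample with replacement
            picks = [random.randrange(num_frames) for _ in range(k)]
        return sorted(picks)
    # inference: k evenly spaced indices covering the whole clip
    return [round(i * (num_frames - 1) / (k - 1)) for i in range(k)]
```

At inference the uniform grid always includes the first and last frame, so repeated runs score the same clip identically.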
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, memory specifications, or cloud computing environments used for running the experiments.
Software Dependencies | No | The paper mentions software components such as ResNet-50, an I3D model based on Inception v1, the Stochastic Gradient Descent (SGD) optimizer, and FixMatch, but it does not provide specific version numbers for these or for any key software libraries, frameworks, or programming languages.
Experiment Setup | Yes | We train the ResNet using the Stochastic Gradient Descent (SGD) optimizer. The weight decay, batch size, and momentum are set to 3e-4, 36, and 0.9. We use a learning rate annealing strategy at the k-th iteration, ξ_k = ξ_0 (1 + 0.001p)^(−0.75), where ξ_0 denotes the initial learning rate and p is a parameter that linearly increases from 0 to 1 as k increases. ξ_0 = 3e-3 for S→U and E→H, and ξ_0 = 5e-3 for B→U. We train the model in stage 1 with 10000 iterations for the S→U and E→H tasks, and with 40000 iterations for the B→U task. λb and λpoc are set to 0.5 and 0.1, respectively. We train the modified I3D network using the SGD optimizer, setting the weight decay, momentum, and batch size to 0.0001, 0.9, and 16. We train the model with 30 epochs for the B→U task, and with 20 epochs for the others. The initial learning rates for S→U, E→H, and B→U are 0.05, 0.05, and 0.1. We use a multistep learning rate decay with a factor of 0.1; the milestones are set at the midpoint and at 75% of the epochs. λu = 1. We randomly choose 32 frames over the target video for training and uniformly extract 32 frames for inference. τ in Eqn. (23) is set to 0.9, following (Sohn et al. 2020).
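The quoted setup combines three schedules: the stage-1 annealed learning rate, the stage-2 multistep decay, and a FixMatch-style confidence threshold τ. A minimal sketch under the stated hyperparameters (the negative exponent in the annealing formula is assumed, as is standard; function names are illustrative):

```python
def annealed_lr(xi0, p, alpha=0.001, beta=0.75):
    """Stage-1 (ResNet) schedule: xi_k = xi0 * (1 + alpha*p)^(-beta),
    with p rising linearly from 0 to 1 over training.
    The negative exponent is an assumption based on the usual annealing form."""
    return xi0 * (1.0 + alpha * p) ** (-beta)

def multistep_lr(lr0, epoch, total_epochs, gamma=0.1):
    """Stage-2 (I3D) schedule: decay by a factor of 0.1 at the midpoint
    and at 75% of the total epochs."""
    milestones = (total_epochs // 2, int(total_epochs * 0.75))
    return lr0 * gamma ** sum(epoch >= m for m in milestones)

def confident_mask(probs, tau=0.9):
    """FixMatch-style selection: keep a pseudo-label only when the
    model's maximum class probability reaches tau (0.9 here)."""
    return [max(p) >= tau for p in probs]
```

For the 20-epoch S→U run this yields milestones at epochs 10 and 15, so the initial rate of 0.05 drops to 0.005 and then 0.0005.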