Efficient Self-Supervised Video Hashing with Selective State Spaces
Authors: Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. ... We conduct extensive experiments on 4 datasets: ActivityNet, FCVID, UCF101, and HMDB51, demonstrating that S5VH outperforms state-of-the-art baselines under various setups and transfers better across datasets. ... Additionally, we provide comprehensive ablations and analyses, focusing on network architecture and training strategy. |
| Researcher Affiliation | Collaboration | 1Tsinghua Shenzhen International Graduate School, Tsinghua University 2Harbin Institute of Technology, Shenzhen ... 4Meituan, Beijing |
| Pseudocode | No | The paper describes an optimization problem for hash center generation using equations and prose (e.g., 'Optimization for Hash Center Generation' section), but it does not present this as a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code https://github.com/gimpong/AAAI25-S5VH |
| Open Datasets | Yes | We conduct experiments on 4 benchmark datasets. (i) ActivityNet (Caba Heilbron et al. 2015) ... (ii) FCVID (Jiang et al. 2017) ... (iii) UCF101 (Soomro, Zamir, and Shah 2012) ... (iv) HMDB51 (Kuehne et al. 2011) |
| Dataset Splits | Yes | (i) ActivityNet ... using 9,722 videos for training. We uniformly sample 1,000 videos across 200 categories in the validation set as queries, and the remaining 3,758 videos as the database. ... (iii) UCF101 ... We use 9,537 videos for training and the database, and 3,783 videos from the test set as the query set. (iv) HMDB51 ... We use 3,570 videos for both training and the database, and 1,530 videos from the test set are designated as the query set. |
| Hardware Specification | No | We perform stress testing with them in the same computational environment, taking 5 samples as a unit to probe the maximally affordable batch sizes and measuring the average inference time per sample. |
| Software Dependencies | No | For the model training, we choose the AdamW optimizer with default parameters in PyTorch |
| Experiment Setup | Yes | For the model training, we choose the AdamW optimizer with default parameters in PyTorch, and employ a cosine annealed learning rate scheduling from 5e-4 to 1e-5. The models are trained for up to 350 epochs with 5-patience early-stopping to prevent overfitting. The default hyperparameter configurations are as below: (i) We set the mask ratio ρ = |M|/Nt to 0.75 on the FCVID dataset and 0.5 on the rest of the datasets. (ii) The temperature factor τ in Equations (20) and (21) is set to 0.5. (iii) The number of semantic centers Nc is set to 450 on FCVID and 100 on the other datasets. ... we use 6 layers for the encoder and 1 layer for the decoder. The latent dimensions of the encoder and decoder are set to 256 and 192, respectively. |
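The training schedule quoted above (cosine-annealed learning rate from 5e-4 to 1e-5, up to 350 epochs, 5-patience early stopping) can be sketched in plain Python. This is an illustrative reconstruction, not code from the paper's repository; the function and class names (`cosine_annealed_lr`, `EarlyStopping`) are our own, and the paper likely relies on PyTorch's built-in equivalents.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=350, lr_max=5e-4, lr_min=1e-5):
    """Cosine-annealed learning rate: lr_max at epoch 0, lr_min at total_epochs.

    Matches the schedule described in the setup ('from 5e-4 to 1e-5').
    """
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a PyTorch training loop, the same behavior would typically come from `torch.optim.AdamW` combined with `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=350, eta_min=1e-5)`; the sketch above only makes the schedule's arithmetic explicit.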