BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition

Authors: Fei Long, Xiaoou Li, Jiaming Lv, Haoyuan Yang, Xianjun Cheng, Peihua Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first describe the experimental setup (5.1). Then we compare to state-of-the-art methods in terms of performance (5.2) and cost (5.3). We finally conduct an ablation study (5.4) and provide qualitative analysis (5.5). Our method achieves strong performance across a range of video recognition tasks, including zero-shot, few-shot, base-to-novel, and fully supervised recognition, demonstrating its ability to capture subtle spatio-temporal cues critical for video action understanding. To facilitate fast ablation, we pretrain on K400tiny (Rasheed et al., 2023), where each class has 100 training videos, and evaluate zero-shot recognition on K600 and K-shot (K = 2, 16) recognition on HMDB-51 and SSv2.
Researcher Affiliation | Academia | (1) School of Information and Communication Engineering, Dalian University of Technology, Dalian, China. (2) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China.
Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on five widely used action recognition datasets, i.e., Kinetics-400 (K400) (Carreira & Zisserman, 2017), Kinetics-600 (K600) (Carreira et al., 2018), HMDB-51 (Kuehne et al., 2011), UCF-101 (Soomro et al., 2012) and Something-Something V2 (SSv2) (Goyal et al., 2017).
Dataset Splits | Yes | We conduct experiments on five widely used action recognition datasets, i.e., Kinetics-400 (K400)... It provides approximately 240K training videos, 20K validation videos and 40K test videos... HMDB-51... Three training/validation splits are predefined, each of which contains 3,570 training and 1,530 validation videos. UCF-101... It defines three splits for training and validation, each containing roughly 9.5K training and 3.7K validation videos. Something-Something V2 (SSv2)... providing about 168K training and 24K validation videos. For zero-shot recognition... On HMDB-51 and UCF-101, evaluations are conducted on the three official test splits. On K600... we use three splits randomly selected... For few-shot recognition... We randomly sample K videos per category for training, while testing on the first validation split for HMDB-51 and UCF-101, along with the full validation split for SSv2. For base-to-novel generalization... Three training splits per dataset are constructed. On HMDB-51 and UCF-101, only the first training split is used for training and validation; on K400 and SSv2, evaluations are performed on the full validation set.
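The few-shot protocol quoted above (randomly sampling K videos per category for training) can be sketched as follows. The helper name, annotation format, and fixed seed are illustrative assumptions; the paper only states that K videos per class are sampled.

```python
import random
from collections import defaultdict

def sample_k_shot(annotations, k, seed=0):
    """Hypothetical helper: pick k training videos per class.

    annotations: iterable of (video_id, class_label) pairs.
    Returns a list of (video_id, class_label) pairs with exactly
    k entries per class, sampled without replacement.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    by_class = defaultdict(list)
    for video_id, label in annotations:
        by_class[label].append(video_id)
    subset = []
    for label in sorted(by_class):
        subset.extend((v, label) for v in rng.sample(by_class[label], k))
    return subset

# Toy example: 30 videos over 3 classes, K = 2 shots per class
toy = [(f"vid{i}", i % 3) for i in range(30)]
few_shot_train = sample_k_shot(toy, k=2)
```

Testing on the full (or first official) validation split, as the paper does, keeps evaluation comparable across different K.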
Hardware Specification | Yes | All experiments are conducted using GeForce RTX 4090 GPUs with the PyTorch framework.
Software Dependencies | No | The paper mentions using the 'PyTorch framework' but does not specify a version number or other key software dependencies with their versions.
Experiment Setup | Yes | Pretraining on K400: For zero-shot pretraining, we use the AdamW optimizer with β1 = 0.9, β2 = 0.98, a weight decay of 1e-3 and a batch size of 256. The base learning rate (LR) of the backbone is 8e-6 with a cosine schedule over 10 epochs. The LRs of the adapters and vision classifier are 100× and 50× the base LR, respectively. ... We sample 32 frames per video and conduct inference with 1 temporal clip and 1 spatial crop (1x1 view). Downstream tasks with K400-pretrained models: The few-shot and base-to-novel settings use a batch size of 64, a learning rate of 2e-6 with a cosine schedule over 60 epochs, and a linear warmup over the first 5 epochs. We set the learning rates of the adapters and vision classifier to 200× and 100× the base LR, respectively. ... We use 32 sampled frames and conduct 1x1 view inference. Fully supervised training on K400: The model is trained for 30 epochs, including 5 linear warmup epochs, with a batch size of 512; the base LR is set to 2.2e-5 with a cosine schedule.
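The schedule described above (linear warmup followed by cosine decay, with separate LR multipliers for the adapters and vision classifier relative to the backbone's base LR) can be sketched as a minimal per-epoch function. The function name and signature are assumptions for illustration; the hyperparameter values are taken from the quoted setup.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs=0, multiplier=1.0):
    """Per-epoch LR: linear warmup to the peak, then cosine decay to 0.

    multiplier models the paper's scaling (e.g. 200x for adapters,
    100x for the vision classifier in the downstream settings).
    """
    peak = base_lr * multiplier
    if epoch < warmup_epochs:
        # linear ramp: reaches the peak on the last warmup epoch
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Downstream setting: base LR 2e-6, 60 epochs, 5 warmup epochs,
# adapters at 200x the base LR
adapter_lr_start = lr_at_epoch(0, 2e-6, 60, warmup_epochs=5, multiplier=200)
```

In PyTorch this would typically be realized with per-module parameter groups passed to `torch.optim.AdamW`, each carrying its own `lr`, with the warmup-plus-cosine curve applied by a scheduler.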