BDC-CLIP: Brownian Distance Covariance for Adapting CLIP to Action Recognition
Authors: Fei Long, Xiaoou Li, Jiaming Lv, Haoyuan Yang, Xianjun Cheng, Peihua Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first describe the experimental setup (5.1). Then we compare to state-of-the-art methods in light of performance (5.2) and cost (5.3). We finally conduct an ablation study (5.4) and provide qualitative analysis (5.5). Our method achieves strong performance across a range of video recognition tasks, including zero-shot, few-shot, base-to-novel, and fully supervised recognition, demonstrating its ability to capture subtle spatio-temporal cues critical for video action understanding. To facilitate fast ablation, we pretrain on K400tiny (Rasheed et al., 2023), where each class has 100 training videos, and evaluate zero-shot recognition on K600 and K-shot (K = 2, 16) recognition on HMDB-51 and SSv2. |
| Researcher Affiliation | Academia | 1School of Information and Communication Engineering, Dalian University of Technology, Dalian, China. 2School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China. |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on five widely used action recognition datasets, i.e., Kinetics-400 (K400) (Carreira & Zisserman, 2017), Kinetics-600 (K600) (Carreira et al., 2018), HMDB-51 (Kuehne et al., 2011), UCF-101 (Soomro et al., 2012) and Something Something V2 (SSv2) (Goyal et al., 2017). |
| Dataset Splits | Yes | We conduct experiments on five widely used action recognition datasets, i.e., Kinetics-400 (K400)... It provides approximately 240K training videos, 20K validation videos and 40K test videos... HMDB-51... Three training/validation splits are predefined, each of which contains 3,570 training and 1,530 validation videos. UCF-101... It defines three splits for training and validation, each containing roughly 9.5K training and 3.7K validation videos. Something-Something v2 (SSv2)... providing about 168K training and 24K validation videos. For Zero-shot recognition... On HMDB-51 and UCF-101, evaluations are conducted on the three official test splits. On K600... we use three splits randomly selected... For Few-shot recognition... We randomly sample K videos per category for training, while testing on the first validation split for HMDB-51 and UCF-101, along with the full validation split for SSv2. For Base-to-novel generalization... Three training splits per dataset are constructed. On HMDB-51 and UCF-101, only the first training split is used for training and validation; on K400 and SSv2, evaluations are performed on the full validation set. |
| Hardware Specification | Yes | All experiments are conducted using GeForce RTX 4090 GPUs with the PyTorch framework. |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework' but does not specify a version number or other key software dependencies with their versions. |
| Experiment Setup | Yes | Pretraining on K400: For zero-shot pretraining, we use the AdamW optimizer with β1 = 0.9, β2 = 0.98, a weight decay of 1e-3 and a batch size of 256. The base learning rate (LR) of the backbone is 8e-6 with a cosine schedule over 10 epochs. The LRs of the adapters and the vision classifier are 100× and 50× the base LR, respectively. ... We sample 32 frames per video and conduct inference with 1 temporal clip and 1 spatial crop (1×1 view). Downstream tasks with K400-pretrained models: The few-shot and base-to-novel settings use a batch size of 64 and a learning rate of 2e-6 with a cosine schedule over 60 epochs, including a linear warmup over the first 5 epochs. We set the learning rates of the adapters and the vision classifier to 200× and 100× the base LR, respectively. ... We use 32 sampled frames and conduct 1×1 view inference. Fully supervised training on K400: The model is trained for 30 epochs, including 5 linear warmup epochs, with a batch size of 512; the base LR is set to 2.2e-5 with a cosine schedule. |
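The schedule described in the Experiment Setup row (linear warmup, then cosine decay, with adapters and the vision classifier trained at large multiples of the backbone's base LR) can be sketched in plain Python. This is a minimal illustration under the hyperparameters quoted above, not the authors' implementation; the function name and the per-group `multiplier` argument are assumptions for clarity.

```python
import math

def lr_at_epoch(epoch, base_lr=2e-6, total_epochs=60, warmup_epochs=5,
                multiplier=1.0):
    """Linear warmup followed by cosine decay, scaled per parameter group.

    Defaults mirror the few-shot / base-to-novel settings quoted above
    (base LR 2e-6, 60 epochs, 5 warmup epochs); `multiplier` models the
    200x (adapters) and 100x (vision classifier) scaling of the base LR.
    Illustrative sketch only -- not the authors' code.
    """
    peak = base_lr * multiplier
    if epoch < warmup_epochs:
        # Linear warmup from 0 up to the peak LR.
        return peak * (epoch + 1) / warmup_epochs
    # Cosine decay from the peak LR toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

# Per-group LRs at a given epoch (hypothetical groups, for illustration):
backbone_lr = lr_at_epoch(10)                    # backbone at 1x base LR
adapter_lr = lr_at_epoch(10, multiplier=200.0)   # adapters at 200x base LR
classifier_lr = lr_at_epoch(10, multiplier=100.0)  # classifier at 100x base LR
```

In a PyTorch training loop this would typically be realized with one optimizer parameter group per module, each group's LR set from such a function each epoch.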