The Role of Video Generation in Enhancing Data-Limited Action Understanding
Authors: Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. We conducted extensive experiments on four datasets across five tasks and achieved state-of-the-art performance for zero-shot action recognition." Section 4 (Experiment) comprises 4.1 Implementation Details, 4.2 Main Results (Zero-shot Action Recognition, ...), and 4.3 Ablation Studies. |
| Researcher Affiliation | Academia | 1 Institute of Information Engineering, Chinese Academy of Sciences; 2 VCIP & TMCC & DISSec, College of Computer Science, Nankai University; 3 School of Cyber Security, University of Chinese Academy of Sciences; 4 Queen Mary University of London |
| Pseudocode | No | The paper describes the methods in text and uses flowcharts/diagrams in Figure 2 to illustrate the process, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We conducted extensive experiments on four datasets (Kinetics-600, UCF-101, HMDB-51, UCF-Crime) |
| Dataset Splits | Yes | We conducted the few-shot action recognition experiments with the UCF-101 and HMDB-51 datasets in Table 2. We first pre-train the model with generated samples and then fine-tune it on each dataset with only K samples per category, where K is in 2, 4, 8 and 16. In each dataset, 16 samples per category are selected from half of the classes to construct the base split for training, while the remaining half of the categories serve as the novel split for evaluation. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific models and APIs (CogVideoX-2B, GPT-4o, TC-CLIP, X-CLIP-B/32) but does not provide specific version numbers for underlying software libraries, frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) or other ancillary software dependencies. |
| Experiment Setup | Yes | For each dataset, we generate 128 videos for each category with 50 inference steps. We set the w to 0.3 in uncertainty-based label smoothing. For tasks where real samples are not available such as zero-shot, we train the model with synthetic samples only. For tasks where real samples are available such as few-shot, long-tail, etc., we pre-train the model with synthetic samples and then fine-tune with the real samples. |
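The quoted setup mentions "uncertainty-based label smoothing" with w = 0.3 but does not spell out the smoothing rule. The sketch below shows plain weighted label smoothing with a fixed w, which is only an assumption about the baseline form; the paper's uncertainty-based variant presumably modulates w per generated sample, and that logic is not reproduced here.

```python
import numpy as np

def smooth_labels(labels, num_classes, w=0.3):
    """Weighted label smoothing (illustrative sketch only).

    Mixes hard one-hot targets with a uniform distribution using
    weight w. The paper's uncertainty-based variant likely adapts
    w per sample; here w is fixed at 0.3, matching the quoted setup.
    """
    one_hot = np.eye(num_classes)[labels]            # (N, C) hard targets
    uniform = np.full_like(one_hot, 1.0 / num_classes)
    return (1.0 - w) * one_hot + w * uniform         # softened targets

# Example: two samples with class indices 0 and 2, four classes.
targets = smooth_labels(np.array([0, 2]), num_classes=4, w=0.3)
```

With w = 0.3 and four classes, the true class receives 0.7 + 0.3/4 = 0.775 and each other class 0.075, so every row still sums to 1.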