ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Authors: Amir Aghdam, Vincent Tao Hu, Björn Ommer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on the highly challenging and diverse ActionAtlas (Salehi et al., 2024) benchmark, our method achieves state-of-the-art performance, outperforming both CLIP-style baselines and billion-parameter video-language models. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding. |
| Researcher Affiliation | Academia | Amir Aghdam EMAIL Department of Computer Science, Temple University, Philadelphia, PA, USA; Vincent Tao Hu EMAIL CompVis @ LMU Munich, Munich Center for Machine Learning, Munich, Germany; Björn Ommer EMAIL CompVis @ LMU Munich, Munich Center for Machine Learning, Munich, Germany |
| Pseudocode | No | The paper describes the methodology using textual descriptions and diagrams, but does not include a distinct pseudocode block or algorithm listing. |
| Open Source Code | Yes | Code Link: https://amir-aghdam.github.io/act-align/ |
| Open Datasets | Yes | To rigorously evaluate its generality, we adopt ActionAtlas (Salehi et al., 2024), to our knowledge the most diverse and challenging benchmark for fine-grained action recognition across various domains. |
| Dataset Splits | No | In our zero-shot setting, no video examples of the target classes Y are used for training or tuning; only high-level action names c_j are provided. For each video V_i, we retain its multiple-choice candidate set {c_{i,1}, . . . , c_{i,M_i}}, and replace each class label with an LLM-generated sub-action sequence. (Dataset statistics are provided in the Appendix.) |
| Hardware Specification | Yes | All experiments run on a single NVIDIA RTX A5000 GPU (24 GB). |
| Software Dependencies | No | The paper mentions using "GPT-4o" for sub-action generation and the "SigLIP so400m" model for feature encoding, but does not provide specific version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | We use SigLIP so400m (patch size 14, d = 384, 878M parameters). We apply moving-average smoothing with a fixed window size of w = 30 frames (1 s @ 30 fps) to reduce transient noise and emphasize consistent motion patterns. Prompt variants: Short-fixed (two-word phrases, fixed at 10 sub-actions) and Context-rich (variable-length, context-rich descriptions). |
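The moving-average smoothing described in the experiment setup (window w = 30 frames, i.e. 1 s at 30 fps) can be sketched as below. This is a minimal illustrative implementation, not the authors' code; the function name, the (T, K) array layout for per-frame similarity scores, and the use of NumPy are all assumptions.

```python
import numpy as np

def smooth_moving_average(sim, window=30):
    """Smooth per-frame scores along the time axis (illustrative sketch).

    sim    : (T, K) array of per-frame similarity scores, T frames,
             K candidate sub-actions. Layout is an assumption.
    window : 30 frames = 1 s at 30 fps, matching the paper's setting.
    """
    kernel = np.ones(window) / window
    # mode="same" keeps the output length equal to T (zero-padded edges).
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, sim
    )
```

With zero-padded edges, scores near the start and end of the clip are slightly attenuated; a sliding window this size trades per-frame responsiveness for stability over roughly one second of motion.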