ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Authors: Amir Aghdam, Vincent Tao Hu, Björn Ommer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on the highly challenging and diverse ActionAtlas (Salehi et al., 2024) benchmark, our method achieves state-of-the-art performance, outperforming both CLIP-style baselines and billion-parameter video-language models. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding. |
| Researcher Affiliation | Academia | Amir Aghdam EMAIL Department of Computer Science, Temple University, Philadelphia, PA, USA; Vincent Tao Hu EMAIL CompVis @ LMU Munich, Munich Center for Machine Learning, Munich, Germany; Björn Ommer EMAIL CompVis @ LMU Munich, Munich Center for Machine Learning, Munich, Germany |
| Pseudocode | No | The paper describes the methodology using textual descriptions and diagrams, but does not include a distinct pseudocode block or algorithm listing. |
| Open Source Code | Yes | Code Link: https://amir-aghdam.github.io/act-align/ |
| Open Datasets | Yes | To rigorously evaluate its generality, we adopt ActionAtlas (Salehi et al., 2024), to our knowledge the most diverse and challenging benchmark for fine-grained action recognition across various domains. |
| Dataset Splits | No | In our zero-shot setting, no video examples of the target classes Y are used for training or tuning; only high-level action names c_j are provided. For each video V_i, we retain its multiple-choice candidate set {c_{i,1}, . . . , c_{i,M_i}}, and replace each class label with an LLM-generated sub-action sequence. (Dataset statistics are provided in the Appendix.) |
| Hardware Specification | Yes | All experiments run on a single NVIDIA RTX A5000 GPU (24 GB). |
| Software Dependencies | No | The paper mentions using "GPT-4o" for sub-action generation and the "SigLIP so400m" model for feature encoding, but does not provide specific version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | We use SigLIP so400m (patch size 14, d = 384, 878M parameters). We apply moving-average smoothing with a fixed window size of w = 30 frames (1 s @ 30 fps) to reduce transient noise and emphasize consistent motion patterns. Prompt variants: Short-fixed (two-word phrases, fixed at 10 sub-actions) and Context-rich (variable-length, context-rich descriptions). |
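The moving-average smoothing described in the experiment setup (window w = 30 frames, i.e. 1 s at 30 fps) can be sketched as below. This is a minimal illustrative implementation, not the authors' code; the function name, the (T, K) array layout for per-frame similarity scores, and the use of NumPy are all assumptions.

```python
import numpy as np

def smooth_moving_average(sim, window=30):
    """Smooth per-frame scores along the time axis (illustrative sketch).

    sim    : (T, K) array of per-frame similarity scores, T frames,
             K candidate sub-actions. Layout is an assumption.
    window : 30 frames = 1 s at 30 fps, matching the paper's setting.
    """
    kernel = np.ones(window) / window
    # mode="same" keeps the output length equal to T (zero-padded edges).
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, sim
    )
```

With zero-padded edges, scores near the start and end of the clip are slightly attenuated; a sliding window this size trades per-frame responsiveness for stability over roughly one second of motion.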