Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark

Authors: Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Pengfei Zhu, Qinghua Hu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on three crowd datasets and our Drone Bird validate our superiority over the counterparts. The code and dataset are available.¹"
Researcher Affiliation | Academia | Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Qinghua Hu, Pengfei Zhu — Tianjin University — {caobing,luquanhao,fengjiekang,qlwang,huqinghua,zhupengfei}@tju.edu.cn
Pseudocode | Yes | "Algorithm 1: Framework Workflow in Training Phase"; "Algorithm 2: DEMO workflow in training phase"
Open Source Code | Yes | "The code and dataset are available.¹" (¹ https://github.com/mast1ren/E-MAC)
Open Datasets | Yes | "We first propose a large video bird counting dataset, Drone Bird, in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and our Drone Bird validate our superiority over the counterparts. The code and dataset are available.¹" (¹ https://github.com/mast1ren/E-MAC) "We conduct experiments on our Drone Bird dataset and three video object counting datasets: Fudan-ShanghaiTech (FDST) (Fang et al., 2019), Mall (Loy et al., 2013), and VSCrowd (Li et al., 2022)."
Dataset Splits | Yes | "We cut the 40 videos in the train and test sets to 500 frames per video (around 17 s), and cut the 10 videos in the validation set to 150 frames per video (around 5 s) to accomplish a reasonable data division. After division, the train, test, and validation sets contain 15,000, 5,000, and 1,500 frames, respectively. For the Mall dataset, we follow the previous works (Bai & Chan, 2021; Hossain et al., 2020) for a fair comparison. The model is trained with the first 800 frames of the Mall dataset, and the remaining 1,200 frames are used as the test set."
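The quoted frame counts determine the video-level split implicitly: the sketch below checks that the numbers are mutually consistent, assuming the 40 train/test videos divide 30/10 (inferred from 15,000 / 500 and 5,000 / 500; the passage does not state this split explicitly).

```python
# Sanity check of the Drone Bird split described above.
# 30/10 is an inferred division of "the 40 videos in the train and test sets".
train_videos, test_videos, val_videos = 30, 10, 10
frames_train_test = 500   # frames per train/test video after cutting
frames_val = 150          # frames per validation video after cutting

train_frames = train_videos * frames_train_test
test_frames = test_videos * frames_train_test
val_frames = val_videos * frames_val

assert train_frames == 15_000
assert test_frames == 5_000
assert val_frames == 1_500
assert train_videos + test_videos == 40  # matches the quoted video count
```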
Hardware Specification | Yes | "Our experiments are conducted on a Huawei Atlas 800 Training Server with CANN and an NVIDIA RTX 3090 GPU."
Software Dependencies | No | The paper mentions a "Huawei Atlas 800 Training Server with CANN" and an "NVIDIA RTX 3090 GPU". While CANN is a software stack, no version numbers are given for CANN or for any other key software dependency (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "For hyperparameter settings, the model employs a linear learning rate warm-up for the first 15 epochs, followed by a cosine decay learning rate. The weight decay of AdamW is set to 0.05, and layer decay is set to 0.75 for the encoder. The mask ratio is 0.72. ... The probability P for spatial adaptive masking is set to 0.2. The trade-off parameters λ1, λ2, λ3, λ4 are set to 10, 10, 1, and 20, respectively. The input images are set to the size of 448 × 640 and the batch size is set to 3. We construct the same architecture as that in the comparison experiments and train for 200 epochs."
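The quoted warm-up/cosine-decay schedule can be sketched in plain Python; only the shape (15 warm-up epochs, cosine decay over 200 epochs) comes from the paper, while `BASE_LR` is an assumed placeholder since no base learning rate is quoted.

```python
import math

# Schedule quoted above: linear warm-up for 15 epochs, then cosine decay.
EPOCHS = 200
WARMUP_EPOCHS = 15
BASE_LR = 1e-4  # assumption for illustration; the excerpt does not state the base LR

def learning_rate(epoch: int) -> float:
    """Per-epoch learning rate: linear warm-up, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Such a schedule reaches the base rate exactly at the end of warm-up and decays smoothly to near zero by epoch 200, matching the "linear warm-up ... followed by a cosine decay" description.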