Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Authors: Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Pengfei Zhu, Qinghua Hu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three crowd datasets and our Drone Bird validate our superiority against the counterparts. The code and dataset are available. |
| Researcher Affiliation | Academia | Bing Cao, Quanhao Lu, Jiekang Feng, Qilong Wang, Qinghua Hu, Pengfei Zhu — Tianjin University. {caobing,luquanhao,fengjiekang,qlwang,huqinghua,zhupengfei}@tju.edu.cn |
| Pseudocode | Yes | Algorithm 1 Framework Workflow in Training Phase Algorithm 2 DEMO workflow in training phase |
| Open Source Code | Yes | The code and dataset are available at https://github.com/mast1ren/E-MAC |
| Open Datasets | Yes | We first propose a large video bird counting dataset, Drone Bird, in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and our Drone Bird validate our superiority against the counterparts. The code and dataset are available at https://github.com/mast1ren/E-MAC. We conduct experiments on our Drone Bird dataset and three video object counting datasets: Fudan-ShanghaiTech (FDST) (Fang et al., 2019), Mall (Loy et al., 2013) and VSCrowd (Li et al., 2022). |
| Dataset Splits | Yes | We cut the 40 videos in the train and test sets to 500 frames per video (around 17s), and cut the 10 videos in the validate set to 150 frames per video (around 5s) to accomplish a reasonable data division. The train set, test set and validate set after the division is completed contain 15,000 frames, 5,000 frames and 1,500 frames, respectively. For the Mall dataset, we follow the previous works (Bai & Chan, 2021; Hossain et al., 2020) for a fair comparison. The model is trained with the first 800 frames of the Mall dataset, and the remaining 1,200 frames are used as the test set. |
| Hardware Specification | Yes | Our experiments are conducted on Huawei Atlas 800 Training Server with CANN and NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions 'Huawei Atlas 800 Training Server with CANN' and 'NVIDIA RTX 3090 GPU'. While CANN is a software stack, no specific version numbers for CANN or any other key software libraries/frameworks (like Python, PyTorch, CUDA, etc.) are provided. |
| Experiment Setup | Yes | For hyperparameter settings, the model employs a linear learning rate warm-up for the first 15 epochs, followed by a cosine decay learning rate. The weight decay of AdamW is set to 0.05, and layer decay is set to 0.75 for the encoder. The mask ratio is 0.72. ... The probability P for spatial adaptive masking is set to 0.2. The trade-off parameters λ1, λ2, λ3, λ4 are set to 10, 10, 1, and 20, respectively. The input images are set to the size of 448×640 and the batch size is set to 3. We construct the same architecture as that in comparison experiments and trained for 200 epochs. |
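The quoted setup names a concrete schedule (linear warm-up for 15 of 200 epochs, then cosine decay) plus per-layer learning-rate decay of 0.75 for the encoder. A minimal sketch of those two rules follows; the base learning rate, minimum learning rate, and encoder depth are illustrative assumptions, as the excerpt does not state them:

```python
import math

def lr_at_epoch(epoch, total_epochs=200, warmup_epochs=15,
                base_lr=1e-4, min_lr=0.0):
    """Linear warm-up then cosine decay, per the quoted schedule.

    base_lr and min_lr are assumed values, not taken from the paper.
    """
    if epoch < warmup_epochs:
        # Linear ramp from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def layerwise_scale(layer_idx, num_layers=12, layer_decay=0.75):
    """Layer-wise LR multiplier: deeper (later) encoder layers decay less.

    num_layers=12 is an assumed encoder depth for illustration.
    """
    return layer_decay ** (num_layers - layer_idx)
```

Under this convention the topmost encoder layer keeps the full learning rate (`layerwise_scale(12) == 1.0`) while earlier layers are scaled down geometrically by 0.75 per layer, matching the stated layer decay.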