GLOMA: Global Video Text Spotting with Morphological Association
Authors: Han Wang, Yanjie Wang, Yang Li, Can Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To prove the effectiveness of the proposed method, we conduct extensive experiments on several datasets and achieve state-of-the-art performance. On ICDAR2015 video Karatzas et al. (2015) dataset, our GLOMA obtains 56.0 MOTA on the test split, with 4.6 absolute improvement compared with the previous SOTA method Wu et al. (2022b), and outperforms the previous Transformer-based method Wu et al. (2022a) by 8.3 MOTA. On the ICDAR2013 Karatzas et al. (2013) video and Minetto Minetto et al. (2011) datasets, our GLOMA also reaches leading performance. Our GLOMA can run at around 20 FPS and the global association procedure takes 3.6 ms per frame on a single Tesla V100 GPU. |
| Researcher Affiliation | Industry | Han Wang Bytedance Yanjie Wang Bytedance Yang Li Bytedance Can Huang Bytedance |
| Pseudocode | No | The paper describes methods and procedures using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about making the source code available, nor does it provide a link to a code repository in the main text or supplementary materials. |
| Open Datasets | Yes | We conduct extensive experiments on several datasets... On ICDAR2015 video Karatzas et al. (2015) dataset... On the ICDAR2013 Karatzas et al. (2013) video and Minetto Minetto et al. (2011) datasets... We first pretrain the model on COCOText Veit et al. (2016). |
| Dataset Splits | Yes | ICDAR2015 video contains 25 clips for training and 24 clips for testing. Most scenes are street views with tens of texts in one image. ICDAR2013 video is a sub-dataset of ICDAR2015 video. Minetto is a small dataset that contains 5 videos harvested outdoors. Without a training split, it is used as a test dataset in previous methods. |
| Hardware Specification | Yes | All experiments are conducted on Tesla V100 GPUs. Our GLOMA runs at around 20 FPS on a single Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions using architectures and frameworks such as ResNet-50, FPN layers, YOLOX, and Transformer, but does not provide specific version numbers for any software libraries, programming languages, or environments. |
| Experiment Setup | Yes | The architecture of the detection head is borrowed from YOLOX Ge et al. (2021) with an extra branch to regress the polygons. The tracking head is a lightweight architecture with only a one-layer Transformer. The batch size is fixed as 16 when training and random sampling within a clip is applied to make sure images in a batch are from the same video clip. For the detection head, we adopt L1 loss to regress the 4-point polygons and other losses are set the same as losses in YOLOX Ge et al. (2021). For the recognition head, we adopt Connectionist Temporal Classification (CTC) Graves et al. (2006) loss for texts. We also apply multi-task learning losses. The whole losses are written as: $\ell = e^{-\sigma_1}\ell_{det} + e^{-\sigma_2}\ell_{rec} + e^{-\sigma_3}\ell_{track} + \sigma_1 + \sigma_2 + \sigma_3$ (Eq. 9), where $\sigma_1, \sigma_2, \sigma_3$ are learnable parameters. We adopt 8 as the default sliding window size. During inference, we resize the images with the shorter side fixed and the ratio of images kept. α is a hyper-parameter (for Wasserstein distance). |
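The multi-task loss quoted in the Experiment Setup row combines detection, recognition, and tracking terms with learnable weighting parameters. A minimal sketch of that weighting is below, assuming the standard uncertainty-weighting form with negative exponents (the paper's rendered equation loses the superscripts, so the sign of the exponent is an assumption here; the function name and plain-Python form are illustrative, not from the paper):

```python
import math

def multitask_loss(l_det, l_rec, l_track, sigmas):
    """Uncertainty-weighted sum of the three task losses (cf. Eq. 9).

    sigmas: (sigma1, sigma2, sigma3), learnable scalars in the paper;
    the e^{-sigma} weighting is an assumption based on the common
    uncertainty-weighting formulation, since the superscripts are
    garbled in the extracted text.
    """
    s1, s2, s3 = sigmas
    return (math.exp(-s1) * l_det
            + math.exp(-s2) * l_rec
            + math.exp(-s3) * l_track
            + s1 + s2 + s3)  # additive sigma terms regularize the weights

# With all sigmas at 0, the weights are 1 and the loss is the plain sum.
print(multitask_loss(1.0, 2.0, 3.0, (0.0, 0.0, 0.0)))  # 6.0
```

The additive `sigma` terms keep the learnable weights from collapsing to zero: shrinking a weight `e^{-sigma}` requires growing `sigma`, which is penalized directly.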