GLOMA: Global Video Text Spotting with Morphological Association
Authors: Han Wang, Yanjie Wang, Yang Li, Can Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To prove the effectiveness of the proposed method, we conduct extensive experiments on several datasets and achieve state-of-the-art performance. On ICDAR2015 video Karatzas et al. (2015) dataset, our GLOMA obtains 56.0 MOTA on the test split, with 4.6 absolute improvement compared with the previous SOTA method Wu et al. (2022b), and outperforms the previous Transformer-based method Wu et al. (2022a) by 8.3 MOTA. On the ICDAR2013 Karatzas et al. (2013) video and Minetto Minetto et al. (2011) datasets, our GLOMA also reaches leading performance. Our GLOMA can run at around 20 FPS and the global association procedure takes 3.6 ms per frame on a single Tesla V100 GPU. |
| Researcher Affiliation | Industry | Han Wang Bytedance Yanjie Wang Bytedance Yang Li Bytedance Can Huang Bytedance |
| Pseudocode | No | The paper describes methods and procedures using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about making the source code available, nor does it provide a link to a code repository in the main text or supplementary materials. |
| Open Datasets | Yes | We conduct extensive experiments on several datasets... On ICDAR2015 video Karatzas et al. (2015) dataset... On the ICDAR2013 Karatzas et al. (2013) video and Minetto Minetto et al. (2011) datasets... We first pretrain the model on COCOText Veit et al. (2016). |
| Dataset Splits | Yes | ICDAR2015 video contains 25 clips for training and 24 clips for testing. Most scenes are street views with tens of texts in one image. ICDAR2013 video is a sub-dataset of ICDAR2015 video. Minetto is a small dataset that contains 5 videos harvested outdoors. Without a training split, it is used as a test dataset in previous methods. |
| Hardware Specification | Yes | All experiments are conducted on Tesla V100 GPUs. Our GLOMA runs at around 20 FPS on a single Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions using architectures and frameworks such as ResNet-50, FPN layers, YOLOX, and Transformer, but does not provide specific version numbers for any software libraries, programming languages, or environments. |
| Experiment Setup | Yes | The architecture of the detection head is borrowed from YOLOX Ge et al. (2021) with an extra branch to regress the polygons. The tracking head is a lightweight architecture with only a one-layer Transformer. The batch size is fixed as 16 when training and random sampling within a clip is applied to make sure images in a batch are from the same video clip. For the detection head, we adopt L1 loss to regress the 4-point polygons and other losses are set the same as losses in YOLOX Ge et al. (2021). For the recognition head, we adopt Connectionist Temporal Classification (CTC) Graves et al. (2006) loss for texts. We also apply multi-task learning losses. The whole losses are written as: $\ell = e^{-\sigma_1}\ell_{det} + e^{-\sigma_2}\ell_{rec} + e^{-\sigma_3}\ell_{track} + \sigma_1 + \sigma_2 + \sigma_3$ (Eq. 9), where $\sigma_1, \sigma_2, \sigma_3$ are learnable parameters. We adopt 8 as the default sliding window size. During inference, we resize the images with the shorter side fixed and the ratio of images kept. α is a hyper-parameter (for Wasserstein distance). |
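The multi-task loss quoted in the Experiment Setup row combines detection, recognition, and tracking terms with learnable weighting parameters. A minimal sketch of that weighting is below, assuming the standard uncertainty-weighting form with negative exponents (the paper's rendered equation loses the superscripts, so the sign of the exponent is an assumption here; the function name and plain-Python form are illustrative, not from the paper):

```python
import math

def multitask_loss(l_det, l_rec, l_track, sigmas):
    """Uncertainty-weighted sum of the three task losses (cf. Eq. 9).

    sigmas: (sigma1, sigma2, sigma3), learnable scalars in the paper;
    the e^{-sigma} weighting is an assumption based on the common
    uncertainty-weighting formulation, since the superscripts are
    garbled in the extracted text.
    """
    s1, s2, s3 = sigmas
    return (math.exp(-s1) * l_det
            + math.exp(-s2) * l_rec
            + math.exp(-s3) * l_track
            + s1 + s2 + s3)  # additive sigma terms regularize the weights

# With all sigmas at 0, the weights are 1 and the loss is the plain sum.
print(multitask_loss(1.0, 2.0, 3.0, (0.0, 0.0, 0.0)))  # 6.0
```

The additive `sigma` terms keep the learnable weights from collapsing to zero: shrinking a weight `e^{-sigma}` requires growing `sigma`, which is penalized directly.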