Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance
Authors: Jiahao Lyu, Wei Wang, Dongbao Yang, Jinwen Zhong, Yu Zhou
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show LSGSpotter achieves state-of-the-art performance on Inverse Text, a benchmark specific to arbitrary reading order, and superior performance on English benchmarks for arbitrarily shaped text, with improvements of 0.7% and 2.5% on Total-Text and SCUT-CTW1500, respectively (81.5% on Total-Text and 68.9% on SCUT-CTW1500 without the help of a lexicon). These results validate that the text spotter is effective for scene text in arbitrary reading order and shape. |
| Researcher Affiliation | Collaboration | 1Institute of Information Engineering, Chinese Academy of Sciences 2VCIP & TMCC & DISSec, College of Computer Science, Nankai University 3Shanghai Artificial Intelligence Laboratory 4School of Cyber Security, University of Chinese Academy of Sciences |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing code or links to a code repository. |
| Open Datasets | Yes | Following the settings of previous works, we pre-train our model on SynthText-150k, MLT-2017 (Nayef et al. 2017), ICDAR2013 (Karatzas et al. 2013), ICDAR2015 (Karatzas et al. 2015), TextOCR (Singh et al. 2021) and Total-Text for 600k iterations |
| Dataset Splits | No | The paper mentions pre-training on a list of datasets and fine-tuning on the "training split of the target benchmark" but does not provide specific percentages, sample counts, or detailed splitting methodology for these datasets. |
| Hardware Specification | Yes | The entire model is trained on 4 NVIDIA RTX3090 GPUs with a batch size of 4 on the single GPU. |
| Software Dependencies | No | The paper mentions using "ResNet50 (He et al. 2016) with deformable convolution module (Dai et al. 2017) for the backbone and the 6-layer Transformer decoder" but does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We pre-train our model on SynthText-150k, MLT-2017 (Nayef et al. 2017), ICDAR2013 (Karatzas et al. 2013), ICDAR2015 (Karatzas et al. 2015), TextOCR (Singh et al. 2021) and Total-Text for 600k iterations, optimized by AdamW with a learning rate of 2e-4 and a weight decay of 1e-4. After pre-training, the model is fine-tuned on the training split of the target benchmark for 200 epochs. The initial learning rate is 1e-4 and decays to 1e-5 at the 60th epoch. The entire model is trained on 4 NVIDIA RTX3090 GPUs with a batch size of 4 per GPU. In addition, we use ResNet50 (He et al. 2016) with a deformable convolution module (Dai et al. 2017) as the backbone and a 6-layer Transformer decoder for the auto-regressive stage. During training, the short side of an input image is resized and padded to 960. Random cropping and rotation are employed for data augmentation. During inference, we resize the short edge to 960 while keeping the long side shorter than 1600 at a fixed aspect ratio. |
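The resizing rule and learning-rate schedule quoted above can be made concrete. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the long-side cap behavior (rescaling to keep the long edge at most 1600 while preserving aspect ratio) is an assumption based on the paper's description.

```python
def inference_resize(w, h, short=960, max_long=1600):
    """Scale so the short edge becomes `short` (per the paper's inference
    setup), then cap the long edge at `max_long` at a fixed aspect ratio.
    Hypothetical helper; the paper does not give the exact rounding rule."""
    scale = short / min(w, h)
    if max(w, h) * scale > max_long:
        scale = max_long / max(w, h)
    return round(w * scale), round(h * scale)


def finetune_lr(epoch):
    """Fine-tuning schedule as quoted: initial LR 1e-4, decayed to 1e-5
    at the 60th of 200 epochs."""
    return 1e-4 if epoch < 60 else 1e-5
```

For example, a 1920x1080 image would first scale to short edge 960 (long edge ~1707), exceeding the 1600 cap, so it is instead scaled to 1600x900.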