MaskOCR: Scene Text Recognition with Masked Vision-Language Pre-training

Authors: Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We validate the effectiveness of our proposed approach on Chinese and English text images through extensive experiments and detailed discussion. Our proposed method achieves state-of-the-art performance and significantly surpasses previous methods, particularly on Chinese benchmarks." (See also Section 4: Experiment; 4.1 Datasets; 4.2 Implementation Details; 4.3 Ablation Studies; 4.4 Comparison with State-of-the-art Methods.)
Researcher Affiliation: Industry. Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang — VIS, Baidu Inc.
Pseudocode: No. The paper describes the model architecture and processes in text (Sections 3.1, 3.2, 3.3) and with supporting diagrams (Figures 2, 3, and 4), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions external tools such as Text Render (https://github.com/oh-my-ocr/text_renderer) and PaddlePaddle/PaddleOCR, but does not state that the authors' own implementation of the described methodology is publicly available.
Open Datasets: Yes. "Chinese text line images. The pre-training set consists of 100 million unlabeled text line images collected from practical scenarios for visual pre-training... We collect text corpus from a Chinese corpus, and generate 100 million images with 64 commonly used fonts using Text Render. ... We first pre-train the encoder and decoder serially on the collected real images and the synthetic images, and then finetune our model on a large-scale Chinese text image benchmark, BCTR (Chen et al., 2021b). ... English text word images. We follow Yang et al. (2022) and collect about 15.8 million unlabeled English word images from CC-OCR (Yang et al., 2021) for visual pre-training. In addition, we also synthesize 100 million English word images for language pre-training. Similarly, we collect corpus from WikiText-103 (Merity et al., 2017) and generate synthetic images with Text Render and 10 commonly used English fonts. Following Shi et al. (2019), Yu et al. (2020), Fang et al. (2021), Wang et al. (2021c), and Zhang et al. (2022), two synthetic datasets, MJSynth (Jaderberg et al., 2014a) and SynthText (Gupta et al., 2016), are used for the training of downstream recognition tasks. Besides, we also collect 2.78 million real labeled images from TextOCR (Singh et al., 2021) and Open Images Dataset v5, as in Yang et al. (2022)."
Dataset Splits: Yes. "We first pre-train the encoder and decoder serially on the collected real images and the synthetic images, and then finetune our model on a large-scale Chinese text image benchmark, BCTR (Chen et al., 2021b). BCTR consists of four subsets (scene, web, document, and handwriting) and provides 1.4 million fully labeled images in total. ... We evaluate our model on six public scene text datasets: ICDAR 2013 (IC13; Karatzas et al., 2013), Street View Text (SVT; Wang et al., 2011), IIIT5K-Words (IIIT5K; Mishra et al., 2012), ICDAR 2015 (IC15; Karatzas et al., 2015), Street View Text-Perspective (SVTP; Phan et al., 2013), and CUTE80 (CUTE; Risnumawan et al., 2014). ... On the six English scene text datasets, we follow Shi et al. (2019), Yu et al. (2020), Fang et al. (2021), Wang et al. (2021c), and Zhang et al. (2022) and evaluate the recognition performance of our model with case-insensitive word accuracy."
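The case-insensitive word accuracy protocol quoted above is straightforward to state precisely. The sketch below is illustrative (the function name and interface are ours, not from the paper): a prediction counts as correct only if the whole predicted string matches the label, with both sides lowercased.

```python
def word_accuracy(predictions, ground_truths, case_sensitive=False):
    """Fraction of images whose predicted word exactly matches the label.

    Standard scene-text protocol: whole-string match per image;
    case-insensitive matching lowercases both prediction and label.
    """
    assert len(predictions) == len(ground_truths)
    correct = 0
    for pred, gt in zip(predictions, ground_truths):
        if not case_sensitive:
            pred, gt = pred.lower(), gt.lower()
        correct += (pred == gt)
    return correct / len(predictions)

# "House" vs "house" counts as correct under the case-insensitive protocol,
# while the character error in "st0re" makes the whole word wrong.
acc = word_accuracy(["House", "st0re", "cafe"], ["house", "store", "cafe"])
print(f"{acc:.3f}")  # 2 of 3 correct -> 0.667
```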
Hardware Specification: Yes. "All experiments are conducted on 8 A100 GPUs with the ViT-B as the encoder."
Software Dependencies: No. The paper mentions using the AdamW optimizer and cosine learning rate decay, but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup: Yes. "We train the encoder-decoder transformer with the AdamW optimizer (Loshchilov & Hutter, 2019), cosine learning rate decay (Loshchilov & Hutter, 2017), a weight decay of 0.05, a drop path ratio of 0.1, and a batch size of 512. When the model is trained from scratch, the learning rate is set to 1e-3. Otherwise, the model is optimized with an initial learning rate of 1e-4. We set the training epochs as 120 and 20 for the Chinese text line recognition model and the English word recognition model, with warm-ups of 5 epochs and 0.5 epochs respectively. ... By default, the base_lr is set to 1.5e-4 with cosine learning rate decay and a 0.5-epoch warm-up. We train the encoder for 10 epochs and 30 epochs for Chinese data and English data pre-training, with a batch size of 4096. ... We pre-train the decoder for 5 epochs with a batch size of 512, an initial learning rate of 1e-4, a 0.5-epoch warm-up, and cosine learning rate decay. ... We resize the height of the input image to 32 with the aspect ratio kept and pad the width of the input images to 400. For the English word samples, we directly resize all input images to 32 × 128. We set the width of the split vertical patch to 4 for all datasets by default. During the training of downstream recognition, some data augmentations like rotation, distortion, and color jitter are also used."
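The learning-rate schedule described above (a short linear warm-up followed by cosine decay) can be sketched in a few lines. This is a generic, assumption-laden implementation, not the authors' code; the constants below only echo the quoted defaults (base_lr = 1.5e-4, 0.5-epoch warm-up), and the step counts are made up for illustration.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warm-up from ~0 to base_lr, then cosine decay to 0.

    Generic schedule matching the paper's description at a high level;
    step counts and end value are our assumptions, not the authors' spec.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

base_lr = 1.5e-4            # quoted encoder pre-training default
total, warmup = 10_000, 500  # hypothetical: warm-up ~ 0.5 epoch of steps
print(lr_at(0, total, warmup, base_lr))       # tiny value at the first step
print(lr_at(warmup, total, warmup, base_lr))  # reaches base_lr after warm-up
print(lr_at(total, total, warmup, base_lr))   # decays to 0 at the final step
```

In practice this is what e.g. a PyTorch `LambdaLR` wrapper around AdamW would compute per step; the closed form makes the warm-up/decay boundary easy to verify.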