InstructOCR: Instruction Boosting Scene Text Spotting
Authors: Chen Duan, Qianyi Jiang, Pei Fu, Jiamin Chen, Shengxi Li, Zining Wang, Shan Guo, Junfeng Luo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. |
| Researcher Affiliation | Collaboration | Chen Duan1, Qianyi Jiang1, Pei Fu1*, Jiamin Chen2, Shengxi Li1, Zining Wang1, Shan Guo1, Junfeng Luo1 1Meituan 2Xi'an Jiaotong University Email: {chenjm}@stu.xjtu.edu.cn |
| Pseudocode | No | The paper describes the architecture (Image Encoder, Text Encoder, Decoder) and instruction generation process in detail, but it does not include any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/ChenD-VL/InstructOCR |
| Open Datasets | Yes | In our experiments, we evaluate our method on Total-Text (Ch'ng and Chan 2017), ICDAR2015 (Karatzas et al. 2015), and ICDAR2013 (Karatzas, Shafait et al. 2013). VQA for Scene Text. Scene text VQA involves answering questions about natural scene images or reasoning about the scene text. TextVQA (Singh, Natarajan et al. 2019) contains 45,336 questions on 28,408 images that require reasoning about text to answer. ST-VQA comprises 23,038 images sourced from a combination of public datasets (Biten, Tito et al. 2019) |
| Dataset Splits | Yes | Total-Text is an arbitrarily shaped word-level scene text benchmark, with 1,255 training images and 300 testing images. ICDAR2015 contains 1,000 training images and 500 testing images for quadrilateral scene text. ICDAR2013 contains 229 training images and 233 testing images with horizontal text. |
| Hardware Specification | Yes | The entire model is distributively trained on 32 NVIDIA A100-80G GPUs. |
| Software Dependencies | No | The paper mentions using ResNet50 and BERT architectures but does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x). |
| Experiment Setup | Yes | The Transformer encoder and decoder consist of 6 layers with 8 heads. The max length of recognition queries is 25 and the maximum number of objects is 60. We use a batch size of 320, and the pretrain model is trained for 200 epochs, with an initial 5-epoch warm-up phase. We use the AdamW optimizer with a learning rate of 4.4×10⁻⁴. The input image's short side is randomly resized to a range from 704 to 1024 (intervals of 32), and the maximum image length is set to 1024. Subsequently, the model is trained for another 40 epochs with a fixed learning rate of 1×10⁻⁴, and the maximum image length is set to 1600. Then, instructions are added, and the model is further trained for another 50 epochs. For the scene text spotting task, the model is fine-tuned on the corresponding real datasets for another 140 epochs with a fixed learning rate of 1×10⁻⁵. For the scene text VQA task, the model is fine-tuned on the TextVQA and ST-VQA datasets for another 120 epochs. At the inference stage, we resize the image's maximum length to be shorter than 1920 pixels. |
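
The multi-stage schedule in the Experiment Setup row can be sketched as a plain-Python config. This is a hedged illustration, not the authors' code: stage names and the dict layout are assumptions, while the epoch counts, learning rates, and image lengths come from the reported setup. The learning rate of the instruction stage is not stated in the excerpt, so it is left as `None`.

```python
# Hypothetical encoding of the reported training schedule.
# Numbers (epochs, lr, image lengths) are from the paper's setup text;
# stage names and structure are illustrative assumptions.
STAGES = [
    {"name": "pretrain",          "epochs": 200, "lr": 4.4e-4, "max_img_len": 1024, "warmup_epochs": 5},
    {"name": "longer_images",     "epochs": 40,  "lr": 1e-4,   "max_img_len": 1600},
    {"name": "add_instructions",  "epochs": 50,  "lr": None,   "max_img_len": 1600},  # lr unspecified
    {"name": "finetune_spotting", "epochs": 140, "lr": 1e-5,   "max_img_len": 1600},
    # For the VQA task the final stage is instead 120 epochs on TextVQA + ST-VQA.
]

def lr_at(stage, epoch):
    """Linear warm-up over the stage's warm-up window, then the stage lr."""
    warmup = stage.get("warmup_epochs", 0)
    if stage["lr"] is not None and epoch < warmup:
        return stage["lr"] * (epoch + 1) / warmup
    return stage["lr"]
```

Under this reading, the spotting pipeline totals 430 epochs (200 + 40 + 50 + 140), with the 5-epoch warm-up applying only to pretraining.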