InstructOCR: Instruction Boosting Scene Text Spotting
Authors: Chen Duan, Qianyi Jiang, Pei Fu, Jiamin Chen, Shengxi Li, Zining Wang, Shan Guo, Junfeng Luo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. |
| Researcher Affiliation | Collaboration | Chen Duan1, Qianyi Jiang1, Pei Fu1*, Jiamin Chen2, Shengxi Li1, Zining Wang1, Shan Guo1, Junfeng Luo1 1Meituan 2Xi'an Jiaotong University Email: {chenjm}@stu.xjtu.edu.cn |
| Pseudocode | No | The paper describes the architecture (Image Encoder, Text Encoder, Decoder) and instruction generation process in detail, but it does not include any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/ChenD-VL/InstructOCR |
| Open Datasets | Yes | In our experiments, we evaluate our method on Total-Text (Ch'ng and Chan 2017), ICDAR2015 (Karatzas et al. 2015), and ICDAR2013 (Karatzas, Shafait et al. 2013). VQA for Scene Text. Scene text VQA involves answering questions about natural scene images or reasoning about the scene text. TextVQA (Singh, Natarajan et al. 2019) contains 45,336 questions on 28,408 images that require reasoning about text to answer. ST-VQA comprises 23,038 images sourced from a combination of public datasets (Biten, Tito et al. 2019) |
| Dataset Splits | Yes | Total-Text is an arbitrarily shaped word-level scene text benchmark, with 1,255 training images and 300 testing images. ICDAR2015 contains 1,000 training images and 500 testing images for quadrilateral scene text. ICDAR2013 contains 229 training images and 233 testing images with horizontal text. |
| Hardware Specification | Yes | The entire model is distributively trained on 32 NVIDIA A100-80G GPUs. |
| Software Dependencies | No | The paper mentions using ResNet50 and BERT architectures but does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x). |
| Experiment Setup | Yes | The Transformer encoder and decoder consist of 6 layers with 8 heads. The max length of recognition queries is 25 and the maximum number of objects is 60. We use a batch size of 320, and the pretrain model is trained for 200 epochs, with an initial 5-epoch warm-up phase. We use the AdamW optimizer with a learning rate of 4.4×10⁻⁴. The input image's short side is randomly resized to a range from 704 to 1024 (intervals of 32), and the maximum image length is set to 1024. Subsequently, the model is trained for another 40 epochs with a fixed learning rate of 1×10⁻⁴, and the maximum image length is set to 1600. Then, instructions are added, and the model is further trained for another 50 epochs. For the scene text spotting task, the model is fine-tuned on the corresponding real datasets for another 140 epochs with a fixed learning rate of 1×10⁻⁵. For the scene text VQA task, the model is fine-tuned on the TextVQA and ST-VQA datasets for another 120 epochs. At the inference stage, we resize the image's maximum length to be shorter than 1920 pixels. |
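
The multi-stage schedule in the Experiment Setup row can be sketched as a plain-Python config. This is a hedged illustration, not the authors' code: stage names and the dict layout are assumptions, while the epoch counts, learning rates, and image lengths come from the reported setup. The learning rate of the instruction stage is not stated in the excerpt, so it is left as `None`.

```python
# Hypothetical encoding of the reported training schedule.
# Numbers (epochs, lr, image lengths) are from the paper's setup text;
# stage names and structure are illustrative assumptions.
STAGES = [
    {"name": "pretrain",          "epochs": 200, "lr": 4.4e-4, "max_img_len": 1024, "warmup_epochs": 5},
    {"name": "longer_images",     "epochs": 40,  "lr": 1e-4,   "max_img_len": 1600},
    {"name": "add_instructions",  "epochs": 50,  "lr": None,   "max_img_len": 1600},  # lr unspecified
    {"name": "finetune_spotting", "epochs": 140, "lr": 1e-5,   "max_img_len": 1600},
    # For the VQA task the final stage is instead 120 epochs on TextVQA + ST-VQA.
]

def lr_at(stage, epoch):
    """Linear warm-up over the stage's warm-up window, then the stage lr."""
    warmup = stage.get("warmup_epochs", 0)
    if stage["lr"] is not None and epoch < warmup:
        return stage["lr"] * (epoch + 1) / warmup
    return stage["lr"]
```

Under this reading, the spotting pipeline totals 430 epochs (200 + 40 + 50 + 140), with the 5-epoch warm-up applying only to pretraining.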