DocMamba: Efficient Document Pre-training with State Space Model

Authors: Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage.
Researcher Affiliation Collaboration 1NERC-SLIP, University of Science and Technology of China 2iFLYTEK Research EMAIL, EMAIL, EMAIL
Pseudocode No The paper includes figures illustrating the framework and scan strategy (Figure 2 and Figure 3) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/Pengfei-Hu/DocMamba
Open Datasets Yes We select several datasets to evaluate the performance of DocMamba, including FUNSD (Jaume, Ekenel, and Thiran 2019), CORD (Park et al. 2019), SROIE (Huang et al. 2019) and HRDoc (Ma et al. 2023). We use 10 million pages from the IIT-CDIP Test Collection 1.0 (Lewis et al. 2006), a large-scale scanned document image dataset, to pre-train DocMamba.
Dataset Splits Yes FUNSD. The FUNSD dataset is a noisy scanned document dataset for form understanding, containing 149 training samples and 50 testing samples. CORD. The CORD dataset is used for key information extraction from receipts, comprising 800 training samples, 100 validation samples, and 100 test samples. SROIE. The SROIE dataset is another receipt understanding dataset, consisting of 626 training receipts and 347 test receipts. HRDoc. The HRDoc dataset is designed for the hierarchical reconstruction of academic document structures. We use the HRDoc-Hard subset, which includes 1,000 training documents and 500 testing documents.
Hardware Specification Yes Pretraining is conducted on 8 Tesla A40 48GB GPUs.
Software Dependencies No The paper mentions "PaddleOCR" but does not specify a version. It does not list specific versions for other software dependencies such as the programming languages or libraries used for implementation.
Experiment Setup Yes DocMamba employs a 24-layer bidirectional Mamba encoder with a hidden size of 768 and an intermediate size of 1,536. For the SSM within each layer, we use the default hyperparameters from Mamba (Gu and Dao 2023), setting the state dimension to 16. The constant k for computing the varying batch size of a single GPU is 20,480. For example, the batch size is set to 40 for an input length of 512. For the MLM task, following the settings in BERT (Devlin et al. 2018), we randomly mask 15% of all input tokens. Out of these, 80% are replaced by [MASK], 10% are replaced by random tokens from the vocabulary, and 10% remain unchanged. DocMamba is pre-trained using the Adam optimizer (Kingma and Ba 2014) with a learning rate of 5 × 10⁻⁵ for 500,000 steps. The learning rate is warmed up over the first 10% of steps and then linearly decayed. Fine-tuning. We treat FUNSD, CORD, and SROIE as sequential labeling tasks, using BIO tags for each entity field. We use the officially provided images and OCR annotations and build a dropout layer and a linear layer above the output representations. DocMamba is fine-tuned on these datasets for 1,000 steps with a learning rate of 2 × 10⁻⁵ and a batch size of 16. For HRDoc, we directly predict the categories for each unit, using a learning rate of 2 × 10⁻⁵ and a batch size of 48 for 2,000 steps.
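Two of the reported training details are easy to make concrete: the varying per-GPU batch size (batch = k / sequence length with k = 20,480, giving 40 at length 512) and BERT-style MLM masking (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% unchanged). The sketch below is ours, not from the DocMamba codebase; the function names, the [MASK] id, and the vocabulary size are assumptions borrowed from BERT conventions.

```python
import random

K = 20_480  # constant from the paper for the per-GPU batch size rule

def batch_size_for(seq_len: int, k: int = K) -> int:
    """Per-GPU batch size keeping tokens-per-batch roughly constant."""
    return k // seq_len

MASK_ID = 103          # assumed [MASK] token id (BERT convention)
VOCAB_SIZE = 30_522    # assumed vocabulary size (BERT convention)

def mask_tokens(token_ids, rng=random):
    """BERT-style MLM corruption.

    Returns (masked_ids, labels): 15% of positions are selected for
    prediction; of those, 80% -> [MASK], 10% -> a random vocabulary
    token, 10% left unchanged. Unselected positions get label -100
    (conventionally ignored by the cross-entropy loss).
    """
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < 0.15:
            labels.append(tok)  # model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))
            else:
                masked.append(tok)
        else:
            labels.append(-100)  # ignored by the loss
            masked.append(tok)
    return masked, labels
```

With these definitions, `batch_size_for(512)` gives 40, matching the worked example in the setup description; longer inputs get proportionally smaller batches so the token count per step stays near k.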