Graph-based Document Structure Analysis
Authors: Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that the model not only detect document elements but also generate spatial and logical relations in the form of a graph structure, allowing documents to be understood in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset (GraphDoc) with 80K document images and 4.13M relation annotations, enabling models to be trained for multiple tasks such as reading order prediction, hierarchical structure analysis, and complex inter-element relation inference. Furthermore, a document relation graph generator (DRGG) is proposed to address the gDSA task, achieving 57.6% at mAPg@0.5 as a strong baseline benchmark on this novel task and dataset. We hope this graphical representation of document structure marks an innovative advancement in document structure analysis and understanding. The new dataset and code will be made publicly available at GraphDoc. ... We conduct extensive experiments on the GraphDoc dataset and upstream DLA tasks, proving the effectiveness of the gDSA approach for document layout analysis. |
| Researcher Affiliation | Academia | Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen, CV:HCI Lab, Karlsruhe Institute of Technology ({firstname.lastname}@kit.edu) |
| Pseudocode | Yes | Algorithm 1: Relation Graph Evaluation Metrics mRg@TR and mAPg@TR for the gDSA Task |
| Open Source Code | No | The new dataset and code will be made publicly available at GraphDoc. |
| Open Datasets | Yes | To address these challenges, we propose a novel task called graph-based Document Structure Analysis (gDSA), which aims to not only detect document elements but also generate spatial and logical relations in the form of a graph structure. This approach allows for a more holistic and intuitive understanding of documents, akin to how humans perceive and interpret complex layouts. For this task, we introduce the GraphDoc dataset, a large-scale relation graph-based document structure analysis dataset comprising 80,000 document images and over 4 million relation annotations. GraphDoc includes annotations for both spatial relations (Up, Down, Left, Right) and logical relations (Parent, Child, Sequence, Reference) between document components, e.g., text, table, and picture. This rich relational information enables models to perform multiple tasks such as reading order prediction, hierarchical structure analysis, and complex inter-element relationship inference. ... Our GraphDoc dataset is primarily derived from the DocLayNet (Pfitzmann et al., 2022) dataset, which contains over 80,000 document page images spanning a diverse array of content types, including financial reports, user manuals, scientific papers, and legal regulations. We leveraged the existing detailed annotations and the PDF files offered through the DocLayNet dataset to create new annotations that focus specifically on the relations between various layout elements within the documents. Additionally, in accordance with the CDLA 1.0 license, users are permitted to modify and redistribute enhanced versions of datasets based on the DocLayNet dataset. |
| Dataset Splits | No | All experiments were conducted using the GraphDoc dataset for both training and validation. |
| Hardware Specification | Yes | In this work, all experiments were conducted on a computing cluster node equipped with four Nvidia A100 GPUs, each with 40 GB of memory. Each node was also equipped with 300 GB of CPU memory. |
| Software Dependencies | Yes | We implemented our method using PyTorch v1.10 and trained the model with the AdamW optimizer using a batch size of 4. |
| Experiment Setup | Yes | We implemented our method using PyTorch v1.10 and trained the model with the AdamW optimizer using a batch size of 4. The initial learning rate was set to 1×10⁻⁴, with a weight decay of 5×10⁻³. The AdamW hyperparameters, betas and epsilon, were configured to (0.9, 0.999) and 1×10⁻⁸, respectively. To enhance the model's robustness and accuracy, we employed a multi-scale training strategy. Specifically, the shorter side of each input image was randomly resized to one of the following lengths: 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, while ensuring that the longer side did not exceed 1333 pixels. |
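The gDSA task described above represents a page as a graph whose nodes are detected layout elements and whose edges carry the spatial (Up, Down, Left, Right) and logical (Parent, Child, Sequence, Reference) relations listed in the GraphDoc annotations. The paper's code is not yet released, so the following is only an illustrative sketch of such a structure; all class and function names (`Element`, `DocGraph`, `reading_order`) are hypothetical, not from the authors' implementation.

```python
from dataclasses import dataclass, field

# Relation vocabularies as described for the GraphDoc dataset.
SPATIAL = {"up", "down", "left", "right"}
LOGICAL = {"parent", "child", "sequence", "reference"}

@dataclass
class Element:
    idx: int
    category: str   # e.g. "text", "table", "picture"
    bbox: tuple     # (x0, y0, x1, y1) in page coordinates

@dataclass
class DocGraph:
    elements: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_idx, dst_idx, relation)

    def add_relation(self, src, dst, relation):
        assert relation in SPATIAL | LOGICAL, f"unknown relation: {relation}"
        self.edges.append((src, dst, relation))

    def reading_order(self):
        # Follow "sequence" edges to recover a linear reading order:
        # a chain start is a node with an outgoing but no incoming edge.
        nxt = {s: d for s, d, r in self.edges if r == "sequence"}
        starts = set(nxt) - set(nxt.values())
        order, cur = [], (min(starts) if starts else None)
        while cur is not None:
            order.append(cur)
            cur = nxt.get(cur)
        return order
```

Reading order, one of the downstream tasks the paper names, then falls out of the `sequence` edges alone, while hierarchy comes from `parent`/`child` edges.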
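The evaluation metrics named in the pseudocode row, mRg@TR and mAPg@TR, score predicted relation triplets against ground truth. A common formulation for such metrics (as in scene-graph evaluation) counts a triplet (subject box, relation, object box) as correct when the relation labels agree and both boxes overlap their ground-truth counterparts with IoU at or above the threshold TR. The sketch below implements a simplified recall along those lines; it is a stand-in for illustration, not the paper's Algorithm 1.

```python
def iou(a, b):
    # Intersection-over-union of two (x0, y0, x1, y1) boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def relation_recall(pred, gt, thr=0.5):
    """Fraction of ground-truth triplets matched by some prediction.

    Triplets are (subj_box, relation, obj_box); a match requires equal
    relation labels and IoU >= thr on both boxes. Each prediction may
    match at most one ground-truth triplet.
    """
    matched, used = 0, set()
    for g_subj, g_rel, g_obj in gt:
        for i, (p_subj, p_rel, p_obj) in enumerate(pred):
            if i in used or p_rel != g_rel:
                continue
            if iou(p_subj, g_subj) >= thr and iou(p_obj, g_obj) >= thr:
                matched += 1
                used.add(i)
                break
    return matched / len(gt) if gt else 0.0
```

An AP-style variant (mAPg@TR) would additionally rank predictions by confidence and average precision over recall levels, as in standard detection mAP.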
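The reported experiment setup (AdamW with lr 1×10⁻⁴, weight decay 5×10⁻³, betas (0.9, 0.999), epsilon 1×10⁻⁸, batch size 4, plus multi-scale resizing capped at a 1333-pixel longer side) can be written down as a reproducible configuration. The sketch below records those numbers and the resize rule in plain Python; the `OPTIM_CFG` dict and `multiscale_size` helper are illustrative names, intended to be passed to `torch.optim.AdamW` and an image transform in an actual PyTorch pipeline.

```python
import random

# Optimizer hyperparameters exactly as reported in the paper.
OPTIM_CFG = {
    "lr": 1e-4,            # initial learning rate, 1 × 10⁻⁴
    "weight_decay": 5e-3,  # 5 × 10⁻³
    "betas": (0.9, 0.999),
    "eps": 1e-8,
    "batch_size": 4,
}

# Multi-scale training: the shorter image side is randomly resized to one
# of these lengths, while the longer side must not exceed 1333 pixels.
SHORT_SIDES = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
MAX_LONG = 1333

def multiscale_size(h, w, rng=random):
    """Return an (height, width) target obeying the multi-scale rule."""
    short = rng.choice(SHORT_SIDES)
    scale = short / min(h, w)
    if max(h, w) * scale > MAX_LONG:
        # Shrink further so the longer side stays within the cap.
        scale = MAX_LONG / max(h, w)
    return round(h * scale), round(w * scale)
```

In a PyTorch training loop this would become `torch.optim.AdamW(model.parameters(), **{k: v for k, v in OPTIM_CFG.items() if k != "batch_size"})`, with the resize applied per image before batching.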