AutoDocSegmenter: A Geometric Approach towards Self-Supervised Document Segmentation

Authors: Ankita Chatterjee, Anjali Raj, Soumyadeep Dey, Pratik Jawanpuria, Jayanta Mukhopadhyay, Partha Pratim Das

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach on several benchmarks, where it outperforms state-of-the-art document segmentation methods. Our code is available at https://github.com/ankitachatterjee94/AutoDocSegmenter. We evaluate AutoDocSegmenter on the PRImA (Antonacopoulos et al., 2009), DocLayNet, PubLayNet, and M6-Doc (Cheng et al., 2023) datasets and show that it outperforms the existing baselines in both within-domain and cross-domain settings. Overall, we observe that AutoDocSegmenter is able to generalize to complex layout images which are not observed during the training stage. In Section 4, we present a comprehensive analysis of the experiments conducted to validate our approach, along with a detailed discussion of the observed results.
Researcher Affiliation | Collaboration | Ankita Chatterjee (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur); Anjali Raj (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur); Soumyadeep Dey (Microsoft, India); Pratik Jawanpuria (Microsoft, India); Jayanta Mukhopadhyay (Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur); Partha Pratim Das (Department of Computer Science, Ashoka University)
Pseudocode | Yes | The isothetic covers are generated without any backtracking and in linear time with respect to the perimeter of the polygon. Pseudo masks are generated by filling these polygons. We refer to this stage of the training pipeline with the conventional isothetic covers (Biswas et al., 2010) as AutoDocSegmenter-I. The detailed algorithm is discussed in Appendix A.2. Algorithm 1: Merging Polygons. Algorithm 2: Path Traversal of Isothetic Covers.
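The "fill these polygons to obtain pseudo masks" step above can be sketched as a standard even-odd (ray-crossing) rasterizer. This is a minimal NumPy illustration, not the paper's released implementation: the function name, signature, and the choice of a scanline/ray-crossing fill are assumptions for demonstration.

```python
import numpy as np

def fill_polygon_mask(vertices, height, width):
    """Rasterize a closed polygon into a binary pseudo mask (even-odd rule).

    `vertices` is a list of (x, y) corners of an isothetic cover.
    This helper is illustrative; the paper's code may fill polygons
    differently (e.g., via an image library's polygon-fill routine).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs.ravel() + 0.5, ys.ravel() + 0.5  # pixel centers
    inside = np.zeros(px.shape, dtype=bool)
    v = np.asarray(vertices, dtype=float)
    n = len(v)
    for i in range(n):
        x1, y1 = v[i]
        x2, y2 = v[(i + 1) % n]
        if y1 == y2:
            continue  # horizontal edges never cross a horizontal ray
        # Toggle "inside" each time the rightward ray from a pixel
        # center crosses this edge.
        crosses = ((y1 > py) != (y2 > py)) & (
            px < (x2 - x1) * (py - y1) / (y2 - y1) + x1
        )
        inside ^= crosses
    return inside.reshape(height, width).astype(np.uint8)

# A square isothetic cover from (2, 2) to (6, 6) on a 10x10 canvas.
mask = fill_polygon_mask([(2, 2), (6, 2), (6, 6), (2, 6)], 10, 10)
```

Because isothetic covers are axis-parallel polygons, half of their edges are horizontal and are skipped outright, which keeps the fill cheap.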
Open Source Code | Yes | Our code is available at https://github.com/ankitachatterjee94/AutoDocSegmenter
Open Datasets | Yes | We evaluate our method on four popular document segmentation datasets: PRImA (Antonacopoulos et al., 2009), DocLayNet (Pfitzmann et al., 2022), PubLayNet (Zhong et al., 2019), and M6-Doc (Cheng et al., 2023).
Dataset Splits | Yes | PRImA has 382 and 96 images for training and testing, respectively, with polygonal annotations for each entity. DocLayNet and PubLayNet have rectangular annotations and contain 69,375/6,489 and 335,703/11,245 images for training/testing, respectively. M6-Doc is a recent dataset with only a test set of 2,724 images.
Hardware Specification | No | The paper discusses lightweight and heavyweight models for mobile applications but does not specify the hardware used for running the experiments. For example, it states: 'We report results of AutoDocSegmenter with lightweight encoders (MiT-B0 and MobileNetV3-small) which offer significant advantages in terms of parameter and computational efficiency. These are important prerequisites for mobile applications (e.g., Microsoft's M365 and Office Lens apps, Adobe Scan app, etc.) that demand fast and reliable document layout analysis.' This describes target deployment, not experimental hardware.
Software Dependencies | No | The paper mentions software components such as the 'Adam optimizer' and 'Otsu's thresholding technique' but does not provide specific version numbers for any libraries, programming languages, or development environments that would be needed to reproduce the experiments.
Experiment Setup | Yes | The model is trained with the Adam optimizer and a learning rate of 0.001 for 50 epochs, and the learning rate is lowered by a factor of 10 every 10 epochs. All document images are resized to 256 pixels. For both the conventional isothetic covers and the proposed modified isothetic covers (Algorithm 1), we employ Otsu's thresholding technique (Otsu, 1979) for global and local thresholding, as it can automatically find the optimal threshold that minimizes the intra-class variance of the pixel intensities. We use the mean average precision (mAP) metric to compare segmentation models. We observe that a grid size of 18 achieves the best performance for both encoders, attaining the highest IoU and mAP scores. Likewise, we find appropriate grid sizes of 8 and 10 for DocLayNet and PubLayNet, respectively, by following the same procedure. We set the grid size to be 1/100th of the minimum dimension of the image in all of our experiments. We observe that the default threshold value of 50% is a robust choice across datasets for polygon merging.
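Two reproducible pieces of the setup above are the Otsu thresholding step and the grid-size rule (1/100th of the minimum image dimension). The sketch below is a minimal NumPy rendering of both under stated assumptions: `otsu_threshold` and `grid_size` are illustrative names, and the histogram-based formulation of Otsu's method stands in for whatever library routine the authors actually used.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold minimizing intra-class variance (Otsu, 1979).

    `gray` is a uint8 array; pixels <= threshold form class 0. This is
    the textbook histogram formulation, not the paper's released code.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                  # intensity probabilities
    omega = np.cumsum(p)                   # class-0 probability at each cut
    mu = np.cumsum(p * np.arange(256))     # cumulative class-0 mean mass
    mu_t = mu[-1]                          # global mean
    # Maximizing between-class variance minimizes intra-class variance.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0   # empty classes carry no variance
    return int(np.argmax(sigma_b))

def grid_size(height, width):
    """Grid-size rule stated in the paper: 1/100th of the minimum dimension."""
    return max(1, min(height, width) // 100)

# Bimodal toy "document": dark ink (40) on a bright page (200).
img = np.concatenate([np.full(500, 40, np.uint8), np.full(500, 200, np.uint8)])
t = otsu_threshold(img)  # any cut between the two modes separates them
```

On a cleanly bimodal image like this, every cut between the two modes yields the same between-class variance, so `argmax` returns the first such threshold; real document images produce a single well-defined peak.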