LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
Authors: Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct systematic experiments to show that the vision and layout modalities hold invariance among images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, where the language knowledge is decoupled by a diffusion model, and then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models and also maintains competitiveness on downstream monolingual/English benchmarks. |
| Researcher Affiliation | Academia | Huawen Shen (1,3), Gengluo Li (1,3), Jinwen Zhong (1)*, Yu Zhou (2)*; (1) Institute of Information Engineering, Chinese Academy of Sciences; (2) VCIP & TMCC & DISSec, College of Computer Science, Nankai University; (3) School of Cyber Security, University of Chinese Academy of Sciences. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed LDM framework and its components (MTIM, LKI) through textual explanations and an overall illustration in Figure 4. However, it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described. It mentions using third-party tools like 'AnyText' (Tuo et al. 2024) and 'ESP' (Yang et al. 2023) but does not provide access to their own implementation. |
| Open Datasets | Yes | We use DocBank (Li et al. 2020) and RVL-CDIP (Harley, Ufkes, and Derpanis 2015) to pre-train our model. ... FUNSD (Jaume, Ekenel, and Thiran 2019) is a well-annotated English dataset ... XFUND (Xu et al. 2022) is a multilingual extension of FUNSD ... SIBR (Yang et al. 2023) is a bilingual dataset ... CORD (Park et al. 2019) is an English dataset... |
| Dataset Splits | Yes | FUNSD ... containing 149 training examples and 50 testing samples. ... XFUND ... with 149 training samples and 50 testing samples for each language. ... SIBR ... includes 600 training samples and 400 testing samples... CORD ... It contains 800 training samples, 100 validation samples, and 100 testing samples. |
| Hardware Specification | Yes | The LDM model is pre-trained for 10 epochs and fine-tuned for 2000 steps using 8 NVIDIA A6000 48GB GPUs. |
| Software Dependencies | No | The LDM model is built using the PyTorch framework and the Hugging Face Transformers library. We adhere to all preprocessing steps and pre-trained parameters from SAMBASE, except for the prediction head. All other parameters are randomly initialized. |
| Experiment Setup | Yes | The LDM model is trained using the Adam optimizer with a learning rate of 2e-4. The learning rate is linearly warmed up for the first 10% of steps, followed by cosine decay. The training batch size is set to 32. ... During pre-training, the number of bounding boxes is truncated to 512, while in fine-tuning, all bounding boxes are retained. |
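The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 10% of steps, then cosine decay, base rate 2e-4) can be sketched as a small standalone function. This is a plausible reading of the paper's description, not the authors' code; the decay floor of 0 is an assumption, since the paper does not state a minimum learning rate.

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_frac: float = 0.10) -> float:
    """Linear warmup for the first `warmup_frac` of steps, then cosine
    decay toward 0 (assumed floor; the paper does not specify one)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to base_lr over the warmup phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the paper's 2000 fine-tuning steps, warmup covers the first 200 steps, after which the rate decays smoothly from 2e-4.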