LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
Authors: Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct systematic experiments to show that the vision and layout modalities hold invariance among images in different languages. If language bias is decoupled from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm, LDP (Language Decoupled Pre-training), for better utilization of monolingual pre-training data. Our proposed model, LDM (Language Decoupled Model), is first pre-trained on language-independent data, where the language knowledge is decoupled by a diffusion model, and then fine-tuned on the downstream languages. Extensive experiments show that LDM outperforms all SOTA multilingual pre-trained models and also maintains competitiveness on downstream monolingual/English benchmarks. |
| Researcher Affiliation | Academia | Huawen Shen (1,3), Gengluo Li (1,3), Jinwen Zhong (1)*, Yu Zhou (2)*; (1) Institute of Information Engineering, Chinese Academy of Sciences; (2) VCIP & TMCC & DISSec, College of Computer Science, Nankai University; (3) School of Cyber Security, University of Chinese Academy of Sciences. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed LDM framework and its components (MTIM, LKI) through textual explanations and an overall illustration in Figure 4. However, it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described. It mentions using third-party tools like 'AnyText' (Tuo et al. 2024) and 'ESP' (Yang et al. 2023) but does not provide access to their own implementation. |
| Open Datasets | Yes | We use DocBank (Li et al. 2020) and RVL-CDIP (Harley, Ufkes, and Derpanis 2015) to pre-train our model. ... FUNSD (Jaume, Ekenel, and Thiran 2019) is a well-annotated English dataset ... XFUND (Xu et al. 2022) is a multilingual extension of FUNSD ... SIBR (Yang et al. 2023) is a bilingual dataset ... CORD (Park et al. 2019) is an English dataset... |
| Dataset Splits | Yes | FUNSD ... containing 149 training examples and 50 testing samples. ... XFUND ... with 149 training samples and 50 testing samples for each language. ... SIBR ... includes 600 training samples and 400 testing samples... CORD ... It contains 800 training samples, 100 validation samples, and 100 testing samples. |
| Hardware Specification | Yes | The LDM model is pre-trained for 10 epochs and fine-tuned for 2000 steps using 8 NVIDIA A6000 48GB GPUs. |
| Software Dependencies | No | The LDM model is built using the PyTorch framework and the Hugging Face Transformers library. We adhere to all preprocessing steps and pre-trained parameters from SAMBASE, except for the prediction head. All other parameters are randomly initialized. |
| Experiment Setup | Yes | The LDM model is trained using the Adam optimizer with a learning rate of 2e-4. The learning rate is linearly warmed up for the first 10% of steps, followed by cosine decay. The training batch size is set to 32. ... During pre-training, the number of bounding boxes is truncated to 512, while in fine-tuning, all bounding boxes are retained. |
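The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 10% of steps, then cosine decay, base rate 2e-4) can be sketched as a small standalone function. This is a plausible reading of the paper's description, not the authors' code; the decay floor of 0 is an assumption, since the paper does not state a minimum learning rate.

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_frac: float = 0.10) -> float:
    """Linear warmup for the first `warmup_frac` of steps, then cosine
    decay toward 0 (assumed floor; the paper does not specify one)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to base_lr over the warmup phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the paper's 2000 fine-tuning steps, warmup covers the first 200 steps, after which the rate decays smoothly from 2e-4.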