Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks

Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M training instances, all of which are fully open-sourced, demonstrating both high efficiency and effectiveness compared to models trained on large-scale in-house data. Our code and data are available at https://github.com/tencent-ailab/Leopard.
Researcher Affiliation | Collaboration | 1 University of Notre Dame; 2 Tencent AI Seattle Lab; 3 UIUC
Pseudocode | No | The paper describes methods in paragraph text and presents a model pipeline diagram in Figure 2, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/tencent-ailab/Leopard.
Open Datasets | Yes | We curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. [...] Our code and data are available at https://github.com/tencent-ailab/Leopard. [...] To train Leopard, we create a large-scale instruction-tuning dataset named Leopard-instruct, comprising 925K instances, with 739K specifically tailored for text-rich, multi-image scenarios. Table 1 lists the composition of our data, with a detailed breakdown in Appendix A.1. [...] We include public multi-page document datasets (Tito et al., 2022; Landeghem et al., 2023; Zhu et al., 2022), covering a variety of document types such as scanned handwriting, printed documents, and digital PDFs. [...] Table 8 provides a detailed breakdown of the composition of the Leopard-instruct dataset. This table includes the name, domain, and sample size of sub-datasets.
Dataset Splits | No | The paper mentions training on Leopard-instruct and evaluating on various benchmarks (MVQAD, DUDE, SlideVQA, etc.) but does not provide specific train/validation/test splits for its own dataset or explain how data was partitioned for the benchmarks used.
Hardware Specification | Yes | We train both Leopard-LLaVA and Leopard-Idefics2 on 64 A100-40G GPUs with a global batch size of 128.
Software Dependencies | No | The paper mentions using GPT-4o, LLaMA3.1, SigLIP-SO-400M, and the AdamW optimizer, but does not provide specific version numbers for these software components or for any other libraries/frameworks such as PyTorch or Python.
Experiment Setup | Yes | We train both Leopard-LLaVA and Leopard-Idefics2 on 64 A100-40G GPUs with a global batch size of 128. We use the AdamW optimizer with β1 = 0.9, β2 = 0.999. Following (Jiang et al., 2024), we use a learning rate of 1 × 10−5 for Leopard-LLaVA and 5 × 10−6 for Leopard-Idefics2 to protect its pretrained knowledge. We use a cosine learning rate scheduler with a linear learning rate warm-up for the first 3% of steps. All model variants are trained for 1 epoch under the same hyperparameters.
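The learning-rate schedule quoted above (linear warm-up over the first 3% of steps, then cosine decay) can be sketched as a small standalone function. This is a minimal illustration of that schedule, not the authors' code; the function name and the decay-to-zero endpoint are assumptions.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.03):
    """Illustrative LR schedule: linear warm-up for the first `warmup_frac`
    of steps, then cosine decay toward zero. Not taken from the paper's code."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LRs from the paper: 1e-5 (Leopard-LLaVA), 5e-6 (Leopard-Idefics2).
# The total step count here is only a placeholder.
total = 10_000
print(lr_at_step(0, total, 1e-5))          # early in warm-up: small
print(lr_at_step(299, total, 1e-5))        # end of warm-up: at peak
print(lr_at_step(total - 1, total, 1e-5))  # end of training: near zero
```

With a global batch size of 128 on 64 GPUs, this schedule would be stepped once per optimizer update (2 samples per GPU per step, assuming no gradient accumulation, which the paper does not specify).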