Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks

Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M training instances, all of which are fully open-sourced, demonstrating both high efficiency and effectiveness compared to models trained on large-scale in-house data. Our code and data are available at https://github.com/tencent-ailab/Leopard.
Researcher Affiliation | Collaboration | 1 University of Notre Dame; 2 Tencent AI Seattle Lab; 3 UIUC
Pseudocode | No | The paper describes methods in paragraph text and presents a model pipeline diagram in Figure 2, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/tencent-ailab/Leopard.
Open Datasets | Yes | We curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios. [...] Our code and data are available at https://github.com/tencent-ailab/Leopard. [...] To train Leopard, we create a large-scale instruction-tuning dataset named Leopard-instruct, comprising 925K instances, with 739K specifically tailored for text-rich, multi-image scenarios. Table 1 lists the composition of our data, with a detailed breakdown in Appendix A.1. [...] We include public multi-page document datasets (Tito et al., 2022; Landeghem et al., 2023; Zhu et al., 2022), covering a variety of document types such as scanned handwriting, printed documents, and digital PDFs. [...] Table 8 provides a detailed breakdown of the composition of the Leopard-instruct dataset. This table includes the name, domain, and sample size of sub-datasets.
Dataset Splits | No | The paper mentions training on Leopard-instruct and evaluating on various benchmarks (MVQAD, DUDE, SlideVQA, etc.) but does not provide specific train/validation/test splits for its own dataset or explain how data was partitioned for the benchmarks used.
Hardware Specification | Yes | We train both Leopard-LLaVA and Leopard-Idefics2 on 64 A100-40G GPUs with a global batch size of 128.
Software Dependencies | No | The paper mentions using GPT-4o, LLaMA3.1, SigLIP-SO-400M, and the AdamW optimizer, but does not provide specific version numbers for these software components or for any other libraries/frameworks such as PyTorch or Python.
Experiment Setup | Yes | We train both Leopard-LLaVA and Leopard-Idefics2 on 64 A100-40G GPUs with a global batch size of 128. We use the AdamW optimizer with β1 = 0.9, β2 = 0.999. Following (Jiang et al., 2024), we use a learning rate of 1 × 10−5 for Leopard-LLaVA and 5 × 10−6 for Leopard-Idefics2 to protect its pretrained knowledge. We use a cosine learning rate scheduler with a linear learning rate warm-up for the first 3% of steps. All model variants are trained for 1 epoch under the same hyperparameters.
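The learning-rate schedule quoted above (linear warm-up over the first 3% of steps, then cosine decay) can be sketched as a small standalone function. This is a minimal illustration of that schedule, not the authors' code; the function name and the decay-to-zero endpoint are assumptions.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.03):
    """Illustrative LR schedule: linear warm-up for the first `warmup_frac`
    of steps, then cosine decay toward zero. Not taken from the paper's code."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LRs from the paper: 1e-5 (Leopard-LLaVA), 5e-6 (Leopard-Idefics2).
# The total step count here is only a placeholder.
total = 10_000
print(lr_at_step(0, total, 1e-5))          # early in warm-up: small
print(lr_at_step(299, total, 1e-5))        # end of warm-up: at peak
print(lr_at_step(total - 1, total, 1e-5))  # end of training: near zero
```

With a global batch size of 128 on 64 GPUs, this schedule would be stepped once per optimizer update (2 samples per GPU per step, assuming no gradient accumulation, which the paper does not specify).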