ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept: doing document retrieval by directly embedding the images of the document pages. We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically simpler, faster, and end-to-end trainable. We release models, data, code and benchmarks under open licenses at https://hf.co/vidore. In our experiments (Table 2), we typically find that optimizing the ingestion pipeline yields much better performance on visually rich document retrieval than optimizing the text embedding model.
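The late interaction matching mentioned above can be sketched as a ColBERT-style MaxSim score: each query-token embedding is matched against its most similar page-patch embedding, and the per-token maxima are summed. A minimal NumPy sketch with toy 2-D vectors (not the actual ColPali embeddings or dimensions):

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """ColBERT-style late interaction (MaxSim).
    query_emb: (n_query_tokens, dim), page_emb: (n_page_patches, dim)."""
    # Similarity of every query token against every page patch.
    sims = query_emb @ page_emb.T          # (n_query_tokens, n_page_patches)
    # For each query token, keep its best-matching patch, then sum over tokens.
    return float(sims.max(axis=1).sum())

# Toy example: a 2-token query and two candidate pages of patch embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_match = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # contains close matches
page_other = np.array([[-1.0, 0.0], [0.0, -1.0]])            # no good matches
scores = [late_interaction_score(query, p) for p in (page_match, page_other)]
```

Ranking candidate pages by this score is what replaces the usual single-vector cosine similarity in a late-interaction retriever.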
Researcher Affiliation | Collaboration | Manuel Faysse 1,3; Hugues Sibille 1,4; Tony Wu 1; Bilel Omrani 1; Gautier Viaud 1; Céline Hudelot 3; Pierre Colombo 2,3. 1 Illuin Technology; 2 Equall.ai; 3 CentraleSupélec, Paris-Saclay; 4 ETH Zürich.
Pseudocode | No | The paper describes the methodology using mathematical formulations (Equations 1 and 2) and textual descriptions of the architecture and training process, but it does not include any clearly labeled pseudocode or algorithm blocks. The steps are explained in paragraph form.
Open Source Code | Yes | We release models, data, code and benchmarks under open licenses at https://hf.co/vidore. We release all resources at https://hf.co/vidore. For transparency, reproducibility and to foster future work, we release our training data, model checkpoints (adapters), entire codebase, and complete evaluation benchmark under MIT licenses as detailed in the main paper.
Open Datasets | Yes | To this end, we create and openly release ViDoRe, a comprehensive benchmark to evaluate systems on page-level document retrieval with a wide coverage of domains, visual elements, and languages. We openly release the training dataset for reproducibility and to encourage further research: https://huggingface.co/datasets/vidore/colpali_train_set. We repurpose widely used visual question-answering benchmarks for retrieval tasks: for each page-question-answer triplet, we use the question as the query, and the associated page as the gold document (Table 1). These academic datasets either focus on single specific modalities (Mathew et al., 2020; 2021; Li et al., 2024) or target more varied visually rich documents (Zhu et al., 2022).
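The repurposing of VQA triplets into retrieval pairs described above is a simple transformation; a sketch with hypothetical field names (`question`, `page_id`, `answer` are illustrative, not the actual dataset schema):

```python
# Hypothetical page-question-answer triplets; the field names and
# values are illustrative, not the real ViDoRe records.
triplets = [
    {"question": "What is the total revenue?", "page_id": "doc1_p3", "answer": "4.2M"},
    {"question": "Who signed the memo?", "page_id": "doc2_p1", "answer": "J. Doe"},
]

def to_retrieval_pairs(records):
    """Use the question as the query and the associated page as the gold
    document; the answer text itself is not needed for retrieval."""
    return [(r["question"], r["page_id"]) for r in records]

pairs = to_retrieval_pairs(triplets)
```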
Dataset Splits | Yes | A validation set is created with 2% of the samples to tune hyperparameters. Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. Table 3: Details on the different splits in the dataset used to train ColPali. The statistics of the train set are given in the following table. The creation of the train set follows the same methodology as in subsection A.2. We made sure that a PDF document cannot have pages in both the training set and the test set to prevent data leakage, and that there are no duplicate documents in each split. Dataset Split | Split Size | Language | Domain: DocVQA | 39,463 | English | Scanned documents from UCSF...; Scraped PDFs | 45,940 | English | Varied PDFs from 3885 distinct URL domains.
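The leakage prevention described above, assigning whole PDF documents to a single split so that no document has pages on both sides, can be sketched as a group-aware split (function and field names are illustrative, not the authors' code):

```python
import random

def split_by_document(pages, val_fraction=0.02, seed=0):
    """Assign entire documents to train or validation so that no PDF
    has pages in both splits. `pages` holds (doc_id, page) tuples."""
    doc_ids = sorted({doc_id for doc_id, _ in pages})
    rng = random.Random(seed)
    rng.shuffle(doc_ids)
    # Hold out a fraction of documents (at least one) for validation.
    n_val = max(1, int(len(doc_ids) * val_fraction))
    val_docs = set(doc_ids[:n_val])
    train = [p for p in pages if p[0] not in val_docs]
    val = [p for p in pages if p[0] in val_docs]
    return train, val

# Toy corpus: three documents with a few pages each.
pages = [("docA", 1), ("docA", 2), ("docB", 1), ("docB", 2), ("docC", 1)]
train, val = split_by_document(pages, val_fraction=0.34)
```

Splitting at the document level rather than the page level is what prevents near-duplicate pages of the same PDF from leaking across splits.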
Hardware Specification | Yes | Querying latencies at runtime (R3) are very good for all evaluated systems (~22 ms on an NVIDIA L4) due to fast query encoding and cosine similarity matching. To ensure comparison fairness, the latencies of the different retrieval systems shown in Figure 2 are measured on the same g2-standard-8 GCP VM with an NVIDIA L4 GPU.
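The cosine-similarity matching step that keeps query latency low can be timed with a small sketch; the corpus size, embedding dimension, and use of NumPy on CPU here are illustrative assumptions, not the paper's GPU benchmark setup:

```python
import time
import numpy as np

def cosine_topk(query_vec, corpus, k=5):
    """Cosine-similarity retrieval over a pre-normalized corpus matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = corpus @ q                    # one matrix-vector product
    return np.argsort(-scores)[:k]         # indices of the k best pages

# Illustrative corpus: 10k page embeddings of dimension 128.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 128))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=128)

start = time.perf_counter()
top = cosine_topk(query, corpus)
elapsed_ms = (time.perf_counter() - start) * 1e3
```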
Software Dependencies | No | The codebase is written in PyTorch and leverages Hugging Face tooling for model implementations and trainers. While PyTorch and Hugging Face are mentioned as software tools, specific version numbers for these or any other software dependencies are not provided.
Experiment Setup | Yes | Unless specified otherwise, we train models in bfloat16 format, use low-rank adapters (LoRA; Hu et al., 2021) with α = 32 and r = 32 on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a paged_adamw_8bit optimizer. We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32. Hyperparameters are tuned on a validation split composed of 2% of the training dataset. We find bi-encoder methods to be more sensitive to learning rate variations than late interaction-based models and achieve the best performance for all models with a learning rate of 5e-5. We experiment with LoRA rank and α values and do not notice particular improvements past r = α = 32. Per-device batch sizes are kept small due to long sequence lengths that complicate scaling past b = 4. We simulate larger batch sizes with multi-GPU training and train with a total batch size b = 32 with no accumulation, for 1 epoch on our training set.
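The hyperparameters quoted above can be collected into plain dicts whose keys mirror the Hugging Face `peft.LoraConfig` and `transformers.TrainingArguments` field names (a sketch under the assumption of that tooling; the `target_modules` names are illustrative, not taken from the authors' configuration):

```python
# LoRA adapter configuration: r = alpha = 32 on the language-model
# transformer layers. Keys mirror peft.LoraConfig; module names below
# are illustrative placeholders.
lora_config = {
    "r": 32,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Training configuration; keys mirror transformers.TrainingArguments.
training_args = {
    "bf16": True,                         # bfloat16 training
    "optim": "paged_adamw_8bit",
    "learning_rate": 5e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.025,                # 2.5% warmup steps
    "per_device_train_batch_size": 4,     # long sequences cap this at 4
    "num_train_epochs": 1,
}

# Effective batch size: 8 GPUs x 4 per device = 32, with no accumulation.
world_size = 8
effective_batch = world_size * training_args["per_device_train_batch_size"]
```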