Compositional Condition Question Answering in Tabular Understanding
Authors: Jun-Peng Jiang, Tao Zhou, De-Chuan Zhan, Han-Jia Ye
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at https://github.com/LAMDA-Tabular/MMTU. ... To answer the question, we conducted a preliminary investigation. ... Our analysis, shown in Figure 1, reveals that current MLLMs perform well on IE and RC tasks, which mainly require structural recognition. ... Experimental results demonstrate the effectiveness of COCOTAB in improving the model's ability in complex TQA tasks, particularly those involving compositional conditions. ... In this section, we first outline the experimental framework, providing details on the specific implementation, evaluation benchmarks, and MLLMs used for comparative assessment. Subsequently, we use tabular understanding benchmarks to conduct a comprehensive comparison of COCOTAB with state-of-the-art methods. Finally, this section summarizes the ablation study and visualizations for the tabular understanding case, highlighting COCOTAB's exceptional ability in handling compositional condition tasks. |
| Researcher Affiliation | Academia | Jun-Peng Jiang¹², Tao Zhou¹², De-Chuan Zhan¹², Han-Jia Ye¹². ¹School of Artificial Intelligence, Nanjing University, Nanjing, China. ²National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China. Correspondence to: Han-Jia Ye <EMAIL>. |
| Pseudocode | No | The paper describes the method and architecture visually in Figure 3 and in descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/LAMDA-Tabular/MMTU. |
| Open Datasets | Yes | Additionally, we also introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. ... For the MMTU benchmark, we curated tables from WTQ (Pasupat & Liang, 2015), TabFact (Chen et al., 2019), and NAT-QA, creating four QA task types across over ten domains, yielding 8921 QA pairs. ... The datasets utilized in both stages are detailed in Table 3. (Table 3 lists: LLaVA-1.5-pretrain (Liu et al., 2024b), Laion-Caption (Schuhmann et al., 2022), CC12M-Caption (Changpinyo et al., 2021), PUB-1M (Smock et al., 2022), WTQ (Pasupat & Liang, 2015), TabFact (Chen et al., 2019), Plot-QA (Wang et al., 2024b), OCR-VQA (Mishra et al., 2019), LLaVA-1.5-finetune (Liu et al., 2024b)) |
| Dataset Splits | No | We evaluated the aforementioned models on the widely used tabular understanding benchmark: WikiTableQuestions (WTQ) (Pasupat & Liang, 2015). In particular, we divided the test set of the entire WTQ into four categories based on the nature of the tables, including understanding individual elements (IE), interpreting rows or columns (RC), comprehending compositional conditions (CC), and performing basic calculations or reasoning (CR). For convenience in statistics, we randomly sampled 60 questions from each category and evaluated them on this smaller dataset. ... For the massive multimodal tabular understanding benchmark, denoted as MMTU, we select and clean the suitable tables from WTQ, FinQA and Arxiv papers. Based on these tables, we constructed four types of question-answering tasks according to different understanding objectives, resulting in four question categories, six domains, and approximately 10,000 question-answer pairs. |
| Hardware Specification | Yes | The entire training process takes about 5 days on four A800 GPUs. |
| Software Dependencies | No | In this study, we configure COCOTAB with the pre-trained Siglip-ViT (Zhai et al., 2023) as the vision encoder and Qwen2-Instruct (Yang et al., 2024) as the backbone for LLM. While specific models for the vision encoder and LLM are mentioned, no version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA) are provided. |
| Experiment Setup | Yes | Implementation Details: In this study, we configure COCOTAB with the pre-trained Siglip-ViT (Zhai et al., 2023) as the vision encoder and Qwen2-Instruct (Yang et al., 2024) as the backbone for the LLM. The initial learning rates for the two stages are set to 2e-4 and 2e-6, respectively, with batch sizes of 64 and 32. The learning rate for the vision encoder is set to 5e-7. The entire training process takes about 5 days on four A800 GPUs. Additionally, BF16 and TF32 precision formats are employed to balance speed and accuracy during training. As shown in Figure 3, we set three projections for visual patches, row patches, and column patches respectively. ... Table 4. Training hyperparameters (Stage 1 / Stage 2): MLP expert network: 2 linear layers with SiLU; DeepSpeed: ZeRO-3 / ZeRO-3; Image resolution: 384×384; Image encoder: siglip-so400m-patch14-384; Feature select layer: -2; Image projector: 2 linear layers with GeLU; Epochs: 1 / 2; Optimizer: AdamW; Learning rate: 2e-4 / 2e-6; Learning rate (vision): 5e-7; Learning rate scheduler: cosine; Weight decay: 0.0; Text max length: 8096 / 2048; Batch size per GPU: 16 / 8; GPUs: 4× A800-80G; Precision: BF16; Gradient checkpointing: true. |
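The experiment-setup row describes three separate projections for visual, row, and column patches, each a 2-layer MLP with GeLU (per Table 4's "Image projector: 2 linear layers with GeLU"). A minimal sketch of that projection layout is below; the class and attribute names are hypothetical, and the default dimensions (1152 for siglip-so400m-patch14-384 features, 3584 for a Qwen2-7B-Instruct hidden size) are assumptions about the exact model variants, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class PatchProjector(nn.Module):
    """2-layer MLP with GeLU, matching Table 4's image-projector description."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TriProjection(nn.Module):
    """Three separate projections for visual, row, and column patches
    (hypothetical module names; dimensions are assumptions)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.visual_proj = PatchProjector(vision_dim, llm_dim)
        self.row_proj = PatchProjector(vision_dim, llm_dim)
        self.col_proj = PatchProjector(vision_dim, llm_dim)

    def forward(self, visual, rows, cols):
        # Each patch stream is mapped independently into the LLM embedding space.
        return self.visual_proj(visual), self.row_proj(rows), self.col_proj(cols)
```

Keeping the three projectors separate lets the model embed whole-image patches and row/column structural patches into the LLM space with independently learned mappings, which is how the quoted text characterizes Figure 3's design.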