Graph4MM: Weaving Multimodal Learning with Structural Information
Authors: Xuying Ning, Dongqi Fu, Tianxin Wei, Wujiang Xu, Jingrui He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement. |
| Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; Meta AI; Rutgers University. |
| Pseudocode | No | The paper describes the proposed Graph4MM framework using natural language and mathematical equations (e.g., Section 3, Equations 1-13) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/YennNing/Graph4MM. |
| Open Datasets | Yes | For the generative task, we use WIKIWEB2M (Burns et al., 2023)... For the discriminative task, we use ELE-FASHION (Zhu et al., 2024b)... |
| Dataset Splits | Yes | Due to storage constraints, we randomly sample 10K Wikipedia pages, resulting in 13,539 section summary samples for training and 1,768 for testing. ... We sample 10k positive and negative node pairs and use an 8:1:1 train/val/test split. |
| Hardware Specification | Yes | All experiments were conducted on computing nodes equipped with 2 NVIDIA A100 or 2 NVIDIA Ada A6000 GPUs. |
| Software Dependencies | No | The paper mentions software components such as 'CLIP' (vision encoder), 'Prefix-Tuning' (for OPT-125M), and 'LoRA' (for LLaMA-1B), but does not provide version numbers for these or for other dependencies such as the programming language or core deep learning libraries. |
| Experiment Setup | Yes | Table 5. Hyperparameter settings for generative and discriminative tasks. This table includes details such as Learning Rate (1e-4), Max Input Length (1024/512), Max Output Length (128/32), Batch Size (2), Gradient Accumulation Steps (16), LoRA Rank (64), Prefix Tuning Virtual Tokens (20), Attention Diffusion Steps (2), Number of MM-QFormer Block (1), Attention Diffusion α (0.1), Number of Attention Heads (8), and Training Epochs (50 for OPT-125M, 3 for LLaMA-1B). |
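The hyperparameter values quoted above can be collected into a small configuration sketch. This is a minimal illustration assembled from the Table 5 values quoted in this report; the dictionary keys and structure are assumptions rather than the authors' code, and the pairing of the slash-separated lengths (1024/512, 128/32) with the generative and discriminative tasks follows the table's stated ordering.

```python
# Hyperparameters transcribed from the paper's Table 5 (values only;
# the keys and grouping below are illustrative assumptions).
HYPERPARAMS = {
    "shared": {
        "learning_rate": 1e-4,
        "batch_size": 2,
        "gradient_accumulation_steps": 16,
        "attention_diffusion_steps": 2,
        "attention_diffusion_alpha": 0.1,
        "num_mm_qformer_blocks": 1,
        "num_attention_heads": 8,
    },
    "generative": {       # section summarization on WIKIWEB2M
        "max_input_length": 1024,
        "max_output_length": 128,
    },
    "discriminative": {   # node/link task on ELE-FASHION
        "max_input_length": 512,
        "max_output_length": 32,
    },
    "backbones": {
        "OPT-125M": {"adapter": "Prefix-Tuning", "virtual_tokens": 20, "epochs": 50},
        "LLaMA-1B": {"adapter": "LoRA", "lora_rank": 64, "epochs": 3},
    },
}

# The effective batch size is batch_size * gradient_accumulation_steps.
effective_batch = (HYPERPARAMS["shared"]["batch_size"]
                   * HYPERPARAMS["shared"]["gradient_accumulation_steps"])
print(effective_batch)  # 32
```

The effective batch size of 32 (2 × 16) is a derived quantity, not a value stated in the paper's table.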