Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction

Authors: Qian Li, Cheng Ji, Shu Guo, Kun Peng, Qianren Mao, Shangguang Wang

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments on benchmark MMRE datasets demonstrate that VM-HAN achieves state-of-the-art performance, significantly surpassing existing methods in both accuracy and efficiency. To evaluate the proposed VM-HAN framework, the paper uses two widely recognized Multi-Modal Relation Extraction (MMRE) datasets, MNRE and MORE, and compares VM-HAN against three categories of baselines: text-based relation extraction (RE) models, BERT-based multi-modal RE models, and graph neural networks (GNNs) for multi-modal relation extraction. From Table 1, the model outperforms all baseline methods, confirming its ability to effectively integrate multi-modal knowledge for improved performance. Further analysis appears in Sections 5.3 (Ablation Study), 5.4 (Discussions for V-HAN), 5.5 (Effect of Visual Information), and 5.6 (Efficiency).
Researcher Affiliation Academia (1) School of Computer Science, Beijing University of Posts and Telecommunications, China; (2) SKLCCSE, School of Computer Science and Engineering, Beihang University, China; (3) Zhongguancun Laboratory, China; (4) National Computer Network Emergency Response Technical Team & Coordination Center, China; (5) Institute of Information Engineering, Chinese Academy of Sciences, China.
Pseudocode No The paper describes the methodology using prose and mathematical equations but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code No The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository.
Open Datasets Yes To evaluate the performance of the proposed VM-HAN framework, we use two widely recognized datasets for Multi-Modal Relation Extraction (MMRE): MNRE and MORE. (1) The MNRE dataset [Zheng et al., 2021b] is sourced from Twitter (the Twitter data stream is archived at https://archive.org/details/twitterstream). (2) To broaden the scope of our investigation, we incorporate the MORE dataset [He et al., 2023].
Dataset Splits No The paper mentions using the MNRE and MORE datasets and fine-tuning hyperparameters based on validation-set performance, but it does not specify exact percentages or counts for the training, validation, and test splits, nor does it cite predefined standard splits in the main text.
Hardware Specification No The paper does not specify any particular hardware used for running the experiments, such as GPU or CPU models, memory, or specific computing environments like cloud instances.
Software Dependencies No For text-based initialization, the textual embeddings were initialized using the bert-base-uncased model from Hugging Face. For visual feature extraction, visual features were extracted using the VGG16 network and YOLOv3 [Redmon and Farhadi, 2018]. The AdamW optimizer [Loshchilov and Hutter, 2019] was employed. The paper names these models and the optimizer but does not state specific software library versions (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup Yes For text-based initialization, the textual embeddings were initialized using the bert-base-uncased model from Hugging Face, with an embedding dimension of 768. Text inputs were either truncated or padded to a maximum sequence length of 128 tokens. For visual feature extraction, visual features were extracted using the VGG16 network and YOLOv3 [Redmon and Farhadi, 2018], widely recognized for their performance in image feature extraction. The dimensionality of visual object features was set to 4096, and the number of objects per image was limited to three to maintain consistency and computational efficiency. The AdamW optimizer [Loshchilov and Hutter, 2019] was employed, with a learning rate of 2e-5 and a weight decay of 0.01. A dropout rate of 0.6 was applied to prevent overfitting. The training process used a batch size of 16. Hyperparameters were fine-tuned via a grid search over five trials, selecting the optimal configuration based on validation-set performance.
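The reported hyperparameters can be collected into a single configuration object for a reimplementation attempt; a minimal sketch (the class and field names are illustrative assumptions, not taken from the authors' code):

```python
from dataclasses import dataclass

# Sketch of the experiment setup reported in the paper.
# VMHANConfig and all field names are hypothetical; only the
# values come from the paper's stated setup.

@dataclass(frozen=True)
class VMHANConfig:
    # Text encoder settings (bert-base-uncased from Hugging Face)
    text_encoder: str = "bert-base-uncased"
    text_dim: int = 768        # embedding dimension
    max_seq_len: int = 128     # truncate/pad length in tokens

    # Visual settings (VGG16 features, YOLOv3 object detection)
    visual_dim: int = 4096     # visual object feature size
    max_objects: int = 3       # objects kept per image

    # Optimization (AdamW)
    lr: float = 2e-5
    weight_decay: float = 0.01
    dropout: float = 0.6
    batch_size: int = 16

cfg = VMHANConfig()
print(cfg.text_encoder, cfg.lr, cfg.batch_size)
```

Freezing the dataclass keeps a run's configuration immutable, which makes the grid search over trials easier to log and compare.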