Cross-modal Multi-task Learning for Multimedia Event Extraction
Authors: Jianwei Cao, Yanli Hu, Zhen Tan, Xiang Zhao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the Multimedia Event Extraction benchmark M2E2, experimental results show that X-MTL surpasses the current state-of-the-art (SOTA) methods by 4.1% for multimedia event mention and 8.2% for multimedia argument role. |
| Researcher Affiliation | Academia | Jianwei Cao¹, Yanli Hu¹*, Zhen Tan¹, Xiang Zhao². ¹National Key Laboratory of Information Systems Engineering, National University of Defense Technology, China; ²Laboratory for Big Data and Decision, National University of Defense Technology, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., equations 1-14) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Following previous work (Li et al. 2020), we use the ACE 2005, imSitu, and VOA Caption datasets to train the model, and evaluate its performance on the M2E2 benchmark. ACE 2005 is an event extraction dataset comprising 15,789 sentences, covering 33 event types and 36 semantic roles. imSitu is a situation recognition dataset comprising 126,102 images, covering 504 activity verbs and 1,788 semantic roles. We additionally use grounding object information from SWiG (Pratt et al. 2020) for training. VOA Caption dataset is an unlabeled image-text pair dataset comprising 123,078 image-text pairs. M2E2 is a multimedia event extraction benchmark comprising 6,167 sentences and 1,014 images from 245 multimedia documents. |
| Dataset Splits | No | The paper mentions the total sizes of datasets like ACE 2005, imSitu, VOA Caption, and M2E2, and states they are used for training or evaluation, but does not explicitly provide specific training, validation, or test splits (percentages or sample counts) for its experiments. It refers to 'Following previous work (Li et al. 2020)' for the experimental setup but does not detail the splits. |
| Hardware Specification | Yes | All the experiments are conducted on NVIDIA RTX 4090 GPU using the PyTorch framework. |
| Software Dependencies | No | The paper mentions 'PyTorch framework', 'YOLOv8', 'BERT model (bert-base-uncased)', and 'CLIP model (clip-vit-base-patch32)', but does not provide specific version numbers for the PyTorch framework or YOLOv8, which are key software dependencies. |
| Experiment Setup | Yes | For fair comparison, we adopt the experimental setup used in previous work (Du et al. 2023). For the backbone models, we use the BERT model (bert-base-uncased) to initialize the parameters of the text encoder, and the visual transformer of the CLIP model (clip-vit-base-patch32) to initialize the parameters of the visual encoder. To detect objects for visual argument extraction, we leverage the pretrained YOLOv8 (Varghese and M. 2024) as the object detector and remove detection results with confidence below 0.8. The number of layers in the modality-shared encoder is set to 2 with a hidden layer size of 1024. We select pseudo labels with confidence greater than 0.8. During training, the temperature factor for dynamic weight adjustment is 0.5; the initial weight for pseudo-label tasks is set to 0.5, while the other tasks are 1.0. The maximum text input length is 300, which covers almost all input text lengths. The threshold for event coreference in CLIP scores is 0.2. All the experiments are conducted on NVIDIA RTX 4090 GPU using the PyTorch framework. We use the AdamW optimizer to minimize the loss function. The learning rate for the text encoder is set to 1e-4, the visual encoder is set to 1e-5, and other parameters are set to 1e-3. The mini-batch size is 64, sampled proportionally according to the dataset size. We train each model for 10 epochs to obtain the final results. |
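The per-module learning rates quoted above (text encoder 1e-4, visual encoder 1e-5, remaining parameters 1e-3, AdamW) map naturally onto PyTorch parameter groups. The sketch below illustrates that mapping only; `XMTLStub`, its submodule names, and `build_optimizer` are illustrative stand-ins, not code from the paper.

```python
import torch
import torch.nn as nn

class XMTLStub(nn.Module):
    """Tiny stand-in for the X-MTL backbone (illustrative, not the real model)."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(8, 8)    # stands in for bert-base-uncased
        self.visual_encoder = nn.Linear(8, 8)  # stands in for clip-vit-base-patch32
        self.head = nn.Linear(8, 4)            # "other parameters"

def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """AdamW with the learning rates reported in the paper:
    text encoder 1e-4, visual encoder 1e-5, everything else 1e-3."""
    text_params, visual_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if name.startswith("text_encoder"):
            text_params.append(param)
        elif name.startswith("visual_encoder"):
            visual_params.append(param)
        else:
            other_params.append(param)
    return torch.optim.AdamW([
        {"params": text_params, "lr": 1e-4},
        {"params": visual_params, "lr": 1e-5},
        {"params": other_params, "lr": 1e-3},
    ])

optimizer = build_optimizer(XMTLStub())
print([g["lr"] for g in optimizer.param_groups])  # [0.0001, 1e-05, 0.001]
```

The batch size of 64 and 10 training epochs from the quote would then be supplied to whatever data loader and training loop wraps this optimizer.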