WDMIR: Wavelet-Driven Multimodal Intent Recognition

Authors: Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu

IJCAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, yielding a 0.41% increase in recognition accuracy when analyzing subtle emotional cues. Our method significantly improves every metric on MIntRec and MELD-DA, confirming its effectiveness and generalizability.
Researcher Affiliation | Academia | (1) State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; (2) School of Computer Science, Liupanshui Normal University; (3) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. EMAIL, EMAIL, EMAIL. All email domains (.edu.cn) and institutional names indicate academic or public research institutions.
Pseudocode | No | The paper describes its methods with mathematical equations and textual explanations, but contains no clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that source code for the methodology will be released nor links to a code repository.
Open Datasets | Yes | We conduct experiments on two datasets, MIntRec [Zhang et al., 2022a] and MELD-DA [Saha et al., 2020].
Dataset Splits | Yes | MIntRec is a multimodal intent dataset containing text, video, and audio, with 2224 samples and 20 intent categories; it provides 1334, 445, and 445 samples for training, validation, and testing, respectively. MELD-DA is a multi-turn emotional conversation dataset containing text, video, and audio, with 9988 samples and 12 dialogue-act labels; it provides 6991, 999, and 1998 samples for training, validation, and testing, respectively.
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU models, CPU types) used to run the experiments; it mentions only pre-trained models and an optimizer.
Software Dependencies | No | The paper names specific pre-trained models (bert-base-uncased, wav2vec2-base-960h, Swin-Transformer), the Torchvision library, and the Adam optimizer, but provides no version numbers for the software stack (e.g., Python, PyTorch/TensorFlow, Torchvision).
Experiment Setup | Yes | Adam [Loshchilov, 2017] is used as the optimizer throughout the experiments. The training batch size is 16, and the validation and test batch sizes are both 8.
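The reported setup (Adam optimizer, training batch size 16, validation/test batch size 8) can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the dummy features, the `Linear` stand-in classifier, and the default Adam learning rate are assumptions; only the optimizer choice and batch sizes come from the paper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Values reported in the paper's experiment setup.
TRAIN_BATCH = 16  # training batch size
EVAL_BATCH = 8    # validation and test batch size

# Placeholder data: 128 samples of 32-d features, 20 intent classes
# (MIntRec has 20 intent categories; the feature shape is illustrative).
features = torch.randn(128, 32)
labels = torch.randint(0, 20, (128,))
dataset = TensorDataset(features, labels)

train_loader = DataLoader(dataset, batch_size=TRAIN_BATCH, shuffle=True)
eval_loader = DataLoader(dataset, batch_size=EVAL_BATCH)

model = torch.nn.Linear(32, 20)                   # stand-in classifier
optimizer = torch.optim.Adam(model.parameters())  # Adam, as reported
loss_fn = torch.nn.CrossEntropyLoss()

# One training pass over the data.
for x, y in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

Note the paper cites Adam via [Loshchilov, 2017]; since no learning rate or weight-decay settings are reported, the sketch leaves Adam at PyTorch defaults.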
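As a sanity check on the dataset-splits row above, the reported train/validation/test counts sum exactly to the stated totals for both datasets; all numbers below are taken directly from the report.

```python
# Reported splits: (train, val, test, stated total).
splits = {
    "MIntRec": (1334, 445, 445, 2224),
    "MELD-DA": (6991, 999, 1998, 9988),
}

for name, (train, val, test, total) in splits.items():
    # Verify the three partitions account for every sample.
    assert train + val + test == total, name
    print(f"{name}: {train} + {val} + {test} = {total}")
```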