Debiased Multimodal Understanding for Human Language Sequences

Authors: Zhi Xu, Dingkang Yang, Mingcheng Li, Yuzheng Wang, Zhaoyu Chen, Jiawei Chen, Jinjie Wei, Lihua Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module. Extensive experiments are conducted on three mainstream MLU benchmarks."
Researcher Affiliation | Academia | "Academy for Engineering and Technology, Fudan University, Shanghai, China."
Pseudocode | No | The paper describes its methodology in prose and mathematical formulations; there are no clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper neither states that the source code for the described methodology will be released nor provides a link to a code repository.
Open Datasets | Yes | "Extensive experiments are conducted on three mainstream MLU benchmarks. Concretely, MOSI (Zadeh et al. 2016) is a multimodal human sentiment analysis dataset consisting of 2,199 video segments. MOSEI (Zadeh and Pu 2018) is a large-scale human sentiment and emotion recognition benchmark containing 22,856 video clips from 1,000 different subjects and 250 diverse topics. UR-FUNNY (Hasan et al. 2019) is a multimodal human humor detection dataset that contains 16,514 video clips from 1,741 subjects collected by the TED portal."
Dataset Splits | Yes | "Concretely, MOSI (Zadeh et al. 2016) is a multimodal human sentiment analysis dataset consisting of 2,199 video segments. The standard data partitioning is 1,284 samples for training, 284 samples for validation, and 686 samples for testing. These samples contain a total of 89 distinct subjects from video blogs. Each sample is manually annotated with a sentiment score ranging from -3 to 3. MOSEI (Zadeh and Pu 2018) is a large-scale human sentiment and emotion recognition benchmark containing 22,856 video clips from 1,000 different subjects and 250 diverse topics. Among these samples, 16,326, 1,871, and 4,659 samples are used as training, validation and testing sets. UR-FUNNY (Hasan et al. 2019) is a multimodal human humor detection dataset that contains 16,514 video clips from 1,741 subjects collected by the TED portal. There are 10,598, 2,626, and 3,290 samples in the training, validation, and testing sets."
Hardware Specification | Yes | "We implement the selected methods and SuCI on NVIDIA Tesla A800 GPUs utilizing the PyTorch toolbox, where other training settings are aligned to their original protocols."
Software Dependencies | No | "The text feature extractor is instantiated by pre-trained GloVe word embedding tool (Pennington, Socher, and Manning 2014) to obtain 300-dimensional linguistic vectors. For MOSI & MOSEI, we use the library Facet (iMotions 2017) to extract an ensemble of visual features... Meanwhile, OpenFace (Baltrušaitis, Robinson, and Morency 2016) is utilized on UR-FUNNY... The audio feature extraction is executed utilizing the software COVAREP (Degottex et al. 2014)... We implement the selected methods and SuCI on NVIDIA Tesla A800 GPUs utilizing the PyTorch toolbox..."
Experiment Setup | Yes | "In the SuCI implementation, the hidden dimensions dh and dn are set to 64 and 128, respectively. The size ds of each subject confounder is 325, 409, and 456 on MOSI, MOSEI, and UR-FUNNY, respectively."
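The split sizes and hyperparameters quoted above lend themselves to a quick arithmetic sanity check. The sketch below is illustrative only: the counts and dimensions are transcribed from the quoted text, while the variable names (`reported_splits`, `hidden_dims`, `confounder_size`) are hypothetical and do not come from the authors' code. Note that summing each dataset's quoted splits reproduces the quoted totals for MOSEI and UR-FUNNY, while the quoted MOSI splits sum to 2,254 rather than the quoted 2,199 segments.

```python
# Split counts as quoted in the Dataset Splits row:
# dataset -> (train, validation, test, reported total)
reported_splits = {
    "MOSI":     (1_284,    284,   686,  2_199),
    "MOSEI":    (16_326, 1_871, 4_659, 22_856),
    "UR-FUNNY": (10_598, 2_626, 3_290, 16_514),
}

# Hyperparameters as quoted in the Experiment Setup row
# (names here are illustrative, not the authors'):
hidden_dims = {"d_h": 64, "d_n": 128}
confounder_size = {"MOSI": 325, "MOSEI": 409, "UR-FUNNY": 456}

for name, (train, val, test, total) in reported_splits.items():
    s = train + val + test
    status = "OK" if s == total else f"mismatch (splits sum to {s})"
    print(f"{name}: {status}")
```

Running the check prints "OK" for MOSEI and UR-FUNNY and flags the MOSI discrepancy, which may simply be a transcription artifact in the quoted validation count.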