Debiased Multimodal Understanding for Human Language Sequences

Authors: Zhi Xu, Dingkang Yang, Mingcheng Li, Yuzheng Wang, Zhaoyu Chen, Jiawei Chen, Jinjie Wei, Lihua Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module. Extensive experiments are conducted on three mainstream MLU benchmarks."
Researcher Affiliation | Academia | "Academy for Engineering and Technology, Fudan University, Shanghai, China."
Pseudocode | No | The paper describes its methodology in prose and mathematical formulations; there are no clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper neither states that the source code for the described methodology will be released nor provides a link to a code repository.
Open Datasets | Yes | "Extensive experiments are conducted on three mainstream MLU benchmarks. Concretely, MOSI (Zadeh et al. 2016) is a multimodal human sentiment analysis dataset consisting of 2,199 video segments. MOSEI (Zadeh and Pu 2018) is a large-scale human sentiment and emotion recognition benchmark containing 22,856 video clips from 1,000 different subjects and 250 diverse topics. UR-FUNNY (Hasan et al. 2019) is a multimodal human humor detection dataset that contains 16,514 video clips from 1,741 subjects collected by the TED portal."
Dataset Splits | Yes | "Concretely, MOSI (Zadeh et al. 2016) is a multimodal human sentiment analysis dataset consisting of 2,199 video segments. The standard data partitioning is 1,284 samples for training, 284 samples for validation, and 686 samples for testing. These samples contain a total of 89 distinct subjects from video blogs. Each sample is manually annotated with a sentiment score ranging from -3 to 3. MOSEI (Zadeh and Pu 2018) is a large-scale human sentiment and emotion recognition benchmark containing 22,856 video clips from 1,000 different subjects and 250 diverse topics. Among these samples, 16,326, 1,871, and 4,659 samples are used as training, validation and testing sets. UR-FUNNY (Hasan et al. 2019) is a multimodal human humor detection dataset that contains 16,514 video clips from 1,741 subjects collected by the TED portal. There are 10,598, 2,626, and 3,290 samples in the training, validation, and testing sets."
Hardware Specification | Yes | "We implement the selected methods and SuCI on NVIDIA Tesla A800 GPUs utilizing the PyTorch toolbox, where other training settings are aligned to their original protocols."
Software Dependencies | No | "The text feature extractor is instantiated by pre-trained GloVe word embedding tool (Pennington, Socher, and Manning 2014) to obtain 300-dimensional linguistic vectors. For MOSI & MOSEI, we use the library Facet (iMotions 2017) to extract an ensemble of visual features... Meanwhile, OpenFace (Baltrušaitis, Robinson, and Morency 2016) is utilized on UR-FUNNY... The audio feature extraction is executed utilizing the software COVAREP (Degottex et al. 2014)... We implement the selected methods and SuCI on NVIDIA Tesla A800 GPUs utilizing the PyTorch toolbox..."
Experiment Setup | Yes | "In the SuCI implementation, the hidden dimensions dh and dn are set to 64 and 128, respectively. The size ds of each subject confounder is 325, 409, and 456 on MOSI, MOSEI, and UR-FUNNY, respectively."
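The split sizes and hyperparameters quoted above lend themselves to a quick arithmetic sanity check. The sketch below is illustrative only: the counts and dimensions are transcribed from the quoted text, while the variable names (`reported_splits`, `hidden_dims`, `confounder_size`) are hypothetical and do not come from the authors' code. Note that summing each dataset's quoted splits reproduces the quoted totals for MOSEI and UR-FUNNY, while the quoted MOSI splits sum to 2,254 rather than the quoted 2,199 segments.

```python
# Split counts as quoted in the Dataset Splits row:
# dataset -> (train, validation, test, reported total)
reported_splits = {
    "MOSI":     (1_284,    284,   686,  2_199),
    "MOSEI":    (16_326, 1_871, 4_659, 22_856),
    "UR-FUNNY": (10_598, 2_626, 3_290, 16_514),
}

# Hyperparameters as quoted in the Experiment Setup row
# (names here are illustrative, not the authors'):
hidden_dims = {"d_h": 64, "d_n": 128}
confounder_size = {"MOSI": 325, "MOSEI": 409, "UR-FUNNY": 456}

for name, (train, val, test, total) in reported_splits.items():
    s = train + val + test
    status = "OK" if s == total else f"mismatch (splits sum to {s})"
    print(f"{name}: {status}")
```

Running the check prints "OK" for MOSEI and UR-FUNNY and flags the MOSI discrepancy, which may simply be a transcription artifact in the quoted validation count.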