RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

Authors: Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ, and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers.
Researcher Affiliation | Collaboration | 1 Beihang University, Beijing, China; 2 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 3 Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China; 4 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, Beijing, China; 5 Xiamen University, Xiamen, China; 6 Tencent, Shenzhen, China. Correspondence to: Di Hu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Rolling Query Algorithm.
Open Source Code | Yes | The source code is available at GitHub.
Open Datasets | Yes | CREMA-D (Cao et al., 2014) is an audio-visual dataset designed for emotion recognition... Kinetic-Sound (Arandjelovic & Zisserman, 2017) consists of 31 human action classes... CMU-MOSEI (Zadeh et al., 2018) is a multimodal dataset that integrates audio, visual, and textual data.
Dataset Splits | Yes | For CMU-MOSEI (Zadeh et al., 2018), we adopt a 4-layer vanilla Transformer following the preprocessing and settings outlined in Liang et al. (2021). For the CREMA-D and Kinetic-Sound datasets, we use a 4-layer ViT-B/16 (Dosovitskiy, 2020) as the backbone, initializing it with pre-trained weights from ImageNet21k (Ridnik et al., 2021).
Hardware Specification | No | This work is also supported by Public Computing Cloud, Renmin University of China, and fund for building world-class universities (disciplines) of Renmin University of China.
Software Dependencies | No | For the Kinetic-Sound dataset, which consists of 10-second video clips, we extract frames at 1 fps and uniformly sample 3 frames per clip as visual inputs. The audio data is transformed into spectrograms of size 257×1,004 using librosa, with a window length of 512 and an overlap of 353. For CREMA-D, which contains shorter clips, we extract 1 frame per clip and process the audio into spectrograms of size 257×299, maintaining the same window length and overlap. This approach ensures consistency across datasets while leveraging the strengths of ViT for both visual and audio modalities. For the CMU-MOSEI dataset, we follow the preprocessing and settings of Liang et al. (2021), using the extracted features provided by that work. For training, the batch size is set to 64 for MOSEI, CREMA-D, and Kinetic-Sound. The learning rate is fixed at 1e-3, and SGD is used as the optimizer. The embedding dimensions are 120 for MOSEI and 768 for CREMA-D and Kinetic-Sound, and the cosine annealing scheduler is applied across all settings. Training is initialized from scratch for MOSEI, while pretrained models are used for CREMA-D and Kinetic-Sound, with all models trained for 30 epochs. This comprehensive setup ensures consistency and highlights the adaptability of the ViT backbone to different datasets and fusion methods. The GFLOPs are obtained from the thop library.
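The spectrogram parameters reported above (window length 512, overlap 353, output size 257×1,004 for 10-second clips) can be checked with a minimal NumPy STFT sketch. The paper uses librosa; this stand-in only illustrates how those numbers arise, and the 16 kHz sampling rate is an assumption not stated in the summary.

```python
import numpy as np

def spectrogram(y, win_length=512, overlap=353):
    """Magnitude STFT sketch matching the reported parameters."""
    hop = win_length - overlap  # 512 - 353 = 159 samples between frames
    window = np.hanning(win_length)
    n_frames = 1 + (len(y) - win_length) // hop
    frames = np.stack([y[i * hop : i * hop + win_length] * window
                       for i in range(n_frames)])
    # One-sided FFT gives win_length // 2 + 1 = 257 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

clip = np.random.randn(16_000 * 10)  # a 10-second clip at an assumed 16 kHz
S = spectrogram(clip)
print(S.shape)  # (257, 1004), matching the reported spectrogram size
```

With these parameters, a 160,000-sample clip yields 1 + (160000 - 512) // 159 = 1,004 frames, consistent with the 257×1,004 size quoted above.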
Experiment Setup | Yes | For training, we use SGD as the optimizer with a cosine annealing scheduler across all settings. The learning rate is set to 1e-3 with batch size 64 for all experiments. The embedding dimensions are 120 for MOSEI and 768 for CREMA-D and Kinetic-Sound. Training is initialized from scratch for MOSEI, while pretrained models are used for CREMA-D and Kinetic-Sound, with all models trained for 30 epochs.
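The learning-rate schedule described above (base rate 1e-3, cosine annealing over 30 epochs) can be sketched in closed form. This assumes per-epoch stepping and a minimum learning rate of 0, neither of which is stated in the summary.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=30, base_lr=1e-3, eta_min=0.0):
    # Half a cosine: starts at base_lr, decays smoothly to eta_min.
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealing_lr(0))   # 0.001 at the start of training
print(cosine_annealing_lr(15))  # ~0.0005 at the midpoint
```

In a PyTorch setup this would typically correspond to `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around the SGD optimizer.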