Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization

Authors: Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, Shinji Watanabe

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that the proposed approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition. Sections such as '4 Experimental Settings', '5 Experimental Results', and '5.3 Ablation Studies' further confirm the empirical nature of this work.
Researcher Affiliation | Academia | The authors are affiliated with 'Carnegie Mellon University' and the 'Gaoling School of Artificial Intelligence, Renmin University of China', both academic institutions. The email domains 'ruc.edu.cn' and 'andrew.cmu.edu' also indicate academic affiliations.
Pseudocode | No | The paper describes the proposed method, including the construction of bifocal preference datasets and the optimization objectives, using textual descriptions and mathematical formulations (e.g., Equations 2, 3, and 4). However, it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The paper provides a direct link to a GitHub repository: 'Code https://github.com/espnet/espnet'.
Open Datasets | Yes | The paper explicitly mentions and cites well-known public datasets: 'How2 (Sanabria et al. 2018)', 'VisSpeech (Gabeur et al. 2022)', and 'Ego4D (Grauman et al. 2022)'.
Dataset Splits | Yes | The paper refers to standard, predefined splits by naming specific benchmarks and versions: 'we use the 300-hour version of How2', and for Ego4D, 'We use the audiovisual diarization benchmark and evaluate our model on the validation set with ground truth annotations.' This indicates the use of reproducible splits associated with these benchmarks.
Hardware Specification | Yes | The paper specifies: 'We conduct the training using 1 V100 GPU...' and 'Experiments of this work used the Bridges2 system at PSC and Delta system at NCSA through allocations CIS210014 and IRI120008P from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program.'
Software Dependencies | No | The paper names specific models, 'OWSM v3.1' as the backbone and 'CLIP-Large' as the visual encoder, but it does not provide version numbers for underlying software dependencies such as the programming language, libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | The paper provides specific hyperparameters and training configurations: 'we fine-tune OWSM v3.1 for 10 epochs with a batch size of 64. We set α to 0.3 in Equation 3. In the BPO fine-tuning stage, we set β to 0.1 in Equation 4. We conduct the training using 1 V100 GPU, with a batch size of 512 and a learning rate of 2e-6.'
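The β = 0.1 in the BPO fine-tuning stage suggests a DPO-style preference objective over (preferred, dispreferred) transcript pairs. The paper's actual Equation 4 is not reproduced in this report, so the following minimal pure-Python sketch assumes the standard direct-preference-optimization form; the function name and argument names are illustrative, not from the paper.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def preference_loss(logp_chosen: float, logp_rejected: float,
                    ref_logp_chosen: float, ref_logp_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss for one (chosen, rejected) hypothesis pair.

    logp_* are sequence log-probabilities under the model being tuned;
    ref_logp_* are under the frozen reference model. beta = 0.1 mirrors
    the paper's reported setting, but the exact form of its Equation 4
    is an assumption here, not a reproduction of it.
    """
    # Implicit-reward margin: how much more the tuned model prefers the
    # chosen hypothesis over the rejected one, relative to the reference.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(sigmoid(beta * margin))
```

A positive margin (the tuned model favors the preferred transcript more strongly than the reference model does) drives the loss toward zero, while a negative margin inflates it, which is the behavior a preference-optimization stage needs.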