Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization

Authors: Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, Shinji Watanabe

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that the proposed approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition. Sections such as '4 Experimental Settings', '5 Experimental Results', and '5.3 Ablation Studies' further confirm the empirical nature of this work.
Researcher Affiliation | Academia | The authors are affiliated with 'Carnegie Mellon University' and the 'Gaoling School of Artificial Intelligence, Renmin University of China', both academic institutions. The email domains 'ruc.edu.cn' and 'andrew.cmu.edu' also indicate academic affiliations.
Pseudocode | No | The paper describes the proposed method, including the construction of bifocal preference datasets and the optimization objectives, using textual descriptions and mathematical formulations (e.g., Equations 2, 3, and 4). However, it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The paper provides a direct link to a GitHub repository: 'Code https://github.com/espnet/espnet'.
Open Datasets | Yes | The paper explicitly mentions and cites well-known public datasets: 'How2 (Sanabria et al. 2018)', 'VisSpeech (Gabeur et al. 2022)', and 'Ego4D (Grauman et al. 2022)'.
Dataset Splits | Yes | The paper refers to standard, predefined splits by naming specific benchmarks and versions: 'we use the 300-hour version of How2', and for Ego4D, 'We use the audiovisual diarization benchmark and evaluate our model on the validation set with ground truth annotations.' This indicates the use of reproducible splits associated with these benchmarks.
Hardware Specification | Yes | The paper specifies: 'We conduct the training using 1 V100 GPU...' and 'Experiments of this work used the Bridges2 system at PSC and Delta system at NCSA through allocations CIS210014 and IRI120008P from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program.'
Software Dependencies | No | The paper names specific models, 'OWSM v3.1' as the backbone and 'CLIP-Large' as the visual encoder, but it does not provide version numbers for underlying software dependencies such as the programming language, libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | The paper provides specific hyperparameters and training configurations: 'we fine-tune OWSM v3.1 for 10 epochs with a batch size of 64. We set α to 0.3 in Equation 3. In the BPO fine-tuning stage, we set β to 0.1 in Equation 4. We conduct the training using 1 V100 GPU, with a batch size of 512 and a learning rate of 2e-6.'
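The β = 0.1 in the BPO fine-tuning stage suggests a DPO-style preference objective over (preferred, dispreferred) transcript pairs. The paper's actual Equation 4 is not reproduced in this report, so the following minimal pure-Python sketch assumes the standard direct-preference-optimization form; the function name and argument names are illustrative, not from the paper.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def preference_loss(logp_chosen: float, logp_rejected: float,
                    ref_logp_chosen: float, ref_logp_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss for one (chosen, rejected) hypothesis pair.

    logp_* are sequence log-probabilities under the model being tuned;
    ref_logp_* are under the frozen reference model. beta = 0.1 mirrors
    the paper's reported setting, but the exact form of its Equation 4
    is an assumption here, not a reproduction of it.
    """
    # Implicit-reward margin: how much more the tuned model prefers the
    # chosen hypothesis over the rejected one, relative to the reference.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(sigmoid(beta * margin))
```

A positive margin (the tuned model favors the preferred transcript more strongly than the reference model does) drives the loss toward zero, while a negative margin inflates it, which is the behavior a preference-optimization stage needs.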