MTGA: Multi-View Temporal Granularity Aligned Aggregation for Event-Based Lip-Reading

Authors: Wenhao Zhang, Jun Wang, Yong Luo, Lei Yu, Wei Yu, Zheng He, Jialie Shen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our method outperforms both the event-based and video-based lip-reading counterparts. We verify the effectiveness of our proposed method on the DVS-Lip (Tan et al. 2022) dataset, and the experiments demonstrate that our model can significantly outperform the state of the art in event-based lip-reading recognition; for example, we obtain a 4.1% relative improvement in overall accuracy compared to the most competitive counterpart. In addition, we conducted experiments on the DVS128-Gait-Day dataset, and the results show that our model generalizes well. Our experimental results, along with those of other methods, are presented in Table 1, which illustrates that our Multi-view Temporal Granularity Aligned Aggregation significantly outperforms other action and object recognition methods on lip-reading tasks. Ablation studies: Effects of Branch — we compared our fused model with individual branches in Table 3. Effects of Fusion Methods — we further explored the impact of different fusion methods; Table 4 displays the results. Effects of Temporal Aggregation Module — we conducted ablation experiments controlling whether position encoding and the Self-Attention module are used. The results in Table 5 demonstrate that, compared to using only Bi-GRU (M1) as the back-end network, accuracy improves by 0.59% with position encoding (M2), 1.31% with the Self-Attention module (M3), and 1.52% with both modules combined.
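The Temporal Aggregation Module ablation varies two add-ons to a Bi-GRU back-end: position encoding and self-attention. As a minimal stdlib-only sketch of those two components — assuming a standard sinusoidal encoding and single-head attention with identity Q/K/V projections, neither of which the paper's exact design is reproduced from here:

```python
import math

def positional_encoding(seq_len, d_model):
    # Sinusoidal position encoding; the paper does not specify its exact
    # encoding, so this standard form is an assumption.
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x):
    # Single-head self-attention with identity Q/K/V projections, purely to
    # show the temporal aggregation step; a real model learns these weights.
    d = len(x[0])
    scale = math.sqrt(d)
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

# Toy sequence of 4 frame features (d_model = 8), standing in for Bi-GRU output.
seq = [[0.1 * (t + j) for j in range(8)] for t in range(4)]
pe = positional_encoding(4, 8)
seq_pe = [[a + b for a, b in zip(f, p)] for f, p in zip(seq, pe)]
agg = self_attention(seq_pe)
```

Each output frame is a convex combination of the position-encoded inputs, which is the property the M2/M3 ablation rows isolate.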
Researcher Affiliation | Academia | 1 School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 2 School of Electronics and Information, Wuhan University, China; 3 Department of Computer Science, City, University of London, United Kingdom. EMAIL, EMAIL
Pseudocode | No | The paper describes the 'algorithm of the fusion module' using mathematical equations (4) and (5) and describes steps in paragraph text, but it does not present a structured pseudocode block or a clearly labeled algorithm section.
Open Source Code | Yes | Code: https://github.com/whu125/MTGA
Open Datasets | Yes | We verify the effectiveness of our proposed method on the DVS-Lip (Tan et al. 2022) dataset, and the experiments demonstrate that our model can significantly outperform the state of the art in event-based lip-reading recognition. In addition, we conducted experiments on the DVS128-Gait-Day dataset. The DVS128-Gait-Day (Wang et al. 2021) dataset contains 4,000 gait event samples from 20 volunteers.
Dataset Splits | No | The paper mentions using the DVS-Lip dataset and the DVS128-Gait-Day dataset, and refers to a 'DVS-Lip test set' in Table 1. It states 'Since we utilized the same dataset and evaluation method as MSTP (Tan et al. 2022)', implying existing splits were used. However, it does not explicitly provide percentages, sample counts, or a detailed split methodology for training, validation, or testing on either dataset.
Hardware Specification | No | The paper mentions that 'The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University,' but it does not provide specific details such as GPU/CPU models, memory, or other precise hardware specifications.
Software Dependencies | No | The paper describes the methodology and model architecture but does not list any specific software dependencies or libraries with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions) that would be needed to replicate the experiments.
Experiment Setup | No | The paper describes the model architecture and ablation studies but does not provide concrete hyperparameter values such as learning rate, batch size, optimizer settings, or number of training epochs, nor other system-level training configurations needed to reproduce the experiments. It only mentions 'under the same number of training epochs' when comparing models on DVS128-Gait-Day.
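To make the gap concrete, a reproduction attempt would need at least the settings below. Every value shown is a placeholder illustrating what the paper would have to report, not the authors' actual configuration:

```python
# Hypothetical training configuration: every value is a placeholder showing
# the fields a reproducible setup must specify; the paper states none of them.
train_config = {
    "optimizer": "Adam",       # placeholder, not stated in the paper
    "learning_rate": 3e-4,     # placeholder
    "batch_size": 32,          # placeholder
    "epochs": 80,              # placeholder
    "lr_schedule": "cosine",   # placeholder
    "weight_decay": 1e-4,      # placeholder
    "seed": 0,                 # placeholder
}

def missing_fields(config, required):
    # Report which required settings a described setup fails to specify.
    return [k for k in required if k not in config]
```

A checklist like `missing_fields` is how the assessment above could be mechanized: run it against whatever settings a paper actually reports.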