OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Authors: Yiheng Yu, Sheng Liu, Yuan Feng, Min Xu, Zhelun Jin, Xuhua Yang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. OLMD shows SOTA performance on three large-scale datasets (PHOENIX14 (Forster et al. 2015), PHOENIX14-T (Camgoz et al. 2018), and CSL-Daily (Zhou et al. 2021)), improving the word error rate (WER) on PHOENIX14 by an absolute 1.6% over the previous SOTA; Fig. 1b highlights this result.
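For context on the headline metric: WER in CSLR is the word-level edit distance (substitutions + insertions + deletions) between the predicted and reference gloss sequences, normalized by the reference length. A minimal sketch of this standard computation (illustrative code, not the authors' implementation):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion over 4 words = 0.5
```

Note that WER can exceed 1.0 when the hypothesis requires more edits than there are reference words, which is why "absolute 1.6%" improvements are reported on this scale.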
Researcher Affiliation: Academia. Zhejiang University of Technology.
Pseudocode: No. The paper describes the methodology with mathematical equations and block diagrams (e.g., Figures 2 and 3), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper neither states that source code will be released nor provides a link to a code repository for the described methodology.
Open Datasets: Yes. The paper mainly uses three large CSLR datasets: PHOENIX14 (Forster et al. 2015), a popular CSLR dataset of German TV weather reports with a vocabulary of 1,295 words, recorded by 9 signers; PHOENIX14-T (Camgoz et al. 2018), an expanded version of PHOENIX14 offering 7,096 training, 519 development (Dev), and 642 testing (Test) videos; and CSL-Daily (Zhou et al. 2021), a large-scale Chinese sign language dataset filmed by 10 different signers, covering a variety of daily-life themes with over 20,000 sentences.
Dataset Splits: Yes. PHOENIX14 includes 5,672 training, 540 development (Dev), and 629 testing (Test) videos; PHOENIX14-T offers 7,096 training, 519 development, and 642 testing videos; CSL-Daily is split into 18,401 training, 1,077 development, and 1,176 test samples, with vocabularies of 2,000 signs and 2,343 Chinese text words.
Hardware Specification: Yes. All training and testing are completed on a single NVIDIA A6000 GPU.
Software Dependencies: No. The paper mentions models such as ResNet34, 1D-CNNs, and BiLSTM, and the Adam optimizer, but does not give version numbers for the software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation.
Experiment Setup: Yes. During training, the batch size is 2 and the initial learning rate is 0.001, reduced to 30% of its value at epochs 25 and 40. The Adam optimizer is used with a weight decay of 0.001, for a total of 70 epochs. All input frames are first resized to 256×256 and then randomly cropped to 224×224 during training, with a 50% chance of horizontal flipping and a 20% probability of temporal scale adjustment. For inference, a central 224×224 crop is used.
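The stated schedule ("reduced to 30% of its value at epochs 25 and 40") reads as a step decay with factor 0.3, which in PyTorch would correspond to MultiStepLR(milestones=[25, 40], gamma=0.3). A minimal sketch of the schedule under that reading (function name and reading are assumptions, not the authors' code):

```python
def lr_at_epoch(epoch, base_lr=1e-3, milestones=(25, 40), gamma=0.3):
    """Step-decay schedule: the learning rate is multiplied by `gamma`
    (here 0.3, i.e. 'reduced to 30%') at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# 70 epochs total, per the paper: roughly 1e-3 for epochs 0-24,
# 3e-4 for epochs 25-39, and 9e-5 for epochs 40-69.
schedule = [lr_at_epoch(e) for e in range(70)]
```

An alternative reading ("reduced by 30%", i.e. gamma=0.7) is possible from the phrasing, but gamma=0.3 is the more common convention for this kind of milestone schedule.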