YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Authors: Garrett Tanzer, Biao Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.
Researcher Affiliation: Industry. Garrett Tanzer, Google DeepMind; Biao Zhang, Google DeepMind. Correspondence to EMAIL.
Pseudocode: No. The paper includes 'Figure 3: Unified document-level sign-to-text training, extended for multilinguality', which is a diagram illustrating the task format and token structure, not a block of pseudocode or a clearly labeled algorithm.
Open Source Code: No. We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. Note that this license only applies to the video IDs and ISO 639-3 language codes, which we selected and labelled. The underlying video and caption content, as with all datasets consisting of YouTube video IDs, is subject to different licenses and should be accessed/used in accordance with the YouTube Terms of Service.
Open Datasets: Yes. We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. We publicly release the YouTube-SL-25 video IDs at this link.
Dataset Splits: Yes. For translation, the model is separately finetuned for each dataset, with the checkpoint selected based on BLEU on the validation set. For sign language identification, zero-shot scores mean that the model is briefly finetuned on YouTube-SL-25 rebalanced to the 4 sign languages with equal weight, and finetuned scores mean that the model is finetuned on an equally weighted mixture of the benchmarks' training sets. We don't finetune on FLEURS-ASL, so the finetuned langid scores are after finetuning on How2Sign.
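The paper does not give code for this selection procedure; a minimal sketch of BLEU-based checkpoint selection, with hypothetical scores and a helper name of our own choosing, might look like:

```python
# Minimal sketch of BLEU-based checkpoint selection as described above.
# `select_checkpoint` and the scores below are illustrative, not from the paper.

def select_checkpoint(val_bleu_by_step: dict) -> int:
    """Return the step whose checkpoint scored highest validation BLEU
    (ties broken in favor of the earlier checkpoint)."""
    return min(val_bleu_by_step, key=lambda step: (-val_bleu_by_step[step], step))

# Hypothetical validation BLEU per saved checkpoint:
val_bleu = {1000: 10.2, 2000: 12.7, 3000: 12.5}
print(select_checkpoint(val_bleu))  # 2000
```

In practice the BLEU values would come from decoding the validation set at each checkpoint (e.g. with sacreBLEU); only the argmax logic is shown here.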
Hardware Specification: Yes. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively (switching from pure caption-level training to 1:1 caption-level:random clip-level training once the model appeared to have converged, then stopping again after re-convergence, both according to BLEU on the How2Sign val set, like in FLEURS-ASL [34]). Each 1k steps took about 8 minutes to train. We also pretrained an mT5 Small model for about 600k steps, which was underperforming, so we didn't run the complete set of experiments for it.
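As a back-of-the-envelope check on these figures (our arithmetic, not the paper's), the reported throughput of about 8 minutes per 1k steps implies roughly the following wall-clock training times:

```python
# Wall-clock estimates from the reported throughput of ~8 minutes per 1k
# steps; the step counts come from the quote above.
MINUTES_PER_1K_STEPS = 8

def training_hours(total_steps: int) -> float:
    """Estimated wall-clock training time in hours."""
    return total_steps / 1000 * MINUTES_PER_1K_STEPS / 60

print(training_hours(210_000))            # ASL-set T5v1.1 Small: 28.0 hours
print(round(training_hours(430_000), 1))  # Full-set T5v1.1 Small: ~57.3 hours
print(training_hours(600_000))            # mT5 Small: 80.0 hours
```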
Software Dependencies: Yes. We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets)... We tried to change the pretrained model from T5-v1.1 Small [26] to mT5 Small [37] so that languages besides English could benefit from pretraining and better tokenization, but in initial experiments mT5 took about 1/3 more steps to converge and achieved worse results.
Experiment Setup: Yes. We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively... We finetuned the sentence-level translation models on 16 TPUv3s with a batch size of 32 until convergence; this took about 10k steps for WMT23 SS DSGS and at most 2.5k steps for the other datasets. We finetuned the language identification models on a mixture of data for the four languages with equal weight... We used 16 TPUv3s with a batch size of 32 until convergence, with up to 3k steps.
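The translation-finetuning hyperparameters quoted above can be collected into a single config record. This is an illustrative summary only: the field names are our own, and only the values are taken from the paper.

```python
# Illustrative config record for the quoted translation-finetuning setup.
# Field names are hypothetical; values are from the quote above.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TranslationFinetuneConfig:
    optimizer: str = "Adafactor"
    base_learning_rate: float = 1e-3
    tpu_v3_chips: int = 16
    batch_size: int = 32
    # Approximate steps to convergence, per dataset:
    steps_to_converge: dict = field(default_factory=lambda: {
        "WMT23 SS DSGS": 10_000,
        "other datasets (at most)": 2_500,
    })

cfg = TranslationFinetuneConfig()
print(cfg.batch_size, cfg.base_learning_rate)  # 32 0.001
```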