Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Certification of Speaker Recognition Models to Additive Perturbations

Authors: Dmitrii Korzh, Elvir Karimov, Mikhail Pautov, Oleg Y. Rogov, Ivan Oseledets

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our theoretical claims are supported by experimental results on the VoxCeleb datasets using several well-known speaker recognition models. To the best of our knowledge, there are no previous works that present the provable robustness of speaker recognition models. We highlight this issue and provide starting baselines that others can improve in future research.
Researcher Affiliation Collaboration AIRI, Moscow, Russia; Skolkovo Institute of Science and Technology, Moscow, Russia; Moscow Technical University of Communications and Informatics, Moscow, Russia; ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia
Pseudocode Yes Algorithm 1: Computation of the certified radius.
Open Source Code Yes Code: https://github.com/AIRI-Institute/asi-certification
Open Datasets Yes For our experiments, we used the VoxCeleb1 (Nagrani, Chung, and Zisserman 2017) and VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) datasets, which are standard for speaker recognition and verification tasks.
Dataset Splits Yes For the certification procedure in Algorithm 1, the default parameters are the following: standard deviation of additive noise used for smoothing σ = 10^-2, the maximum number of samples to construct ĝ is set to be N_max = 10^5, the confidence level α = 10^-3, number of enrolled speakers is K = 1118, number of random audios used to create the speaker enrollment vector M = 5, and length of given audios is set to be 3 s with sampling rate 16 kHz; number of speakers in the test set S_i is 118 (VoxCeleb2 Test).
Hardware Specification No The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU/CPU models or processor types.
Software Dependencies No Experiments were conducted using various backbone embedding models: ECAPA-TDNN (Desplanques, Thienpondt, and Demuynck 2020) from the SpeechBrain framework (Ravanelli et al. 2021), which uses a Mel-spectrogram frontend, and the pyannote framework (Bredin et al. 2020), which focuses on speaker diarization and uses the raw-waveform SincNet frontend. These models transform speech into vector representations of dimensions d = 192 and 512, respectively. While frameworks are mentioned, specific version numbers for these or other software dependencies are not provided.
Experiment Setup Yes For the certification procedure in Algorithm 1, the default parameters are the following: standard deviation of additive noise used for smoothing σ = 10^-2, the maximum number of samples to construct ĝ is set to be N_max = 10^5, the confidence level α = 10^-3, number of enrolled speakers is K = 1118, number of random audios used to create the speaker enrollment vector M = 5, and length of given audios is set to be 3 s with sampling rate 16 kHz; number of speakers in the test set S_i is 118 (VoxCeleb2 Test). For the evaluation, we considered K enrolled speakers and, for each of them, created an enrollment vector c_k ∈ S_c from M randomly sampled enrollment audios of the speaker, which are presented in S_e.
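The quoted setup can be summarized in a short sketch: the default certification parameters from the paper, plus the standard Cohen-style randomized-smoothing certified radius that a procedure like Algorithm 1 typically computes. This is a minimal illustration under assumptions, not the authors' code; the variable names and the exact radius formula used in the paper may differ.

```python
from statistics import NormalDist

# Default certification parameters as reported in the experiment setup.
# Keys are illustrative names chosen here, not identifiers from the repo.
CERT_PARAMS = {
    "sigma": 1e-2,          # std of additive Gaussian noise for smoothing
    "n_max": 10**5,         # max samples used to construct the smoothed model g-hat
    "alpha": 1e-3,          # confidence level
    "num_enrolled_K": 1118, # enrolled speakers
    "enroll_audios_M": 5,   # audios per speaker enrollment vector
    "audio_len_s": 3,
    "sample_rate_hz": 16_000,
}

def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """Randomized-smoothing radius in the Cohen et al. form:
    R = sigma/2 * (Phi^-1(p_a) - Phi^-1(p_b)),
    where p_a / p_b are lower/upper confidence bounds on the top two
    class probabilities. Assumed form for illustration only."""
    phi_inv = NormalDist().inv_cdf  # inverse standard normal CDF
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))

# With a confident prediction, the certified radius is positive but small
# at sigma = 1e-2; with no margin (p_a == p_b) the radius collapses to zero.
r = certified_radius(0.9, 0.1, CERT_PARAMS["sigma"])
```

In practice the probability bounds would come from n ≤ N_max Monte Carlo samples of the smoothed speaker-recognition model, with the confidence level α controlling the bound width.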