Student: Gemma Meseguer Castillo
Advisors: Christoforos Souganidis, Eva Navas Cordón and Inma Hernáez Rioja
Thesis Defense Date: 03/07/2025
Within the field of Speech Processing, Speaker Diarization has become increasingly important in recent years. Frameworks such as NVIDIA NeMo offer Speaker Diarization configurations tailored to domains such as telephone conversations and recorded meetings. The broadcast domain, however, is not as well supported.
This work investigates whether NVIDIA NeMo can be used effectively for Broadcast Speaker Diarization, focusing on its Voice Activity Detection (VAD) and Multi-scale Diarization Decoder (MSDD) modules, and which factors influence the performance of the system.
Five experiments were performed, covering all combinations of VAD models and diarizers under two configurations: general and telephonic. The best result using only NeMo models was obtained by combining the fine-tuned Frame-VAD model with the fine-tuned MSDD model under the telephonic configuration (45.86%). The best combination across all experiments was the Pyannote Segmentation Model, used as an external VAD, with the clustering diarizer, also under the telephonic configuration (38.80%). Post-processing further improved this result to 23.57%. Finally, a statistical analysis confirmed that the television genre and the number of speakers significantly influence the performance of the Speaker Diarization system.