{"id":3947,"date":"2025-07-03T12:17:00","date_gmt":"2025-07-03T10:17:00","guid":{"rendered":"https:\/\/aholab.ehu.eus\/aholab\/?p=3947"},"modified":"2025-09-19T12:35:08","modified_gmt":"2025-09-19T10:35:08","slug":"","status":"publish","type":"post","link":"https:\/\/aholab.ehu.eus\/aholab\/eu\/gemma-meseguer-speaker-diarization-in-broadcast-audio-using-nvdia-nemo-models\/","title":{"rendered":"Gemma Meseguer: Speaker Diarization in Broadcast Audio using NVIDIA NeMo Models","raw":"Gemma Meseguer: Speaker Diarization in Broadcast Audio using NVIDIA NeMo Models"},"content":{"rendered":"\n<p><strong>Ikaslea:<\/strong> Gemma Meseguer Castillo<br><strong>Zuzendariak: <\/strong>Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja <br><strong>Defentsa-data: <\/strong>03\/07\/2025<\/p>\n\n\n\n<p>Master amaierako lan honek, NVIDIA NeMo plataformaren ereduak erabiliz, esatarien diarizazioa audio-emisioan du ardatz. Helburu nagusia da eredu horiek ingurune errealetan duten errendimendua ebaluatzea, hala nola telebista- eta irrati-programetan, non solaskide ugari eta askotariko egoera akustikoak baitaude.<\/p>\n\n\n\n<p>Diarizazio-sistema desberdinak alderatu dira, NeMo eta pyannote.audio barne, eta Diarization Error Rate (DER) bezalako metrikak aztertu dira, akatsak neurtzen dituena, hala nola, detektatze galduak, esatarien arteko nahastea eta alarma faltsuak. Itzuli bezalako tresnen erabilera ere arakatu da, prozesamendu eleanitza errazteko.<\/p>\n\n\n\n<p>Emaitzek erakusten dutenez, egungo ereduek agertoki sinpleetan errendimendu ona eskaintzen duten arren, oraindik erronkak dituzte ahotsak edo hiztun anitz gainjartzen diren egoeretan. 
Lan honek aplikazio batzuetarako sistema sendoagoak garatzen laguntzen du, hala nola, transkripzio automatikoa, bitartekoen monitorizazioa eta irisgarritasuna hobetzea.<\/p>\n","protected":false,"raw":"<!-- wp:paragraph -->\n<p><strong>Ikaslea:<\/strong> Gemma Meseguer Castillo<br><strong>Zuzendariak: <\/strong>Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja <br><strong>Defentsa-data: <\/strong>03\/07\/2025<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Master amaierako lan honek, NVIDIA NeMo plataformaren ereduak erabiliz, esatarien diarizazioa audio-emisioan du ardatz. Helburu nagusia da eredu horiek ingurune errealetan duten errendimendua ebaluatzea, hala nola telebista- eta irrati-programetan, non solaskide ugari eta askotariko egoera akustikoak baitaude.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Diarizazio-sistema desberdinak alderatu dira, NeMo eta pyannote.audio barne, eta Diarization Error Rate (DER) bezalako metrikak aztertu dira, akatsak neurtzen dituena, hala nola, detektatze galduak, esatarien arteko nahastea eta alarma faltsuak. Itzuli bezalako tresnen erabilera ere arakatu da, prozesamendu eleanitza errazteko.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Emaitzek erakusten dutenez, egungo ereduek agertoki sinpleetan errendimendu ona eskaintzen duten arren, oraindik erronkak dituzte ahotsak edo hiztun anitz gainjartzen diren egoeretan. Lan honek aplikazio batzuetarako sistema sendoagoak garatzen laguntzen du, hala nola, transkripzio automatikoa, bitartekoen monitorizazioa eta irisgarritasuna hobetzea.<\/p>\n<!-- \/wp:paragraph -->"},"excerpt":{"rendered":"Student: Gemma Meseguer CastilloAdvisors: Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja Thesis Defense Date: 03\/07\/2025 Within the field of Speech Processing, the task of Speaker Diarization has gained a more important role in recent years. 
Frameworks such as NVIDIA NeMo offer configurations for Speaker Diarization focused on domains such as telephone conversations or...","protected":false,"raw":""},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_es_post_content":"<!-- wp:paragraph -->\n<p><strong>Estudiante:<\/strong> Gemma Meseguer Castillo<br><strong>Directoras: <\/strong>Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja <br><strong>Fecha de defensa: <\/strong>03\/07\/2025<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Este trabajo de fin de m\u00e1ster se centra en la diarizaci\u00f3n de locutores en audio de emisi\u00f3n, utilizando modelos de la plataforma NVIDIA NeMo. El objetivo principal es evaluar el rendimiento de estos modelos en entornos reales, como programas de televisi\u00f3n y radio, donde hay m\u00faltiples interlocutores y condiciones ac\u00fasticas variadas.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Se han comparado diferentes sistemas de diarizaci\u00f3n, incluyendo NeMo y pyannote.audio, y se han analizado m\u00e9tricas como el Diarization Error Rate (DER), que mide errores como detecciones perdidas, confusi\u00f3n entre locutores y falsas alarmas. Tambi\u00e9n se ha explorado el uso de herramientas como Itzuli para facilitar el procesamiento multiling\u00fce.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Los resultados muestran que, aunque los modelos actuales ofrecen buen rendimiento en escenarios simples, todav\u00eda presentan desaf\u00edos en situaciones con solapamiento de voces o m\u00faltiples hablantes. 
Este trabajo contribuye al desarrollo de sistemas m\u00e1s robustos para aplicaciones como la transcripci\u00f3n autom\u00e1tica, la monitorizaci\u00f3n de medios y la mejora de la accesibilidad.<\/p>\n<!-- \/wp:paragraph -->","_es_post_name":"","_es_post_excerpt":"","_es_post_title":"Gemma Meseguer: Speaker Diarization in Broadcast Audio using NVIDIA NeMo Models","_eu_post_content":"<!-- wp:paragraph -->\n<p><strong>Ikaslea:<\/strong> Gemma Meseguer Castillo<br><strong>Zuzendariak: <\/strong>Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja <br><strong>Defentsa-data: <\/strong>03\/07\/2025<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Master amaierako lan honek, NVIDIA NeMo plataformaren ereduak erabiliz, esatarien diarizazioa audio-emisioan du ardatz. Helburu nagusia da eredu horiek ingurune errealetan duten errendimendua ebaluatzea, hala nola telebista- eta irrati-programetan, non solaskide ugari eta askotariko egoera akustikoak baitaude.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Diarizazio-sistema desberdinak alderatu dira, NeMo eta pyannote.audio barne, eta Diarization Error Rate (DER) bezalako metrikak aztertu dira, akatsak neurtzen dituena, hala nola, detektatze galduak, esatarien arteko nahastea eta alarma faltsuak. Itzuli bezalako tresnen erabilera ere arakatu da, prozesamendu eleanitza errazteko.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Emaitzek erakusten dutenez, egungo ereduek agertoki sinpleetan errendimendu ona eskaintzen duten arren, oraindik erronkak dituzte ahotsak edo hiztun anitz gainjartzen diren egoeretan. 
Lan honek aplikazio batzuetarako sistema sendoagoak garatzen laguntzen du, hala nola, transkripzio automatikoa, bitartekoen monitorizazioa eta irisgarritasuna hobetzea.<\/p>\n<!-- \/wp:paragraph -->","_eu_post_name":"","_eu_post_excerpt":"","_eu_post_title":"Gemma Meseguer: Speaker Diarization in Broadcast Audio using NVIDIA NeMo Models","_en_post_content":"<!-- wp:paragraph -->\n<p><strong>Student:<\/strong> Gemma Meseguer Castillo<br><strong>Advisors: <\/strong>Christoforos Souganidis, Eva Navas Cord\u00f3n and Inma Hern\u00e1ez Rioja <br><strong>Thesis Defense Date: <\/strong>03\/07\/2025<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Within the field of Speech Processing, the task of Speaker Diarization has grown in importance in recent years. Frameworks such as NVIDIA NeMo offer configurations for Speaker Diarization focused on domains such as telephone conversations or recorded meetings. However, the broadcast domain is not as well supported. <br>This work examines whether NVIDIA NeMo can be used effectively for Broadcast Speaker Diarization, focusing in particular on its Voice Activity Detection (VAD) and Multi-scale Diarization Decoder (MSDD) modules, and which factors influence the performance of the system. <br>Five experiments were performed, covering all combinations of VAD models and diarizers under two configurations: general and telephonic. Using only NeMo models, the best result came from combining the fine-tuned Frame-VAD model with the fine-tuned MSDD model under the telephonic configuration (45.86%). Across all experiments, the best combination was the Pyannote Segmentation Model used as an external VAD with the clustering diarizer, also under the telephonic configuration (38.80%). This result was further improved to 23.57% after post-processing. 
Finally, a statistical analysis confirmed that the television genre and the number of speakers significantly influence the performance of the Speaker Diarization system.<\/p>\n<!-- \/wp:paragraph -->","_en_post_name":"gemma-meseguer-speaker-diarization-in-broadcast-audio-using-nvdia-nemo-models","_en_post_excerpt":"","_en_post_title":"Gemma Meseguer: Speaker Diarization in Broadcast Audio using NVIDIA NeMo Models","edit_language":"eu","footnotes":""},"categories":[62],"tags":[],"class_list":["post-3947","post","type-post","status-publish","format-standard","hentry","category-master-thesis-finished"],"_links":{"self":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/posts\/3947","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/comments?post=3947"}],"version-history":[{"count":1,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/posts\/3947\/revisions"}],"predecessor-version":[{"id":3948,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/posts\/3947\/revisions\/3948"}],"wp:attachment":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/media?parent=3947"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/categories?post=3947"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/eu\/wp-json\/wp\/v2\/tags?post=3947"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}