Student: Maite Fontecha
Supervisors: Eva Navas, Inma Hernáez
Defense date: September 2022
Description:
Text-to-speech (TTS) systems generate speech from text, a technology that can improve
people’s quality of life. However, extending these models to languages such as Spanish is
hindered by the scarcity of databases, data processing tools, and model training resources.
In this thesis, I implemented and evaluated a Spanish TTS model based on FastPitch using a
10-hour database. FastPitch is a neural network-based end-to-end TTS system that allows for
prosody transformations. I first surveyed state-of-the-art TTS and preprocessed the dataset,
then implemented and evaluated the model. As a result, several resources are provided:
tools for raw database processing, methods for linguistic module adaptation, a clean dataset,
and a high-quality Spanish TTS system.
The model’s quality is evaluated with two vocoders (WaveGlow and HiFiGan) and compared
against two other state-of-the-art acoustic models (FastSpeech2 and Tacotron2). The FastPitch
model synthesized with the HiFiGan vocoder obtained the highest quality results. Finally,
prosody transformation experiments at inference time were successful with this Spanish
FastPitch TTS model.
Keywords: Text-To-Speech, Spanish, acoustic models, data preprocessing, Deep Neural
Networks.