Student: Igor Odriozola Sustaeta
Supervisors: Inma Hernáez Rioja and Eva Navas Cordón
Viva: 3rd May 2019
International PhD. Sobresaliente Cum Laude.
There is growing interest in the use of speech technology in computer-assisted language
learning (CALL) systems. This is mainly because speech technologies have improved
considerably in recent years, and nowadays more and more people use them, ever more
naturally. The literature shows that two major points of interest are involved in the use
of Automatic Speech Recognition (ASR) in CALL: Computer-Assisted Pronunciation
Training (CAPT) and Spoken Grammar Practice (SGP). In a typical CAPT application,
the user is required to read and record a sentence and send it to the learning application
server, which returns a result, generally using a three-level colour system indicating
which phones have been pronounced correctly and which have not. SGP applications are
not very popular yet, but some examples are multiple-choice tests, where users orally
choose an answer from several options, or even Intelligent Language Tutoring Systems,
where learners respond to cues given by the computer and the system provides feedback.
Such tools can strengthen students’ autonomy in their learning process, giving them the
opportunity to use their voice to improve their pronunciation or to do grammar exercises
outside the classroom.
In this work, two applications have been considered: on the one hand, a classical
CAPT system, where the student records a predefined sentence and the system gives
feedback on the pronunciation; on the other hand, a novel Word-by-Word Sentence
Verification (WWSV) system, where a sentence is verified sequentially, word by word,
and each word is displayed to the user as soon as it is detected. The WWSV approach
makes it possible to build a tool for solving grammar exercises orally (SGP). Both
systems rely on utterance verification techniques, such as the popular Goodness of
Pronunciation (GOP) score.
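For reference, the GOP score in its standard formulation (due to Witt and Young; the thesis may use a variant) is the normalised log-posterior of the target phone p given its acoustic segment, approximated in practice by a ratio between the forced-alignment likelihood and the best unconstrained phone-loop likelihood:

```latex
\mathrm{GOP}(p)
  = \frac{\left|\log P\!\left(p \mid O^{(p)}\right)\right|}{NF(p)}
  \approx \frac{1}{NF(p)}
    \left|\log \frac{p\!\left(O^{(p)} \mid p\right)}
                    {\max_{q \in Q} p\!\left(O^{(p)} \mid q\right)}\right|
```

where O^(p) is the segment aligned to phone p, NF(p) its number of frames, and Q the phone set. Under this definition, lower scores indicate better pronunciation.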
The acoustic database chosen to train the systems is the Basque Speecon-like database,
the only one publicly available for Basque, specifically designed for speech recognition
and recorded through microphones. This database presents several drawbacks, such as
the lack of a pronunciation lexicon and of some annotation files. Furthermore, it contains
a considerable amount of dialectal speech, mainly in the free spontaneous part. In ASR,
phonetic variations can be modelled using a single acoustic model; CAPT systems,
however, require “clean” models to use as a reference. Thus, some work had to be
carried out on the annotation files of the database: a number of transcriptions have been
changed, and a new lexicon has been created that includes different dialectal alternatives.
The new lexicon contains, on average, 4.12 different pronunciations per word.
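As an illustration, such a multi-pronunciation lexicon can be represented as a mapping from each word to its set of variants. The entries below are hypothetical and merely illustrative; they are not taken from the actual thesis lexicon:

```python
# Hypothetical multi-pronunciation lexicon entries (illustrative only;
# neither the words nor the variants are taken from the thesis lexicon).
lexicon = {
    "euskara": ["e u s k a r a", "e u s k e r a"],
    "etorri":  ["e t o r i", "e t o r r i"],
}

# With ~4.12 variants per word on average, the decoder's search graph
# must include one parallel path per alternative pronunciation.
for word, variants in lexicon.items():
    for pron in variants:
        print(f"{word}\t{pron}")
```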
The speech recognition software used in this thesis is AhoSR, created and developed
by the Aholab research group and designed to cope with different recognition tasks.
In this thesis, utterance verification techniques have been implemented to run together
with the basic tasks; to this end, a parallel graph has been implemented to obtain GOP
scores. For the CAPT and WWSV tasks, specific search graphs have been added in
order to adapt to the needs of each. In addition, sockets have been implemented in the
audio-input module of AhoSR. This allows real-time operation when the recogniser is
accessed through the Internet, so AhoSR can be installed on a server with universal
access.
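A minimal sketch of the idea behind such a socket-based audio input (hypothetical code, not the actual AhoSR module): the server reads fixed-size chunks from a TCP connection and hands them to the recogniser as they arrive:

```python
import socket

CHUNK = 3200  # e.g. 100 ms of 16 kHz, 16-bit mono audio (assumption)

def serve(host="0.0.0.0", port=9000, feed=print):
    """Accept one client and stream its audio chunks to `feed`."""
    with socket.create_server((host, port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(CHUNK):
                feed(data)  # e.g. push the frames into the decoder
```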
Different ways to train Hidden Markov Models (HMMs) have been analysed in depth.
Initially, better-quality HMMs were expected from the use of the new dictionary with
alternatives. However, the results do not show this, probably because of the large
number of alternative pronunciations. Adding some manually corrected data (15 % of
the training set) yields results similar to those obtained with a single-entry dictionary.
To take further advantage of the manually corrected transcriptions, different training
schedules have been compared, and we have found that slightly better HMMs are
obtained by using data with few transcription errors in the initial stages of training and
then using the whole database.
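This staged strategy can be summarised as follows (a pseudocode-style sketch; `train_hmm` and the dataset handles are placeholders, not tools from the thesis):

```python
def staged_training(clean_subset, full_dataset, train_hmm):
    """Two-stage HMM training: bootstrap on low-error transcriptions,
    then re-estimate on the whole database."""
    # Stage 1: initial estimation on the manually corrected ~15 %.
    models = train_hmm(data=clean_subset, init=None)
    # Stage 2: further re-estimation on all the data, starting from
    # the cleanly bootstrapped models.
    models = train_hmm(data=full_dataset, init=models)
    return models
```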
To build the initial system, two GOP distributions were considered necessary to classify
each phone: the distribution of correctly pronounced phones and that of incorrectly
pronounced ones. The GOP scores of incorrectly pronounced phones were obtained by
simulating errors and computing the scores through forced alignment in the AhoSR
decoder. The thresholds between correctly and incorrectly uttered phones were then
calculated as the Equal Error Rate (EER) point of the two distributions. This approach
was implemented in an initial prototype, and several laboratory experiments produced
very good results. The system was then tested in a more realistic environment: Basque
language schools, with 20 students. The objective results, along with the surveys filled
in by the students who tested the system, were really promising.
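As an illustration of the thresholding step described above, the EER point between two score distributions can be estimated by sweeping candidate thresholds until the false-acceptance and false-rejection rates meet. This is a minimal sketch with hypothetical sample scores, not the thesis data, and it assumes the standard GOP convention that lower scores mean better pronunciation:

```python
import numpy as np

def eer_threshold(correct_scores, incorrect_scores):
    """Return the threshold where false acceptances ~= false rejections.
    Convention (assumption): low GOP = good, so a phone is accepted
    when its score falls below the threshold."""
    candidates = np.sort(np.concatenate([correct_scores, incorrect_scores]))
    best_t, best_gap = candidates[0], np.inf
    for t in candidates:
        far = np.mean(incorrect_scores < t)   # bad phones wrongly accepted
        frr = np.mean(correct_scores >= t)    # good phones wrongly rejected
        if abs(far - frr) < best_gap:
            best_t, best_gap = t, abs(far - frr)
    return best_t

# Hypothetical GOP samples for the two classes:
rng = np.random.default_rng(0)
t = eer_threshold(rng.normal(1.0, 0.5, 500), rng.normal(3.0, 1.0, 500))
```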
The initial prototype was executed locally, and we felt the need to develop a more
universal system that could be accessed from any device and anywhere. We therefore
took advantage of the specifications of the recent HTML5 standard, which lets the
browser access the audio input, regardless of the platform, by means of the audio API.
This has given us the opportunity to create a system accessible from any operating
system. Moreover, for the WWSV-based SGP task, another HTML5 API has been used
(the WebSocket API), which creates socket-like connections between the browser and
the server in order to send audio data on the fly.
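On the server side, such a connection can be handled, for instance, with the Python `websockets` package (a hypothetical sketch, not the actual thesis implementation): binary messages carrying audio arrive as they are produced and partial results are sent back over the same connection, enabling word-by-word feedback:

```python
import asyncio
import websockets  # third-party package: pip install websockets

async def handle(ws):
    """Receive audio chunks from the browser and send back partial
    results (single-argument handler, websockets >= 10)."""
    async for chunk in ws:            # each message = one audio buffer
        result = recognise(chunk)     # placeholder for the decoder call
        if result:
            await ws.send(result)     # word-by-word feedback (WWSV)

def recognise(chunk):
    return None  # stub: a real system would feed the ASR decoder here

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()        # run until cancelled

asyncio.run(main())
```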
Several drawbacks had to be managed for the on-line implementation of the system.
For example, because users record audio with many different devices, some kind of
parameter normalisation is needed. Furthermore, an on-line normalisation technique
is necessary, since in WWSV continuous feedback must be provided before the whole
signal has arrived at the recogniser. Different techniques have been tested to implement
Cepstral Mean and Variance Normalisation (CMVN) and to estimate the initial values
of the cepstral means and variances. The best results have been obtained with a hybrid
approach proposed in this work, in which the initial means are estimated using the first
N frames and the initial variances are obtained from the training datasets.
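A minimal sketch of such an on-line CMVN follows. The recursive update rule, the class name and the parameter values are assumptions for illustration, not the exact thesis formulation; only the hybrid initialisation (means from the first N frames, variances from training statistics) comes from the text:

```python
import numpy as np

class OnlineCMVN:
    """On-line CMVN with hybrid initialisation: means from the first
    N incoming frames, variances from training-set statistics."""

    def __init__(self, train_var, n_init=100, alpha=0.995):
        self.var = train_var.copy()   # initial variances: training data
        self.n_init, self.alpha = n_init, alpha
        self.buf = []                 # first frames, to estimate the mean
        self.mean = None

    def __call__(self, mfcc_frame):
        if self.mean is None:
            self.buf.append(mfcc_frame)
            if len(self.buf) < self.n_init:
                return mfcc_frame     # not yet initialised
            self.mean = np.mean(self.buf, axis=0)
        # Recursive (exponentially weighted) update of the statistics.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * mfcc_frame
        d = mfcc_frame - self.mean
        self.var = self.alpha * self.var + (1 - self.alpha) * d * d
        return d / np.sqrt(self.var + 1e-8)
```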
In addition, a new CMVN technique has been devised in this thesis: Multi-Normalisation
Scoring (MNS) based CMVN. MNS consists in generating multiple observation likelihood
scores by normalising the incoming Mel-Frequency Cepstral Coefficients (MFCCs) using
means and variances computed from different speech datasets recorded under different
conditions. MNS-based CMVN then computes the probability that a frame belongs to
each of the training datasets; these probabilities can be used as weights to estimate the
actual means and variances. The results obtained are remarkable, mainly for clean
signals. The greatest advantage of using MNS is that CMVN can be performed on-line,
frame by frame, with no need to analyse the neighbouring frames or the frames of the
segment to which the current frame belongs.
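A sketch of this weighted estimation, assuming per-dataset diagonal Gaussians as the membership model (the exact scoring in the thesis may differ): each frame is scored against every dataset's statistics, the scores are turned into posterior-like weights, and the weighted combination of the datasets' means and variances is used to normalise the frame:

```python
import numpy as np

def mns_cmvn(frame, ds_means, ds_vars):
    """MNS-style CMVN sketch: ds_means/ds_vars have shape (K, D) for
    K training datasets and D cepstral coefficients."""
    # Log-likelihood of the frame under each dataset's diagonal Gaussian.
    ll = -0.5 * np.sum(
        np.log(2 * np.pi * ds_vars) + (frame - ds_means) ** 2 / ds_vars,
        axis=1,
    )
    w = np.exp(ll - ll.max())
    w /= w.sum()                      # posterior-like dataset weights
    mean = w @ ds_means               # weighted estimate of the mean
    var = w @ ds_vars                 # weighted estimate of the variance
    return (frame - mean) / np.sqrt(var + 1e-8)
```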
Using the same MNS method, a novel and effective on-line Voice Activity Detector
(VAD) has also been devised. In a validation experiment comparing our MNS-based
VAD with two ITU-T VAD algorithms (G.720.1 and G.729B), we obtained better overall
results: the classification errors are considerably lower for non-speech frames and
comparable for speech frames. This makes our detector useful for systems that require
low error rates for both speech and non-speech frames.
Finally, Neural Networks have been used to assess the impact of different parameters
when training a classifier. We have found that GOP scores are the most effective
parameters, compared with the durations and log-likelihoods of the previous, current
and following phones. The results of these experiments are consistent with those
obtained with the initial system.
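As an illustration of such a classifier (a sketch using an assumed feature layout and scikit-learn's MLPClassifier, not the thesis setup): each phone is represented by the GOP score, duration and log-likelihood of the previous, current and following phones, and the network predicts correct versus mispronounced:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Assumed feature vector per phone: (GOP, duration, log-likelihood)
# for the previous, current and following phones -> 9 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))        # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)     # 1 = mispronounced (synthetic)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))       # per-phone error probabilities
```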