{"id":1710,"date":"2019-05-03T19:27:00","date_gmt":"2019-05-03T17:27:00","guid":{"rendered":"http:\/\/aholab.ehu.eus\/wordpress\/?p=1710"},"modified":"2025-06-05T13:41:19","modified_gmt":"2025-06-05T11:41:19","slug":"speech-recognition-based-strategies-for-on-line-computer-assisted-language-learning-systems-for-basque","status":"publish","type":"post","link":"https:\/\/aholab.ehu.eus\/aholab\/speech-recognition-based-strategies-for-on-line-computer-assisted-language-learning-systems-for-basque\/","title":{"rendered":"2019, Igor Odriozola: Speech recognition based strategies for on-line Computer-assisted Language Learning systems for Basque"},"content":{"rendered":"<p><strong>Student<\/strong>: Igor Odriozola Sustaeta<\/p>\n<p><strong>Supervisors<\/strong>: Inma Hern\u00e1ez Rioja and Eva Navas Cord\u00f3n<\/p>\n<p><strong>Viva<\/strong>: 3th May 2019<\/p>\n<p>International PhD. Sobresaliente Cum Laude.<\/p>\n<p><a href=\"https:\/\/aholab.ehu.eus\/aholab\/public\/PhD\/OdriozolaPhD.pdf?_t=1626189501\"><strong>Document<\/strong><\/a><\/p>\n\n\n<p>There is a growing interest in the use of speech technology in computer-assisted language<br>learning (CALL) systems. This is mainly due to the fact that speech technologies have<br>considerably improved during the last years, and nowadays more and more people make<br>use of them, ever more naturally. Literature shows that two major points of interest are<br>involved in the use of Automatic Speech Recognition (ASR) in CALL: Computer-assisted<br>Pronunciation Training (CAPT) and Spoken Grammar Practise (SGP). In a CAPT<br>typical application, the user is required to read and record a sentence, and send it to<br>the learning application server. It returns a result, generally using a three-level colour<br>system indicating which phone has been correctly pronounced and which has not. 
SGP applications are not very popular yet, but some examples are multiple-choice tests, where users choose an answer orally from several options, or even Intelligent Language Tutoring Systems, where learners respond to cues given by the computer and the system provides feedback. Such tools can be used to strengthen students\u2019 autonomy in their learning process, giving them the opportunity to use their voice to improve their pronunciation or to do grammar exercises outside the classroom.<\/p>\n<p>In this work, two applications have been considered: on the one hand, a classical CAPT system, where the student records a predefined sentence and the system gives feedback on the pronunciation; on the other hand, a novel Word-by-Word Sentence Verification (WWSV) system, where a sentence is verified sequentially, word by word, and as soon as a word is detected it is displayed to the user. WWSV opens up the possibility of creating a tool for solving grammar exercises orally (SGP). Both systems rely on utterance verification techniques, such as the popular Goodness-of-Pronunciation (GOP) score.<\/p>\n<p>The acoustic database chosen to train the systems is the Basque Speecon-like database, the only one publicly available for Basque, specifically designed for speech recognition and recorded through a microphone. This database presents several drawbacks, such as the lack of a pronunciation lexicon and of some annotation files. Furthermore, it contains a large amount of dialectal speech, mainly in the free spontaneous part. In ASR, phonetic variations can be modelled using a single acoustic model. However, CAPT systems require &#8220;clean&#8221; models to use as a reference. Thus, some work had to be carried out on the annotation files of the database. A number of transcriptions have been changed and a new lexicon has been created considering different dialectal alternatives. 
Notably, the new lexicon contains on average 4.12 different pronunciations per word.<\/p>\n<p>The speech recognition software used in this thesis is AhoSR. It has been created and developed by the Aholab research group, and it has been designed to cope with different recognition tasks. In this thesis, utterance verification techniques have been implemented to run together with the basic tasks. To do so, a parallel graph has been implemented to obtain GOP scores. For the CAPT and WWSV tasks, specific search graphs have been added in order to adapt to the needs of each of them. In addition, sockets have been implemented in the audio-input module of AhoSR. This allows real-time operation when the recogniser is accessed through the internet, and makes it possible to install AhoSR on a server with universal access.<\/p>\n<p>Different ways to train Hidden Markov Models (HMMs) have been analysed in depth. Initially, better-quality HMMs were expected from using the new dictionary with alternatives. However, the results do not show this, probably because of the large number of alternative pronunciations. Adding some manually corrected data (15 % of the training set) yields results similar to those obtained using a single-entry dictionary. In order to take advantage of the manually corrected transcriptions, different ways of training HMMs have been analysed. Thus, we have found that slightly better HMMs are achieved by using data with few transcription errors in the initial stages of training and then using the whole database.<\/p>\n<p>To build the initial system, two GOP distributions were considered necessary to classify each phone: the distribution of the correctly pronounced phones and the distribution of the incorrectly pronounced ones. 
The GOPs of the incorrectly pronounced phones were obtained by simulating errors and computing the GOP scores by forced alignment in the AhoSR decoder. The thresholds between correctly and incorrectly uttered phones were then calculated as the Equal Error Rate (EER) point of the two distributions. This approach was implemented in an initial prototype, and several laboratory experiments were performed which produced very good results. The system was then tested in a more realistic environment: Basque language schools, with 20 students. The objective results, along with the survey filled in by the 20 students who tested the system, were really promising.<\/p>\n<p>The initial prototype was executed locally, and we felt the need to develop a more universal system that could be accessed from any device, anywhere. Thus, we took advantage of the specifications of the recent HTML5 standard, which lets the browser access the audio input, regardless of the platform, by means of the audio API. This has given us the opportunity to create a system accessible from any operating system. Moreover, for the WWSV-based SGP task, another HTML5 API has been used (the WebSocket API), which creates socket-like connections between the browser and the server in order to send audio data on the fly.<\/p>\n<p>Several drawbacks had to be managed for the on-line implementation of the system: for example, because of the different devices that users will use to pick up audio, some kind of parameter normalisation is needed. Furthermore, an on-line normalisation technique is necessary, since in WWSV continuous feedback must be provided before the whole signal has arrived at the recogniser. Different techniques have been tested to implement Cepstral Mean and Variance Normalisation (CMVN) and to estimate the initial values of the cepstral means and variances. 
The best results have been obtained with a hybrid approach proposed in this work, in which the initial means are estimated using the first N frames and the initial variances are obtained from the training datasets.<\/p>\n<p>In addition, a new CMVN technique has been devised in this thesis: Multi-Normalisation Scoring (MNS) based CMVN. MNS consists of generating multiple observation likelihood scores by normalising the incoming Mel-Frequency Cepstral Coefficients (MFCCs) using means and variances computed from different speech datasets recorded under different conditions. MNS-based CMVN computes the probabilities that a frame belongs to the different training datasets; these probabilities can then be used as weights to estimate the actual means and variances. The results obtained are remarkable, mainly for clean signals. The greatest advantage of using MNS is that CMVN can be performed on-line, frame by frame, with no need to analyse the neighbouring frames or the other frames of the segment to which the frame belongs.<\/p>\n<p>Using the same MNS method, a novel and effective on-line Voice Activity Detector (VAD) has been devised as well. In a validation experiment comparing our MNS-based VAD with two ITU-T VAD algorithms (G.720.1 and G.729B), we obtained better overall results, since the classification errors are considerably lower for non-speech frames and comparable for speech frames. This makes our system useful for applications that require low error rates for both speech and non-speech frames.<\/p>\n<p>Finally, Neural Networks have been used to assess the impact of different parameters when training a classifier. As a result, we have seen that GOP scores are the most effective parameters, compared with the durations and log-likelihoods of the previous, current and following phones. 
The results of the experiments are consistent with those obtained in the initial system.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Student: Igor Odriozola Sustaeta Supervisors: Inma Hern\u00e1ez Rioja and Eva Navas Cord\u00f3n Viva: 3rd May 2019 International PhD. Sobresaliente Cum Laude. Document There is a growing interest in the use of speech technology in computer-assisted language learning (CALL) systems. This is mainly due to the fact that speech technologies have considerably improved during the last years, and&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_es_post_content":"","_es_post_name":"","_es_post_excerpt":"","_es_post_title":"","_eu_post_content":"","_eu_post_name":"","_eu_post_excerpt":"","_eu_post_title":"","_en_post_content":"<p><strong>Student<\/strong>: Igor Odriozola Sustaeta<\/p>\n<p><strong>Supervisors<\/strong>: Inma Hern\u00e1ez Rioja and Eva Navas Cord\u00f3n<\/p>\n<p><strong>Viva<\/strong>: 3rd May 2019<\/p>\n<p>International PhD. Sobresaliente Cum Laude.<\/p>\n<p><a href=\"https:\/\/aholab.ehu.eus\/aholab\/public\/PhD\/OdriozolaPhD.pdf?_t=1626189501\"><strong>Document<\/strong><\/a><\/p>\n\n<!-- wp:paragraph -->\n<p>There is a growing interest in the use of speech technology in computer-assisted language learning (CALL) systems. This is mainly because speech technologies have improved considerably in recent years, and nowadays more and more people use them, ever more naturally. The literature shows that two major points of interest are involved in the use of Automatic Speech Recognition (ASR) in CALL: Computer-assisted Pronunciation Training (CAPT) and Spoken Grammar Practice (SGP). In a typical CAPT application, the user is required to read and record a sentence and send it to the learning application server. 
The server returns a result, generally using a three-level colour system that indicates which phones have been pronounced correctly and which have not. SGP applications are not very popular yet, but some examples are multiple-choice tests, where users choose an answer orally from several options, or even Intelligent Language Tutoring Systems, where learners respond to cues given by the computer and the system provides feedback. Such tools can be used to strengthen students\u2019 autonomy in their learning process, giving them the opportunity to use their voice to improve their pronunciation or to do grammar exercises outside the classroom.<\/p>\n<p>In this work, two applications have been considered: on the one hand, a classical CAPT system, where the student records a predefined sentence and the system gives feedback on the pronunciation; on the other hand, a novel Word-by-Word Sentence Verification (WWSV) system, where a sentence is verified sequentially, word by word, and as soon as a word is detected it is displayed to the user. WWSV opens up the possibility of creating a tool for solving grammar exercises orally (SGP). Both systems rely on utterance verification techniques, such as the popular Goodness-of-Pronunciation (GOP) score.<\/p>\n<p>The acoustic database chosen to train the systems is the Basque Speecon-like database, the only one publicly available for Basque, specifically designed for speech recognition and recorded through a microphone. This database presents several drawbacks, such as the lack of a pronunciation lexicon and of some annotation files. Furthermore, it contains a large amount of dialectal speech, mainly in the free spontaneous part. In ASR, phonetic variations can be modelled using a single acoustic model. However, CAPT systems require \"clean\" models to use as a reference. Thus, some work had to be carried out on the annotation files of the database. 
A number of transcriptions have been changed and a new lexicon has been created considering different dialectal alternatives. Notably, the new lexicon contains on average 4.12 different pronunciations per word.<\/p>\n<p>The speech recognition software used in this thesis is AhoSR. It has been created and developed by the Aholab research group, and it has been designed to cope with different recognition tasks. In this thesis, utterance verification techniques have been implemented to run together with the basic tasks. To do so, a parallel graph has been implemented to obtain GOP scores. For the CAPT and WWSV tasks, specific search graphs have been added in order to adapt to the needs of each of them. In addition, sockets have been implemented in the audio-input module of AhoSR. This allows real-time operation when the recogniser is accessed through the internet, and makes it possible to install AhoSR on a server with universal access.<\/p>\n<p>Different ways to train Hidden Markov Models (HMMs) have been analysed in depth. Initially, better-quality HMMs were expected from using the new dictionary with alternatives. However, the results do not show this, probably because of the large number of alternative pronunciations. Adding some manually corrected data (15 % of the training set) yields results similar to those obtained using a single-entry dictionary. In order to take advantage of the manually corrected transcriptions, different ways of training HMMs have been analysed. Thus, we have found that slightly better HMMs are achieved by using data with few transcription errors in the initial stages of training and then using the whole database.<\/p>\n<p>To build the initial system, two GOP distributions were considered necessary to classify each phone: the distribution of the correctly pronounced phones and the distribution of the incorrectly pronounced ones. 
The GOPs of the incorrectly pronounced phones were obtained by simulating errors and computing the GOP scores by forced alignment in the AhoSR decoder. The thresholds between correctly and incorrectly uttered phones were then calculated as the Equal Error Rate (EER) point of the two distributions. This approach was implemented in an initial prototype, and several laboratory experiments were performed which produced very good results. The system was then tested in a more realistic environment: Basque language schools, with 20 students. The objective results, along with the survey filled in by the 20 students who tested the system, were really promising.<\/p>\n<p>The initial prototype was executed locally, and we felt the need to develop a more universal system that could be accessed from any device, anywhere. Thus, we took advantage of the specifications of the recent HTML5 standard, which lets the browser access the audio input, regardless of the platform, by means of the audio API. This has given us the opportunity to create a system accessible from any operating system. Moreover, for the WWSV-based SGP task, another HTML5 API has been used (the WebSocket API), which creates socket-like connections between the browser and the server in order to send audio data on the fly.<\/p>\n<p>Several drawbacks had to be managed for the on-line implementation of the system: for example, because of the different devices that users will use to pick up audio, some kind of parameter normalisation is needed. Furthermore, an on-line normalisation technique is necessary, since in WWSV continuous feedback must be provided before the whole signal has arrived at the recogniser. Different techniques have been tested to implement Cepstral Mean and Variance Normalisation (CMVN) and to estimate the initial values of the cepstral means and variances. 
The best results have been obtained with a hybrid approach proposed in this work, in which the initial means are estimated using the first N frames and the initial variances are obtained from the training datasets.<\/p>\n<p>In addition, a new CMVN technique has been devised in this thesis: Multi-Normalisation Scoring (MNS) based CMVN. MNS consists of generating multiple observation likelihood scores by normalising the incoming Mel-Frequency Cepstral Coefficients (MFCCs) using means and variances computed from different speech datasets recorded under different conditions. MNS-based CMVN computes the probabilities that a frame belongs to the different training datasets; these probabilities can then be used as weights to estimate the actual means and variances. The results obtained are remarkable, mainly for clean signals. The greatest advantage of using MNS is that CMVN can be performed on-line, frame by frame, with no need to analyse the neighbouring frames or the other frames of the segment to which the frame belongs.<\/p>\n<p>Using the same MNS method, a novel and effective on-line Voice Activity Detector (VAD) has been devised as well. In a validation experiment comparing our MNS-based VAD with two ITU-T VAD algorithms (G.720.1 and G.729B), we obtained better overall results, since the classification errors are considerably lower for non-speech frames and comparable for speech frames. This makes our system useful for applications that require low error rates for both speech and non-speech frames.<\/p>\n<p>Finally, Neural Networks have been used to assess the impact of different parameters when training a classifier. As a result, we have seen that GOP scores are the most effective parameters, compared with the durations and log-likelihoods of the previous, current and following phones. 
The results of the experiments are consistent with those obtained in the initial system.<\/p>\n<!-- \/wp:paragraph -->","_en_post_name":"speech-recognition-based-strategies-for-on-line-computer-assisted-language-learning-systems-for-basque","_en_post_excerpt":"","_en_post_title":"2019, Igor Odriozola: Speech recognition based strategies for on-line Computer-assisted Language Learning systems for Basque","edit_language":"en","footnotes":""},"categories":[68],"tags":[],"class_list":["post-1710","post","type-post","status-publish","format-standard","hentry","category-phd-thesis-finished"],"_links":{"self":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/posts\/1710","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/comments?post=1710"}],"version-history":[{"count":6,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/posts\/1710\/revisions"}],"predecessor-version":[{"id":3830,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/posts\/1710\/revisions\/3830"}],"wp:attachment":[{"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/media?parent=1710"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/categories?post=1710"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aholab.ehu.eus\/aholab\/wp-json\/wp\/v2\/tags?post=1710"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}