Spanish Language Models
=======================

This directory contains lm.binary files generated with the KenLM build_binary program. All the binary language models here include the vocabulary. The following language models are available:

- lm.binary.es (17 MB): small, older language model by @nahuelproietto, obtained from https://github.com/nahuelproietto/deepspeech-spanish-model
  Disclaimer: this LM has its own license; check the link before using it.

- lm.binary.es.aholab-2023-02-07 (261 MB): medium language model trained with the following corpora:
    - News Commentary: https://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz
    - Europarl v7: https://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
    - News Crawl (articles from 2012): https://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
  The creation of this LM was partially based on the Jaco-Assistant team's work: https://gitlab.com/Jaco-Assistant/Scribosermo/-/tree/deepspeech?ref_type=tags#create-the-language-model

- lm.binary.es.aholab-2023-05-02 (1.7 GB): big language model trained with the following corpora:
    - Tatoeba, OpenSubtitles, TED, and GlobalVoices corpora from OPUS: https://opus.nlpl.eu/
    - Wikipedia dump (2023-02-01): https://dumps.wikimedia.org/eswiki/
  This LM was created using a modified version of the covo toolkit optimized for speed: https://github.com/ftyers/commonvoice-utils

- lm.binary.es.aholab-2023-05-08 (1.7 GB): improved version of the previous model, adding the following corpora:
    - Common Voice 12.0 transcriptions of train and dev splits: https://commonvoice.mozilla.org/en/datasets
    - LibriSpeech LM resources data: https://www.openslr.org/94/

- lm.binary.es.aholab-2023-05-10 (1.7 GB): the 2023-05-08 LM, but with diacritics included in the alphabet.

- lm.binary.es.aholab-2023-09-13 (642 MB): improved version of the 2023-05-02 model, adding updated versions of the following corpora:
    - Wikipedia dump of 2023-09-1
    - Common Voice 14.0 transcriptions of train and dev splits: https://commonvoice.mozilla.org/en/datasets

The novocab directory contains the language models without the vocabulary, ready to be used by Deep Speech models.
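
As a rough illustration of how the two variants (with and without vocabulary) could be produced, the sketch below assembles KenLM build_binary command lines in Python. It rests on several assumptions: the `trie` data structure, the ARPA file name `lm.es.arpa`, and the output name `lm.binary.es.example` are placeholders, and the helper `build_binary_command` is hypothetical. The exact options used for the models in this directory are not recorded here. The `-v` option is the build_binary switch that leaves the vocabulary out of the binary file.

```python
import shlex


def build_binary_command(arpa_path, binary_path, include_vocab=True,
                         build_binary="build_binary"):
    """Assemble a KenLM build_binary command as a list of arguments.

    Hypothetical helper for illustration only. The "trie" data structure
    is an assumption, not necessarily what was used for these models.
    """
    cmd = [build_binary]
    if not include_vocab:
        # -v tells build_binary not to store the vocabulary in the binary.
        cmd.append("-v")
    cmd += ["trie", str(arpa_path), str(binary_path)]
    return cmd


if __name__ == "__main__":
    arpa = "lm.es.arpa"  # placeholder ARPA file name (assumption)
    with_vocab = build_binary_command(arpa, "lm.binary.es.example")
    no_vocab = build_binary_command(
        arpa, "novocab/lm.binary.es.example", include_vocab=False
    )
    # Print the commands instead of running them, so the sketch works
    # even where KenLM is not installed.
    print(shlex.join(with_vocab))
    print(shlex.join(no_vocab))
```

To actually build a model, the returned list could be passed to `subprocess.check_call`, provided the KenLM binaries are on the PATH.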