Spanish Language Models
=======================

This directory contains lm.binary files generated with the KenLM build_binary
program. All the binary language models here include the vocabulary.

The language models available are the following:

- lm.binary.es (17 MB): small, older language model by @nahuelproietto, obtained
  from https://github.com/nahuelproietto/deepspeech-spanish-model
  Disclaimer: this LM has its own license; check the link before using it.

- lm.binary.es.aholab-2023-02-07 (261 MB): medium language model trained with
  the following corpora:
  - News Commentary: https://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz
  - Europarl v7: https://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
  - News Crawl (articles from 2012): https://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
  The creation of this LM was partially based on the Jaco-Assistant team's work:
  https://gitlab.com/Jaco-Assistant/Scribosermo/-/tree/deepspeech?ref_type=tags#create-the-language-model

- lm.binary.es.aholab-2023-05-02 (1.7 GB): big language model trained with the
  following corpora:
  - Tatoeba, OpenSubtitles, TED, and GlobalVoices corpora from OPUS:
    https://opus.nlpl.eu/
  - Wikipedia dump (2023-02-01): https://dumps.wikimedia.org/eswiki/
  This LM was created using a modified version of the covo toolkit, optimized
  for speed: https://github.com/ftyers/commonvoice-utils

- lm.binary.es.aholab-2023-05-08 (1.7 GB): improved version of the previous
  model, obtained by adding the following corpora:
  - Common Voice 12.0 transcriptions of train and dev splits:
    https://commonvoice.mozilla.org/en/datasets
  - LibriSpeech LM resources data: https://www.openslr.org/94/

- lm.binary.es.aholab-2023-05-10 (1.7 GB): same as the 2023-05-08 LM, but with
  diacritics included in the alphabet.

- lm.binary.es.aholab-2023-09-13 (642 MB): improved version of the 2023-05-02
  model, built with the following updated corpora:
  - Wikipedia dump of 2023-09-1.
  - Common Voice 14.0 transcriptions of train and dev splits:
    https://commonvoice.mozilla.org/en/datasets

The novocab directory contains the language models without the vocabulary,
ready to be used by Deep Speech models.
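
As a quick sanity check, the binary models in this directory can be listed and
queried from Python. The following is only a minimal sketch, not part of the
original tooling: it assumes the kenlm Python module is installed (for example
from https://github.com/kpu/kenlm), and the helper names
list_language_models and score_sentence are hypothetical. It also assumes the
input sentence is normalized the same way as the training text, which this
README does not specify.

```python
from pathlib import Path


def list_language_models(directory):
    """Return (file name, size in MB) for each lm.binary* file, sorted by name."""
    models = []
    for path in sorted(Path(directory).glob("lm.binary*")):
        if path.is_file():
            models.append((path.name, path.stat().st_size / 1e6))
    return models


def score_sentence(model_path, sentence):
    """Return the log10 probability of `sentence` under a KenLM binary model.

    Assumption: the kenlm Python module is available. It is imported lazily
    so that list_language_models works without it.
    """
    import kenlm

    model = kenlm.Model(str(model_path))
    # bos/eos add the begin- and end-of-sentence markers around the input.
    return model.score(sentence, bos=True, eos=True)
```

For example, score_sentence("lm.binary.es.aholab-2023-09-13", "hola mundo")
would return a negative log10 probability; values closer to zero indicate
sentences the model considers more likely.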