Basque Language Models
======================

This directory contains lm.binary files generated with the KenLM build_binary program. All the binary language models here include the vocabulary.

The following language models are available:

- 5gram.bin (11 GB): a large language model trained on the following corpora:
  - Tatoeba, OpenSubtitles, TED, GlobalVoices, and other corpora from OPUS: https://opus.nlpl.eu/
  - Wikipedia dump (2023-09-20): https://dumps.wikimedia.org/euwiki/
  - EusCrawl 1.0: https://ixa.ehu.eus/euscrawl/

  The training text for this LM was preprocessed using commonvoice-utils: https://github.com/ftyers/commonvoice-utils

  The final corpus is included in the corpora.txt file.
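
As a rough illustration of how a binary model such as 5gram.bin might be queried, here is a minimal sketch using the `kenlm` Python module. It assumes that package is installed and that 5gram.bin is in the current directory. The helper names (`load_model`, `rank_by_score`, `demo`) and the example sentences are hypothetical and not part of this directory. Input text would probably need to be normalized the same way as the training corpus to get meaningful scores.

```python
from typing import Iterable, List, Tuple

MODEL_PATH = "5gram.bin"  # assumption: the model file is in the current directory


def load_model(path: str = MODEL_PATH):
    """Load a KenLM binary model (requires the `kenlm` Python package)."""
    import kenlm  # imported lazily so rank_by_score can be used without it

    return kenlm.Model(path)


def rank_by_score(model, sentences: Iterable[str]) -> List[Tuple[float, str]]:
    """Return (log10 probability, sentence) pairs, most probable first."""
    scored = [(model.score(s, bos=True, eos=True), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def demo() -> None:
    """Hypothetical usage: compare two word orders of the same phrase."""
    lm = load_model()
    print("n-gram order:", lm.order)
    for score, sentence in rank_by_score(lm, ["kaixo mundua", "mundua kaixo"]):
        print(f"{score:.2f}\t{sentence}")
```

Calling `demo()` loads the model and prints each sentence with its log10 probability. Loading an 11 GB file can take noticeable time and memory, depending on how the binary was built.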