In recent years, large language models (LLMs) have dramatically changed natural language processing (NLP). The breakthrough came with transformer architectures, which are particularly well suited to processing long text contexts, and with generative language models. What does it take to create such a model that communicates in Slovak?
The amount of text typically needed to train a model that communicates grammatically correctly is on the order of one trillion words. Is there even enough text in Slovak? The Slovak National Corpus, carefully collected and built over many years, currently contains 1.5 billion words. The web corpus is larger, at around 4 billion words. The fondness of Slovakia's inhabitants for legal disputes is documented by the corpus of court decisions, which has over 10 billion words and is currently the largest available corpus of Slovak texts. Other corpora are significantly smaller.
However, we can turn to multilingual LLMs and fine-tune them with Slovak data. It turns out that about 5 billion words are enough for such a model to “learn” a new language. That figure is close to the size of the web corpus alone, which opens up the possibility of adding Slovak to existing multilingual LLMs.
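To make the comparison concrete, the figures quoted above can be put side by side in a short Python sketch. The numbers are the approximate ones from the text. The names `CORPORA`, `FROM_SCRATCH_WORDS`, `FINE_TUNE_WORDS` and `coverage` are illustrative only. The sketch also assumes that the three corpora do not overlap and treats "over 10 billion" as exactly 10 billion, so the total is only a rough estimate.

```python
# Approximate corpus sizes in words, as quoted in the text.
CORPORA = {
    "Slovak National Corpus": 1.5e9,
    "web corpus": 4e9,
    "court decisions": 10e9,  # "over 10 billion"; used here as a lower bound
}

FROM_SCRATCH_WORDS = 1e12  # roughly one trillion words to train a model from scratch
FINE_TUNE_WORDS = 5e9      # roughly 5 billion words to add a language by fine-tuning


def coverage(available: float, required: float) -> float:
    """Return the fraction of the required amount of text that is available."""
    return available / required


# Naive total: assumes the corpora do not overlap, which may not hold in practice.
total = sum(CORPORA.values())

print(f"Total available: {total / 1e9:.1f} billion words")
print(f"Share of from-scratch budget: {coverage(total, FROM_SCRATCH_WORDS):.4f}")
print(f"Share of fine-tuning budget: {coverage(total, FINE_TUNE_WORDS):.2f}")
```

Under these assumptions, the available Slovak text covers only a small fraction of what training from scratch would need, while it exceeds the fine-tuning estimate several times over.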