What Is Needed to Train a Slovak Large Language Model

Course/Event Essentials

Event/Course Start
Event/Course End
Event/Course Format: In person

Venue Information

Country: Slovakia

Training Content and Scope

Scientific Domain
Level of Instruction: Other
Sector of the Target Audience: Other (general public...)
Language of Instruction

Other Information

Organiser
Supporting Project(s): EuroCC2/CASTIEL2
Event/Course Description

In recent years, large language models (LLMs) have dramatically changed natural language processing (NLP). The breakthrough came with transformer architectures, which are particularly well suited to processing long text contexts, and with generative language models. What does it take to create such a model that communicates in Slovak?
The amount of text typically needed to train a model that communicates in grammatically correct language is on the order of one trillion words. Is there even enough text in Slovak? The Slovak National Corpus, carefully collected and built up over many years, currently contains 1.5 billion words. The web corpus is larger, at around 4 billion words. The fondness of Slovakia's inhabitants for legal disputes is documented by the corpus of court decisions, which has over 10 billion words and is currently the largest available corpus of Slovak texts. Other corpora are significantly smaller.
However, we can turn to multilingual LLMs and fine-tune them on Slovak data. It turns out that about 5 billion words are enough for a model to “learn” a new language. That figure is close to the size of the web corpus alone, which opens up the possibility of adding Slovak to existing multilingual LLMs.
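
The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration: the word counts are the approximate figures quoted above, the names `CORPORA`, `FROM_SCRATCH_WORDS`, `FINE_TUNE_WORDS` and `coverage` are hypothetical, and summing the corpora assumes they do not overlap, which is optimistic.

```python
# Approximate corpus sizes in words, as quoted in the description above.
CORPORA = {
    "Slovak National Corpus": 1_500_000_000,
    "web corpus": 4_000_000_000,
    "court decisions": 10_000_000_000,  # "over 10 billion", so a lower bound
}

# Rough requirements quoted in the description (orders of magnitude only).
FROM_SCRATCH_WORDS = 1_000_000_000_000  # about one trillion words
FINE_TUNE_WORDS = 5_000_000_000         # about 5 billion words


def coverage(available: int, required: int) -> float:
    """Return the fraction of the required word count that is available."""
    return available / required


# Assumption: the corpora are disjoint, so their sizes can simply be added.
total_words = sum(CORPORA.values())

for name, size in CORPORA.items():
    print(
        f"{name}: {coverage(size, FINE_TUNE_WORDS):.0%} of the fine-tuning "
        f"estimate, {coverage(size, FROM_SCRATCH_WORDS):.2%} of the "
        f"from-scratch estimate"
    )

print(
    f"all listed corpora: {coverage(total_words, FINE_TUNE_WORDS):.0%} of the "
    f"fine-tuning estimate, {coverage(total_words, FROM_SCRATCH_WORDS):.2%} "
    f"of the from-scratch estimate"
)
```

Under these assumptions the listed corpora together cover only a small fraction of the from-scratch estimate but exceed the fine-tuning estimate, which is the gap the description points to.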