
On June 11, 2025, from 10:00 to 11:00, the National Competence Centres for HPC in Slovakia and Italy will host an exclusive webinar with speaker Marek Dobeš on language modelling for low-resource languages.
The rise of large language models (LLMs), such as GPT and LLaMA, has highlighted a significant challenge: the scarcity of data for less widely spoken languages, including Slovak. This limits the quality of AI models for these languages. Our project aims to overcome this barrier through innovative and advanced strategies, supported by the Leonardo supercomputer, one of the most powerful in Europe.
Key strategies of the project include:
- Bilingual Slovak-English dataset generation: Automatic translation assisted by LLaMA 3.3 70B Instruct to create high-quality datasets for training language models.
- Automated summarisation of scientific texts in Slovak: Gemini Flash Experimental is used to produce summaries of research articles from the PLOS database, strengthening the models' command of specialised terminology.
- Cultural context enrichment for Slovak: Development of dedicated datasets to improve the understanding of Slovak cultural and contextual topics, which existing models such as ChatGPT still cover poorly.
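As a rough illustration of the first strategy, the sketch below pairs English source texts with Slovak translations and keeps only pairs that pass basic sanity checks. It is a minimal sketch, not the project's actual pipeline: the `translate` callable is a hypothetical stand-in for a call to a model such as LLaMA 3.3 70B Instruct, and the function names, the `en`/`sk` field names and the length-ratio threshold are all assumptions made for this example.

```python
import json
from typing import Callable, Iterable, Iterator


def build_parallel_records(
    english_texts: Iterable[str],
    translate: Callable[[str], str],  # hypothetical: wraps the LLM translation call
    max_length_ratio: float = 3.0,    # assumed threshold, not taken from the project
) -> Iterator[dict]:
    """Yield {"en": ..., "sk": ...} pairs that pass simple quality filters."""
    for text in english_texts:
        source = text.strip()
        if not source:
            continue  # skip empty inputs
        target = translate(source).strip()
        if not target or target == source:
            continue  # skip missing or untranslated outputs
        # Reject pairs whose character lengths differ implausibly.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        yield {"en": source, "sk": target}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialise records as JSON Lines, keeping Slovak diacritics readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

In practice `translate` would send each text to the translation model; for a quick check it can be replaced by any function that maps a string to a string, such as a dictionary lookup.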
The project uses the computing power of the Slovak national supercomputer Devana and the Leonardo supercomputer in Italy to train next-generation language models, thereby improving the accuracy and cultural relevance of LLMs in Slovak.
Although the focus is on Slovak, the methods developed are applicable to many other low-resource languages worldwide. The project invites international collaborators to join in promoting European cooperation in high-performance computing and fostering a more inclusive, multilingual artificial intelligence.
Join the webinar to discover how the combination of supercomputing and linguistic innovation can open new frontiers for the development of language models for underrepresented languages!