

- Project duration: December 2024 – April 2026 (16.5 months).
- Project Implementer: State Agency for Digital Solutions (VSSA)
- Project manager: Arminas Rakauskas
- Contract value: EUR 5,100,000
- Consortium: VDU, UAB Neurotechnology, UAB Tilde Lietuva, MB Krilas
- Consortium leader: Andrius Utka
The main objective of the project is to collect the necessary linguistic resources, organize them appropriately, and prepare a comprehensive, large-scale, high-quality Lithuanian language corpus that meets the needs of artificial intelligence technology development and of digital and statistical language research. The corpus will then be used to develop pre-trained neural models of the Lithuanian language that possess comprehensive factual knowledge of the Lithuanian language.
1.1. Project objectives and results
- Compile the General Lithuanian Language Corpus (BLKT).
- Develop two vectorized models of the Lithuanian language.
- Develop a software solution that would enable text generation.
1.2. Procurement objectives and results
- Create the General Lithuanian Language Corpus (BLKT) of appropriate scope and detail, with a target size of 3.5 billion words.
- Create two Lithuanian language vectorized models (small and large) based on the corpus.
- Develop the validation tools specified in the technical specifications.
- Make the project results available via open access, as specified in the technical specifications.
- Make the project results available to users as open resources that anyone can use freely and at no cost.
- Provide services of the highest quality in accordance with the technical specifications and the project schedule.
2. Completed results
- On November 3, 2025, the first practical result of the project, the Small Lithuanian Language Vectorized Model (LT-MLKM-modernBERT), became publicly available and open for use. The model is hosted on the Hugging Face platform and can be accessed via this link: https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT
- On April 15, 2026, the General Lithuanian Language Corpus became publicly available in the CLARIN-LT and Hugging Face (https://huggingface.co/datasets/VSSA-SDSA/LT_AI_BLKT) repositories.






