Project duration: December 2024 – April 2026 (16.5 months).
Project implementer: State Agency for Digital Solutions (VSSA)
Project manager: Arminas Rakauskas
- Contract value: 5,100,000 euro
- Consortium: VDU, UAB Neurotechnology, UAB Tilde Lietuva, MB Krilas
- Consortium leader: Andrius Utka

The main objective of the project is to collect the necessary linguistic resources, organise them appropriately, and prepare a comprehensive, large-scale, high-quality Lithuanian language corpus that meets the needs of artificial intelligence technology development and of digital and statistical language research. The corpus is then to be used to develop pre-trained neural models of Lithuanian that possess comprehensive factual knowledge of the language.

1.1. Project objectives and results

1. Compile the General Lithuanian Language Corpus (BLKT).
2. Develop two vectorised models of the Lithuanian language.
3. Develop a software solution that would enable text generation.

1.2. Procurement objectives and results

- Create a General Lithuanian Language Corpus (BLKT) of appropriate detail, with a scope of 3.5 billion words.
- Create two Lithuanian language vectorised models (small and large) based on the corpus.
- Develop the validation tools specified in the technical specifications.
- Make the project results available via open access, as specified in the technical specifications.
- Make the project results available to users as open resources that anyone can use freely and at no cost.
- Provide services of the highest quality in accordance with the technical specifications and the project schedule.

2. Completed results

- On November 3, 2025, the first practical result of the project, the Small Lithuanian Language Vectorised Model (LT-MLKM-modernBERT), became publicly available and open for use. The model is available on the Hugging Face platform and can be accessed via this link: https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT
- On April 15, 2026, the General Lithuanian Language Corpus became publicly available in the CLARIN-LT (https://hdl.handle.net/20.500.11821/95) and Hugging Face (https://huggingface.co/datasets/VSSA-SDSA/LT_AI_BLKT) repositories.
- On April 15, 2026, the Large Lithuanian Language Vectorised Model (DLKVM) became publicly available and open for use. The model is available on the Hugging Face platform and can be accessed via this link: https://huggingface.co/VSSA-SDSA/LT_AI_DLKVM
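
For readers who want to try the published resources, the following is a minimal Python sketch, not an official usage guide. It derives the Hugging Face repository identifiers from the links listed above. The helper names (repo_id_from_url, load_small_model, stream_corpus) are hypothetical. The sketch also assumes that the model loads through the generic Auto* classes of the transformers library, and that the corpus can be opened with the datasets library without naming a configuration. Neither assumption is confirmed here, and access may first require accepting the licence terms on the platform.

```python
from urllib.parse import urlparse

# URLs copied from the list of completed results.
SMALL_MODEL_URL = "https://huggingface.co/VSSA-SDSA/LT-MLKM-modernBERT"
LARGE_MODEL_URL = "https://huggingface.co/VSSA-SDSA/LT_AI_DLKVM"
CORPUS_URL = "https://huggingface.co/datasets/VSSA-SDSA/LT_AI_BLKT"


def repo_id_from_url(url):
    """Split a Hugging Face URL into (repo_type, "owner/name")."""
    parts = urlparse(url).path.strip("/").split("/")
    if parts[0] == "datasets":
        return "dataset", "/".join(parts[1:3])
    return "model", "/".join(parts[:2])


def load_small_model():
    """Hypothetical helper: download the small model and its tokenizer.

    Assumes `transformers` and `torch` are installed and that the
    repository loads through the generic Auto* classes.
    """
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    _, repo_id = repo_id_from_url(SMALL_MODEL_URL)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id)
    return tokenizer, model


def stream_corpus():
    """Hypothetical helper: open the corpus without downloading it in full.

    Assumes `datasets` is installed; the repository may require a
    configuration name or split that is not stated in this document.
    """
    from datasets import load_dataset  # imported lazily

    _, repo_id = repo_id_from_url(CORPUS_URL)
    return load_dataset(repo_id, streaming=True)
```

The heavy libraries are imported inside the helpers, so the identifier parsing can be checked without installing them or downloading anything.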

3. Licensing

The corpus and models are licensed under NewGenLTU openRAIL licences:
- NewGenLTU -M openRAIL (for models)
- NewGenLTU -D openRAIL (for datasets)