Development of the General Lithuanian Language Corpus and Vectorized Models

  • Project duration: December 2024 – April 2026 (16.5 months)
  • Project implementer: State Agency for Digital Solutions (VSSA)
  • Project manager: Arminas Rakauskas
  • Contract value: EUR 5,100,000
  • Consortium: VDU, UAB Neurotechnology, UAB Tilde Lietuva, MB Krilas
  • Consortium leader: Andrius Utka

The main objective of the project is to collect the necessary linguistic resources, organize them appropriately, and prepare a comprehensive, large-scale, high-quality Lithuanian language corpus that meets the needs of artificial intelligence technology development and of digital and statistical language research. The corpus will then be used to develop pre-trained neural models of the Lithuanian language that possess comprehensive factual knowledge of the Lithuanian language.

1.1. Project objectives and results

    1. Compile the General Lithuanian Language Corpus (BLKT).
    2. Develop two vectorized models of the Lithuanian language.
    3. Develop a software solution that enables text generation.

1.2. Procurement objectives and results

    • Create the General Lithuanian Language Corpus (BLKT) of appropriate scope and detail, with a target size of 3.5 billion words.
    • Create two Lithuanian language vectorized models (small and large) based on the corpus.
    • Develop the validation tools specified in the technical specifications.
    • Make the project results available via open access, as specified in the technical specifications.
    • Make the project results available to users as open resources that anyone can use freely and at no cost.
    • Provide services of the highest quality in accordance with the technical specifications and the project schedule.

2. Completed results

    Project Implementer

    Consortium

    Data providers