MATAS – Morphologically Annotated Lithuanian Corpus

MATAS is a manually checked, morphologically annotated corpus of Lithuanian.

The compilation of MATAS began at CCL in 2000-2005, with support from the State Commission of the Lithuanian Language and the State Fund of Science and Studies. The corpus was later developed further within the EU structural fund projects SEMANTIKA and SEMANTIKA-2 (see Projects).


MATAS 1.0

MATAS 1.0 can be downloaded from the CLARIN-LT repository here:

https://clarin.vdu.lt/xmlui/handle/20.500.11821/33

SIZE

  • Wordform count: 1,693,410
  • Sentence count: 144,047

GENRES

The corpus contains 5 genres:

  • Documents (14%)
  • Fiction (19%)
  • Periodicals (36%)
  • Scientific texts (24%)
  • Transcripts (7%)

Version 1.0 introduces major improvements over the previous release:

  • Many errors and inconsistencies have been corrected;
  • Substandard and obscene language has been marked up;
  • Texts are provided in two formats: tab-delimited word-per-line (TAB-WPL) and CoNLL-U;
  • Three different tagsets are used: MULTEXT-East, UD and Jablonskis.
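
For readers working with the CoNLL-U files, the sketch below shows one way to read such a file using only the Python standard library. It assumes the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with `#`, blank lines between sentences). The function name `read_conllu` and the one-token sample are illustrative assumptions and are not taken from the MATAS distribution.

```python
from typing import Iterator

# Column names defined by the CoNLL-U format (ten tab-separated fields).
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text: str) -> Iterator[list[dict[str, str]]]:
    """Yield sentences as lists of token dictionaries (hypothetical helper)."""
    sentence: list[dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            # Skip lines that do not have exactly ten columns.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hand-made sample for illustration only; not an excerpt from MATAS.
SAMPLE = (
    "# text = liepos\n"
    "1\tliepos\tliepa\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
first_token = sentences[0][0]
print(first_token["form"], first_token["lemma"], first_token["upos"])
```

To process a downloaded file, the same function can be applied to text read with `open(path, encoding="utf-8").read()`, where `path` is whatever file name the archive actually contains.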

MATAS 0.2

MATAS 0.2 can be downloaded from the CLARIN-LT repository here:

http://hdl.handle.net/20.500.11821/9

The corpus contains 4 parts:

  • Documents (21%)
  • Fiction (19%)
  • Periodicals (36%)
  • Scientific texts (24%)

Wordform count: 1,641,263

Version: v0.2

Files: 92

Encoding: UTF-8

Tagset: Human-readable (Lithuanian tags), e.g. <word="liepos" lemma="liepa" type="dktv mot.gim vnsk K">
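
The annotation line shown above can be split into its attributes with a short regular expression. This is a minimal sketch that assumes every token follows the `name="value"` pattern of the example; the helper name `parse_token` is hypothetical. The tags in `type` appear to abbreviate Lithuanian grammatical terms (presumably noun, feminine gender, singular, genitive), and the code treats them as opaque strings.

```python
import re

# Matches name="value" pairs; typographic quotes are accepted as well,
# in case the text was copied from a page that converted straight quotes.
ATTR_RE = re.compile(r'(\w+)\s*=\s*["“”]([^"“”]*)["“”]')


def parse_token(line: str) -> dict[str, str]:
    """Return the attributes of one annotation line as a dictionary."""
    return dict(ATTR_RE.findall(line))


token = parse_token('<word="liepos" lemma="liepa" type="dktv mot.gim vnsk K">')
tags = token["type"].split()  # whitespace-separated morphological tags

print(token["word"], token["lemma"], tags)
```

Splitting `type` on whitespace keeps compound tags such as `mot.gim` intact, which seems to match how the example groups them.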

Date: 2014.08.06

Please use the following text to cite this item: Rimkutė E., Daudaravičius V., Utka A. 2007: Morphological Annotation of the Lithuanian Corpus. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; Workshop Balto-Slavonic Natural Language Processing 2007, Prague, 94–99.