Corpora | SITTI

Corpora are large sets of electronic texts that are commonly used to analyse natural language usage.

The following corpora have been compiled at CCL:

Monolingual

Corpus	Language	Annotation	Size
Corpus of Contemporary Lithuanian Language	Lithuanian	–	140.9m words
CORPUS.VDU.LT	Lithuanian	morphology	208.4m words
MATAS	Lithuanian	morphology	1.6m words
ALKSNIS 2.0	Lithuanian	syntax	2,355 sentences
ALKSNIS 3.0	Lithuanian	syntax	3,643 sentences
DELFI corpus	Lithuanian	morphology	70m words

Parallel corpora

Corpus	Language	Annotation	Size
Lygiagretus tekstynas	English-Lithuanian	–	2.025m words
Lygiagretus tekstynas	Lithuanian-English	–	0.061m words
Lygiagretus tekstynas	Czech-Lithuanian	–	0.536m words
Lygiagretus tekstynas	Lithuanian-Czech	–	0.021m words
LILA	Lithuanian-Latvian-Lithuanian	–	9.360m words

Parallel corpora pair original texts with their translations in two or more languages. To be usable, the texts must be aligned, most commonly sentence by sentence.
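
As a rough illustration of sentence-by-sentence alignment, the sketch below pairs sentences purely by position. The function name `align_by_position` and the two example sentences are made up for illustration and are not taken from any corpus listed above. Real alignment is harder than this, because one sentence may be translated as two or merged with its neighbour.

```python
from typing import List, Tuple


def align_by_position(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair sentences one-to-one by their position in each text.

    Assumption: both texts were split into the same number of sentences.
    This naive approach does not handle one-to-many or many-to-one cases.
    """
    if len(source) != len(target):
        raise ValueError("one-to-one alignment needs the same number of sentences on both sides")
    return list(zip(source, target))


# Illustrative sentences only, not drawn from a real corpus.
english = ["Good morning.", "Thank you."]
lithuanian = ["Labas rytas.", "Ačiū."]

pairs = align_by_position(english, lithuanian)
for original, translation in pairs:
    print(f"{original}\t{translation}")
```

Each resulting pair is one aligned unit, which is what makes it possible to look up a sentence together with its translation.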

Annotated (or tagged) corpora are corpora in which structural, grammatical or semantic text elements are marked up with special meta tags (annotations).
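
A minimal sketch of what morphological annotation might look like as data is given below. The `Token` class, the `form/TAG` output format and the tag labels `ADJ` and `NOUN` are assumptions chosen for illustration. They are not the tagset or the file format of the corpora listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary form
    pos: str    # part-of-speech tag (illustrative labels, not a real tagset)


def to_tagged_text(tokens: List[Token]) -> str:
    """Render tokens in a simple form/TAG notation, separated by spaces."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


# Illustrative example only, not drawn from a real corpus.
sentence = [
    Token(form="Labas", lemma="labas", pos="ADJ"),
    Token(form="rytas", lemma="rytas", pos="NOUN"),
]

tagged = to_tagged_text(sentence)
print(tagged)  # Labas/ADJ rytas/NOUN
```

Keeping the annotation next to each token, rather than in a separate file, is one common design. It lets a search match on the word form, the lemma or the tag with the same lookup.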