Corpora

Corpora are large sets of electronic texts that are commonly used to analyse natural language usage.

The following corpora have been compiled at CCL:

Monolingual corpora

Corpus                                      Language    Annotation  Size
Corpus of Contemporary Lithuanian Language  Lithuanian  -           140.9m words
CORPUS.VDU.LT                               Lithuanian  morphology  208.4m words
MATAS                                       Lithuanian  morphology  1.6m words
ALKSNIS 2.0                                 Lithuanian  syntax      2,355 sentences
ALKSNIS 3.0                                 Lithuanian  syntax      3,643 sentences
DELFI corpus                                Lithuanian  morphology  70m words

Parallel corpora

Corpus                 Language                       Annotation  Size
Lygiagretus tekstynas  English-Lithuanian             -           2.025m words
Lygiagretus tekstynas  Lithuanian-English             -           0.061m words
Lygiagretus tekstynas  Czech-Lithuanian               -           0.536m words
Lygiagretus tekstynas  Lithuanian-Czech               -           0.021m words
LILA                   Lithuanian-Latvian-Lithuanian  -           9.360m words

Parallel corpora consist of original texts paired with their translations and may cover two or more languages. To be usable, the texts must be aligned, most commonly sentence by sentence.
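
As an illustration of sentence-by-sentence alignment, here is a minimal Python sketch. It assumes the simplest case, in which every source sentence has exactly one translated counterpart. The names AlignedPair, align_one_to_one and find_translations, and the two sample sentence pairs, are hypothetical and are not taken from any of the corpora listed above.

```python
import re
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """One alignment unit: a source sentence and its translation."""
    source: str
    target: str


def align_one_to_one(source_sents, target_sents):
    """Pair sentences by position; assumes a strict 1:1 translation."""
    if len(source_sents) != len(target_sents):
        raise ValueError("1:1 alignment needs equal sentence counts")
    return [AlignedPair(s, t) for s, t in zip(source_sents, target_sents)]


def find_translations(pairs, word):
    """Return the target sentences whose source sentence contains `word`."""
    needle = word.lower()
    return [
        p.target
        for p in pairs
        if needle in re.findall(r"\w+", p.source.lower())
    ]


# Hypothetical sample data (Lithuanian source, English translation).
lt = ["Labas rytas.", "Ačiū."]
en = ["Good morning.", "Thank you."]
pairs = align_one_to_one(lt, en)

print(find_translations(pairs, "ačiū"))
```

Real translations often merge or split sentences, so an actual alignment may also need to link one sentence to several. This sketch does not handle that case.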

Annotated (or tagged) corpora are corpora in which structural, grammatical or semantic text elements are marked up with special meta tags (annotations).
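
To make the idea of annotation concrete, the following Python sketch represents each token together with a lemma and a part-of-speech tag. It is only an illustration: the Token class, the helper functions, the sample sentence and the tag labels (NOUN, VERB, PUNCT) are assumptions chosen for the example and do not reflect the annotation schemes actually used in the corpora listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A word form plus its annotations."""
    form: str   # the word as it appears in the text
    lemma: str  # dictionary (base) form
    pos: str    # part-of-speech tag (illustrative label set)


# Hypothetical annotated sentence: "Vaikai skaito knygas." (Children read books.)
sentence = [
    Token("Vaikai", "vaikas", "NOUN"),
    Token("skaito", "skaityti", "VERB"),
    Token("knygas", "knyga", "NOUN"),
    Token(".", ".", "PUNCT"),
]


def to_tagged_text(tokens):
    """Serialise tokens as form/lemma/tag triples separated by spaces."""
    return " ".join(f"{t.form}/{t.lemma}/{t.pos}" for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Collect the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]


print(to_tagged_text(sentence))
print(lemmas_with_pos(sentence, "NOUN"))
```

With annotations stored this way, a query such as "all noun lemmas" becomes a simple filter over the tags, which is not possible on raw text alone.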