Corpora are large sets of electronic texts that are commonly used to analyse natural language usage.
The following corpora are compiled in CCL:
|Corpus||Language||Annotation||Size|
|Corpus of Contemporary Lithuanian Language||Lithuanian||–||140.9 million words|
|ALKSNIS 2.0||Lithuanian||syntax||2,355 sentences|
|ALKSNIS 3.0||Lithuanian||syntax||3,643 sentences|
|DELFI corpus||Lithuanian||morphology||70 million words|
|Lygiagretus tekstynas||English-Lithuanian||–||2.025 million words|
Parallel corpora pair original texts with their translations and may contain texts in two or more languages. To be usable, the texts need to be aligned, commonly sentence by sentence.
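To make the idea of sentence-by-sentence alignment concrete, here is a minimal Python sketch. It assumes the source and target texts have already been split into sentences and that each source sentence corresponds to exactly one target sentence. The function name `align_one_to_one` and the sample sentences are illustrative only and are not taken from any CCL corpus or tool.

```python
from typing import List, Tuple


def align_one_to_one(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th source sentence with the i-th target sentence.

    Assumption: both texts are already sentence-split and translate
    one sentence to one sentence. Real texts often break this assumption.
    """
    if len(source) != len(target):
        raise ValueError("texts differ in sentence count; one-to-one alignment is not possible")
    return list(zip(source, target))


# Illustrative data only.
english = ["Good morning.", "Thank you."]
lithuanian = ["Labas rytas.", "Ačiū."]

pairs = align_one_to_one(english, lithuanian)
for en, lt in pairs:
    print(f"{en}\t{lt}")
```

In practice translators merge, split, or omit sentences, so real aligners also have to handle one-to-many and many-to-one correspondences. This sketch deliberately rejects such input rather than guessing.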
Annotated (or tagged) corpora are corpora in which structural, grammatical or semantic text elements are marked up with special meta tags (annotations).
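As a rough illustration of grammatical annotation, the Python sketch below attaches a lemma and a part-of-speech tag to each token and renders the result as inline `word/TAG` text. The `Token` class, the `to_tagged_text` helper, the tag labels and the output format are all assumptions made for this example. They do not describe the annotation scheme or file format of the corpora listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary form
    pos: str    # part-of-speech tag (illustrative labels)


def to_tagged_text(tokens: List[Token]) -> str:
    """Render tokens as inline 'form/POS' markup, separated by spaces."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


# Illustrative annotation of the phrase "Labas rytas" ("Good morning").
tokens = [
    Token(form="Labas", lemma="labas", pos="ADJ"),
    Token(form="rytas", lemma="rytas", pos="NOUN"),
]

print(to_tagged_text(tokens))
```

Keeping the annotations in a structured record rather than only in the inline string makes it possible to query the corpus by lemma or tag as well as by surface form.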