Corpora are large sets of electronic texts that are commonly used to analyse natural language usage.
The following corpora have been compiled at CCL:
Monolingual corpora
Corpus | Language | Annotation | Size |
---|---|---|---|
Corpus of Contemporary Lithuanian Language | Lithuanian | – | 140.9m words |
CORPUS.VDU.LT | Lithuanian | morphology | 208.4m words |
MATAS | Lithuanian | morphology | 1.6m words |
ALKSNIS 2.0 | Lithuanian | syntax | 2,355 sentences |
ALKSNIS 3.0 | Lithuanian | syntax | 3,643 sentences |
DELFI corpus | Lithuanian | morphology | 70m words |
Parallel corpora
Corpus | Language | Annotation | Size |
---|---|---|---|
Lygiagretus tekstynas | English-Lithuanian | – | 2.025m words |
Lygiagretus tekstynas | Lithuanian-English | – | 0.061m words |
Lygiagretus tekstynas | Czech-Lithuanian | – | 0.536m words |
Lygiagretus tekstynas | Lithuanian-Czech | – | 0.021m words |
LILA | Lithuanian-Latvian-Lithuanian | – | 9.360m words |
Parallel corpora consist of original texts aligned with their translations, usually sentence by sentence. They may contain texts in two or more languages, and the texts must be aligned before the corpus can be used.
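As a rough illustration of sentence-by-sentence alignment, the sketch below pairs each source sentence with its translation by position. It assumes a strict one-to-one alignment, which real parallel corpora do not always have. The `AlignedPair` class, the `align_one_to_one` function and the example sentences are hypothetical and are not taken from the corpora listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    source: str  # sentence in the original language
    target: str  # its translation


def align_one_to_one(source_sentences, target_sentences):
    """Pair sentences by position, assuming a strict 1:1 alignment."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("1:1 alignment needs the same number of sentences on both sides")
    return [AlignedPair(s, t) for s, t in zip(source_sentences, target_sentences)]


# Illustrative English-Lithuanian sentences (made up for this example).
english = ["Good morning.", "Thank you."]
lithuanian = ["Labas rytas.", "Ačiū."]

pairs = align_one_to_one(english, lithuanian)
for pair in pairs:
    print(f"{pair.source}\t{pair.target}")
```

Storing each pair as one record keeps a sentence and its translation together, so a search hit in one language can show the aligned sentence in the other.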
Annotated (or tagged) corpora are corpora in which structural, grammatical or semantic text elements are marked up with special meta tags (annotations).
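A minimal sketch of what such mark-up might look like for grammatical annotation is given below. Each token carries its word form, lemma and a part-of-speech tag. The `Token` structure, the `to_tagged_text` helper and the tag labels (`ADJ`, `NOUN`, `PUNCT`) are assumptions made for illustration; they are not the tagset or format used by the corpora listed above.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary form
    pos: str    # part-of-speech tag (hypothetical tagset)


# One annotated sentence, made up for this example.
sentence = [
    Token("Labas", "labas", "ADJ"),
    Token("rytas", "rytas", "NOUN"),
    Token(".", ".", "PUNCT"),
]


def to_tagged_text(tokens):
    """Render tokens in a simple inline 'form/TAG' notation."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


tagged = to_tagged_text(sentence)
print(tagged)
```

With annotations stored per token, a query can target the tags rather than the surface text, for example by selecting every token whose `pos` is `NOUN`.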