DVITAS - Bilingual Automatic Terminology Extraction

Welcome

DVITAS (Bilingual Automatic Terminology Extraction) is a project carried out by researchers at Vytautas Magnus University and Mykolas Romeris University and funded by the Research Council of Lithuania.

The project developed a methodology for automatically extracting bilingual (English-Lithuanian) terms in a specialised domain from parallel and comparable corpora when one of the languages is under-resourced and morphologically rich. Cybersecurity (CS) terminology serves as the specialised domain. In addition, an open bilingual database of cybersecurity terms was created. It is based on empirical data and reflects the use of cybersecurity terms in texts of various genres and types in national and international settings.

The CS domain was chosen because of its special relevance to today’s information society. The area is particularly dynamic: new CS documents are constantly being drawn up and new concepts developed, yet the corresponding terminology has not been settled in Lithuanian. As a result, a new CS concept is usually expressed by several competing terms, often by the name used in the original (English) language or by a hybrid (a combination of English and Lithuanian lexical items). The CS termbase is therefore particularly relevant to drafters of legal and administrative acts, translators, IT professionals, and the general public.

To develop an innovative methodology for automatic extraction of terminological data from bilingual resources, various state-of-the-art machine learning algorithms and neural networks for bilingual term extraction were tested and applied to the CS dataset. Such methods have not previously been applied in Lithuania. We hope that the resulting database and methodology can serve as a model for developing termbases in other domains.