Machine learning for language technologies (ML²)

This interdisciplinary research group conducts research and experimental development in which state-of-the-art machine learning techniques are investigated, developed, and applied to the analysis and processing of natural language (text and speech), with a particular focus on intelligent technologies.

I. Areas of research

The research and experimental development activities of the ML² research group include, but are not limited to:

  • research and applications in speech recognition (speech-to-text) and analysis;
  • research and applications in text-to-speech synthesis using neural networks;
  • development and study of language models for morphologically rich languages (in particular Lithuanian) and neural language modelling;
  • adaptation of generic language models to the fields of law, medicine, economics, and media;
  • pronunciation modelling research;
  • acoustic modelling research;
  • research on morphological disambiguation;
  • research on automatic accentuation (stress placement) techniques;
  • development and study of word embedding models for morphologically rich languages;
  • adaptation of word embedding models to the fields of law, medicine, economics, and media;
  • research on recommender and question-answering systems;
  • aspect-based sentiment analysis;
  • neural named entity recognition;
  • development and validation of digital resources (audio libraries, language models, pronunciation models, acoustic models, word embedding models, etc.).

Resources (data and datasets). Digital data and datasets are integral to machine learning at every stage of the research life cycle: analysing data to formulate a research hypothesis, preparing data as the introductory phase of an experiment, etc. The data and datasets used in ML² studies and experiments are divided into four groups:

  • initial (text corpora, vocabularies, dictionaries);
  • intermediate;
  • derivative (a corpus produced from the initial corpus and segmented according to the input requirements of the neural network used in the study, word embedding models built from corpora, language models built from corpora, acoustic models built from audio material, etc.);
  • final (neural language models, acoustic models, etc.).

The group to which a dataset is assigned depends on the purpose of the study. For example: (1) an acoustic dataset may be a pre- and post-implementation dataset in a study investigating whether a new algorithm yields a better speech recognition solution; (2) a textual dataset may be a pre- and post-implementation dataset in a study investigating the suitability of a new algorithm for neural language modelling (the end result being a neural language model).

Technologies, tools, and solutions for neural language technologies: machine learning and deep learning, deep neural networks (recurrent neural networks, bidirectional recurrent neural networks, transformer-based models of the BERT and GPT families, etc.), and intelligent technologies.

II. Applications of areas of research and experimental development

ML² research and experimental development is applied in the following areas: general language, law, medicine, media (including social networks), economics, etc.