RP4 (Principal Developer: IICT-BAS)

LANGUAGE AND CONTENT TECHNOLOGIES FOR BIG DATA SOLUTIONS

Language and content technologies are now closely integrated with big data: language technologies provide tools for translation, interaction and analysis; content technologies enable data analytics tools that extract meaning and semantic patterns from big data; and visualization tools present the dependencies between data items in a user-friendly way. Language and content technologies cover a variety of research topics, including natural language processing, machine translation, speech processing, access to and processing of multimedia information, and data analytics. A common characteristic of the current state of the art in these research areas is the attempt to integrate the specific models developed so far into a novel, universal and powerful machine learning architecture based on deep neural networks (deep learning), which provides effective, easily adaptable and targeted big data processing. Another typical feature of the intensive penetration of ICT into all spheres of life, which is close to being a standard in the developed countries, is the mass use of multimedia interfaces with localised access in the respective national language, together with speech interfaces for various computer platforms.

Big Data has become a leading area of research and technology development, driven by free access to huge volumes of data in many different forms. The availability of Big Data, together with the rapid growth of accessible computing power, has set new challenges for language and content technologies that aim at building innovative multilingual data products and services.

1. Language and Content Technologies for Machine Translation

New Machine Translation services will be developed based on Deep Language Processing for the analysis phase (semantics-based parsing, such as parsing based on Minimal Recursion Semantics (MRS) structures) and a Deep Learning approach for the transfer phase (neural networks, graph-based approaches). This new theoretical framework will pose new challenges for incorporating existing language resources into it. The ultimate goal is to extract and structure the encoded knowledge and to bring it to the end users in an appropriate form as content models. A meaningful abstraction mechanism over sets of facts will be developed in order to support system-independent (knowledge-level) planning modules for Natural Language Generation over Big Smart Linked Open Data. Furthermore, the task will develop semantics-based multilingual Natural Language Understanding and Generation modules with content feedback with respect to Big Smart Data. The proposed innovative services will target translation for public bodies and language data providers (legal documents, patents, bank information, etc.). All research prototypes will be tested extensively for Bulgarian, among other languages.

2. Language and Content Technologies for Multimodal and Natural Human-Computer Interaction

One of the research directions concerns new intelligent voice-activated technologies, including speech recognition and interpretation of meaning, which deliver more accurate results and enable novel applications such as personal voice assistants, interactive voice response systems, transcription services and emotion recognizers. In this task we shall study unified models of the regularities in, and the relationships between, the multimodal signals arising in the process of human communication, e.g. the interaction of brain activity, emotional state and facial expression with the produced speech. From an application point of view, we shall focus on accurate and robust speech recognition as well as on natural and intelligible text-to-speech synthesis enriched with more natural prosody that reflects the emotional context, for Bulgarian and other languages. The activities will include: development of novel methods for language and signal modelling based on machine learning, e.g. extensions of or alternatives to statistical and/or deep learning models; development of approximate search algorithms; and development of methods for efficient representation of language and communication models. The resulting speech technologies will enable innovative Bulgarian and European companies and organizations from the creative industries to develop original speech processing applications.

Another research direction focuses on creating innovative services for cultural institutions that manage digital cultural heritage resources. This task will be supported by the new opportunities provided by the laboratory for 3D digitisation and by the related new methods, algorithms and technologies for image reconstruction, segmentation and processing. The intention is to develop innovative services for automatic image retrieval, classification and clustering by integrating pattern recognition, data mining and deep learning approaches with web technology.

3. Language and Content Technologies for Better Human Learning & Teaching and Data Analytics

Language and content technologies are powerful tools for improving formal and informal learning in professional, educational and other non-leisure contexts. Content technologies enable the development of advanced applications that automatically discover knowledge and reveal relationships. The research activities will include: development of novel algorithms for the analysis of free text and unstructured data, related to Big Data Analytics; automatic image annotation with multilingual keywords; automatic analysis and secondary use of electronic health records in Bulgarian; and user modelling and discovery of user-behaviour patterns via educational analytics. This research will integrate Natural Language Processing, Machine Learning, Semantic Technologies, Data Mining and related fields in the context of Big Data and Advanced Computing.