AI support to overcome perceptual, cognitive and language barriers

Supervisor:  Alex Waibel

Faculty:  Informatics

Problem statement:  
In recent years, advanced systems that automatically transcribe, translate, and synthesize human dialog have been developed at university laboratories (KIT, CMU, …) and shown to provide practical assistance in overcoming language barriers.  These research systems have also transitioned to broad practical deployment in lectures (KIT Lecture Translator), video conferencing (Zoom, …), humanitarian missions (Relater), and the European Parliament (EP, EU-Bridge, …).  But significant challenges remain to truly connect all people in a frictionless interactive experience, particularly when additional cognitive and perceptual challenges further complicate interaction.  Current technology delivers automatic subtitling and translation in textual form in real time, but for many participants with reading disabilities, auditory or visual impairments, or educational or lexical limitations, following a fast textual transcript still presents too great a challenge for free, frictionless engagement with others.

Project:
In this project, we aim to go beyond the state of the art in simultaneous speech translation to address these challenges:

  • Improve readability of text transcripts and translations:  better structuring, highlighting, punctuation, segmentation, and multimodal integration to make presentation texts easier to follow and read.  This is particularly important (and as yet unsolved) for conversational speech in multi-party meetings, where fast turn-taking, disfluencies, spontaneous speech, and incomplete sentence fragments make real-time reading comprehension difficult or impossible (see the first sketch after this list).
  • Summarization:  reducing transcribed and translated text to better capture the essence of speech (particularly when it is conversational) considerably eases comprehension.  These methods will also be interactive, so that suitable summaries can be customized by the user (see the second sketch after this list).
  • Text Simplification:  when presentations use complex technical terms, jargon, or lengthy discussion, the text should be translated into simpler text: simpler words and more basic, colloquial phrasing accessible to readers with more limited reading abilities (e.g., adult vs. child, expert vs. novice, native speaker vs. novice learner).  Machine translation methods can be applied to perform this transformation, as illustrated in the same sketch.
  • Multimodal input/output:  more than speech and text must be considered.  The combination of text and speech synthesis, speech and text with emphasis, visual highlighting of text, input or output generation of visual scenes, automatic cross-referencing to related material, and automatic integration of audio-visual material must be supported, as these provide richer, complementary dimensions that can significantly enhance comprehension, depending on the participant.
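
To make the first bullet concrete, the following is a minimal, self-contained sketch (pure Python) of one readability step for conversational speech: removing filler words and breaking the recognized word stream into caption-sized segments at long pauses.  All names, thresholds, and the input format are illustrative assumptions; a deployed system would combine such heuristics with learned punctuation, casing, and segmentation models.

  FILLERS = {"uh", "um", "uhm", "er", "hmm"}
  PAUSE_THRESHOLD = 0.6  # seconds of silence that triggers a segment break (assumed value)

  def segment(words):
      """words: list of (token, start_time, end_time) tuples, as a speech
      recognizer with word-level timestamps would typically emit."""
      segments, current = [], []
      prev_end = None
      for token, start, end in words:
          if token.lower() in FILLERS:
              continue  # disfluencies hinder real-time reading comprehension
          if prev_end is not None and start - prev_end > PAUSE_THRESHOLD and current:
              segments.append(" ".join(current))  # break the caption at a long pause
              current = []
          current.append(token)
          prev_end = end
      if current:
          segments.append(" ".join(current))
      return segments

  stream = [("so", 0.0, 0.2), ("um", 0.3, 0.5), ("we", 0.6, 0.7),
            ("start", 0.8, 1.1), ("now", 1.2, 1.4),
            ("first", 2.3, 2.6), ("the", 2.7, 2.8), ("data", 2.9, 3.2)]
  print(segment(stream))  # ['so we start now', 'first the data']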
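
As an illustration of the summarization and simplification directions, the second sketch frames both as sequence-to-sequence generation using the Hugging Face transformers library.  The summarization checkpoint named below is a real public model; the simplification checkpoint is a hypothetical placeholder standing in for a model fine-tuned on parallel complex/simple sentence pairs.

  from transformers import pipeline

  # Summarization: condense a (possibly translated) conversational
  # transcript to its essence.
  summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
  transcript = ("So, um, what we basically agreed on after a lot of back "
                "and forth is that the lecture translator will stream "
                "subtitles to every participant in their own language, "
                "and that summaries will be generated on request.")
  print(summarizer(transcript, max_length=40, min_length=10)[0]["summary_text"])

  # Simplification framed as monolingual machine translation
  # (complex -> simple).  "my-org/simplification-model" is hypothetical;
  # in practice a seq2seq model (e.g., BART or T5) would be fine-tuned
  # on complex/simple sentence pairs.
  simplifier = pipeline("text2text-generation", model="my-org/simplification-model")
  print(simplifier("The committee ratified the proposal unanimously.")[0]["generated_text"])

The interactivity called for in the summarization bullet would then amount to exposing parameters such as summary length or target reading level to the user.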


Desired qualification of the PhD student:

  • University degree (M.Sc.) in Computer Science or Electrical Engineering.  Excellent academic record and excellent grades.
  • Deep understanding and knowledge of machine learning, neural networks, and perceptual interface methods and technologies.
  • Demonstrated interest in or experience with speech, image, and language processing, and with the pertinent concepts and software tools.
  • Strong coding skills in at least one programming or scripting language (preferably C++, C, or Python).  Experience with app development, JavaScript, or equivalent web presentation technologies is also encouraged/preferred.