Equalizing Speaker, Speaking Style and Environment-Induced Variability in Audio Streams for Robust Speech Engines
As speech technology matures, the requirements on speech-enabled interfaces expand from simplistic tasks in controlled environments to naturalistic, human-like interactions. This demand is fueled by the ever-increasing and ubiquitous presence of technology that requires human input, be it smartphones, car navigation systems, automated information kiosks, or home appliances. The rising expectations come hand-in-hand with obstacles the designers of speech engines need to address. One major challenge is the input audio stream variability introduced by various speaker- and environment-related factors. Inter- and intra-speaker variability, together with changing background noise and acoustic characteristics of the environment, may render an acoustic model trained on mismatched conditions completely helpless. Multi-style training is a popular approach that addresses the issue of variability by exposing the acoustic model to a large variety of conditions during its training stage. This approach is powerful but comes with strong assumptions: that a large pool of data covering a variety of conditions and speaking styles is available, ideally with some level of transcription, and that there is access to the computational resources needed to train acoustic models on such data. In some instances, the multi-style training assumptions are difficult or impossible to fulfill (e.g., when dealing with previously unseen acoustics, noise, or an unusual speaking style). An example is the recognition of whispered speech, a speech modality that the majority of commercially available corpora do not capture. One way of increasing the robustness of speech engines to unseen conditions is to apply condition-adaptive normalization techniques that transform the acoustic streams into a space familiar to the acoustic model.
The present talk will analyze the interactions of speech production features with various talking styles and environments and present several feature normalization strategies that help alleviate the impact of these variations on speech engines.
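The abstract does not specify which normalization strategies the talk will present, but a classic, widely used example of condition-adaptive feature normalization is per-utterance cepstral mean and variance normalization (CMVN), which removes channel- and speaker-level offsets from the feature space before the features reach the acoustic model. The sketch below is a generic NumPy illustration of that idea, not the speaker's method:

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization (CMVN).

    features: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    Returns features normalized to zero mean and unit variance per
    coefficient, which compensates for constant channel/speaker offsets
    and scale differences across recording conditions.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps guards against division by zero for near-constant coefficients
    return (features - mean) / (std + eps)

# Illustration: simulated MFCC-like features with a channel bias and scaling
rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=2.0, size=(200, 13))
norm = cmvn(feats)
```

After normalization, each coefficient stream has (approximately) zero mean and unit variance regardless of the original recording channel, bringing mismatched inputs closer to the conditions the acoustic model was trained on.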
July 12th, 2018, 4:00 PM
Room 412, Building 3, SEIEE, Shanghai Jiaotong University, 800 Dongchuan Rd., Minhang District, Shanghai, China
Hynek Boril was born in Most, Czech Republic. He received the M.S. degree in electrical engineering and the Ph.D. degree in electrical engineering and information technology from the Department of Electrical Engineering, Czech Technical University in Prague, Czech Republic, in 2003 and 2008, respectively. In August 2007, he joined the Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX, USA, as a Research Associate, becoming an Assistant Research Professor in 2012. Since August 2015, he has been an Assistant Professor in the Electrical Engineering Department, University of Wisconsin–Platteville, USA, and an Adjunct Assistant Research Professor at CRSS, UT-Dallas. At UW-Platteville, he established the Pioneer Speech Signal Processing Laboratory, whose mission is to engage undergraduate students in research on speech technologies and connect them with graduate institutions and industry. He has authored/co-authored more than 60 journal and conference papers. His research interests include digital signal processing, acoustic signal modeling, and machine learning, with a focus on automatic speech and speaker recognition, language and dialect identification, stress, emotion, and cognitive-load classification, automatic assessment of physiological traits from speech signals, robustness to environmental and speaker-induced variability, and language acquisition in infants.
Hynek Boril served on the Organizing Committee of the Listening Talker (LISTA) Workshop on Natural and Synthetic Modification of Speech in Response to Listening Conditions (Edinburgh, UK, 2012) and on the Editorial Advisory Board of the book 'Technologies for Inclusive Education: Beyond Traditional Integration Approaches' (Eds. D. Griol, Z. Callejas, R. L. Cozar), IGI Global, 2012. He has served as an external reviewer for the Ministry of Business, Innovation, and Employment of New Zealand (MBIE) and as an independent expert in two patent infringement cases in the field of automatic speech and speaker recognition, two patent validity reexamination cases in the field of automatic speech recognition, and a voice forensics case.