The experts below are selected from a list of 9,159 experts worldwide ranked by the ideXlab platform.
Li Deng - One of the best experts on this subject based on the ideXlab platform.
-
Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends
IEEE Signal Processing Magazine, 2015
Co-Authors: Zhen-Hua Ling, Shi-Yin Kang, Mike Schuster, Xiao-Jun Qian, Andrew Senior, Helen Meng, Heiga Zen, Li Deng
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature.
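As a concrete illustration of the DNN-based acoustic models this review surveys, below is a minimal NumPy sketch of a feed-forward network mapping per-frame linguistic features to acoustic features. The feature dimensions, layer sizes, and random weights are illustrative assumptions, not values from the paper; a trained system would learn the weights from aligned linguistic-acoustic data.

```python
# Minimal sketch of a DNN acoustic model for parametric speech generation:
# a feed-forward network mapping per-frame linguistic features (derived from
# the symbolic input) to acoustic features. All dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)

LINGUISTIC_DIM = 300   # assumed frame-level linguistic context features
ACOUSTIC_DIM = 43      # assumed: e.g., 40 spectral coeffs + F0 + energy + V/UV
HIDDEN = 256

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0, 0.01, (LINGUISTIC_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.01, (HIDDEN, ACOUSTIC_DIM))
b2 = np.zeros(ACOUSTIC_DIM)

def dnn_acoustic_model(linguistic_frames: np.ndarray) -> np.ndarray:
    """Map a (T, LINGUISTIC_DIM) sequence of frame-level linguistic features
    to a (T, ACOUSTIC_DIM) sequence of acoustic features."""
    h = np.tanh(linguistic_frames @ W1 + b1)   # nonlinear hidden layer
    return h @ W2 + b2                         # linear output layer

frames = rng.normal(size=(100, LINGUISTIC_DIM))  # 100 frames of dummy input
acoustic = dnn_acoustic_model(frames)
print(acoustic.shape)  # (100, 43)
```

In a full pipeline, the predicted acoustic feature sequence would then drive a vocoder to produce the low-level waveform.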
-
A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition
Speech Communication, 1998
Co-Authors: Li Deng
Abstract: An overview of a statistical paradigm for speech recognition is given where phonetic and phonological knowledge sources, drawn from the current understanding of the global characteristics of human speech communication, are seamlessly integrated into the structure of a stochastic model of speech. A consistent statistical formalism is presented in which the submodels for the discrete, feature-based phonological process and the continuous, dynamic phonetic process in human speech production are computationally interfaced. This interface enables global optimization of a parsimonious set of model parameters that accurately characterize the symbolic, dynamic, and static components in speech production and explicitly separates distinct sources of the speech variability observable at the acoustic level. The formalism is founded on a rigorous mathematical basis, encompassing computational phonology, Bayesian analysis and statistical estimation theory, nonstationary time series and dynamic system theory, and nonlinear function approximation (neural network) theory. Two principal ways of implementing the speech model and recognizer are presented: one based on the trended hidden Markov model (HMM), or explicitly defined trajectory model, and the other on the state-space, or recursively defined, trajectory model. Both implementations build into their respective recognition and model-training algorithms a continuity constraint on the internal, production-affiliated trajectories across feature-defined phonological units. The continuity and the parameterized structure in the dynamic speech model permit a joint characterization of the contextual and speaking-style variations manifested in speech acoustics, thereby holding promise to overcome some key limitations of current speech recognition technology.
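For concreteness, the two implementations contrasted in this abstract can be written schematically as follows. The notation (trend order P, state-transition matrix, articulatory target, observation map) is an illustrative reconstruction, not the paper's exact parameterization.

```latex
% Schematic forms of the two trajectory models (illustrative notation).
% Trended HMM (explicitly defined trajectory): a deterministic polynomial
% trend per phonological state s_t plus a stationary residual:
o_t = \sum_{p=0}^{P} \Lambda_{s_t}^{(p)}\, t^{p} + r_t,
\qquad r_t \sim \mathcal{N}\!\left(0, \Sigma_{s_t}\right).
% State-space (recursively defined trajectory): a hidden production-affiliated
% state x_t moves toward a state-dependent target u_s and is mapped to
% acoustics by a nonlinear observation function h_s (e.g., a neural network):
x_{t+1} = \Phi_{s}\, x_t + (I - \Phi_{s})\, u_{s} + w_t,
\qquad o_t = h_{s}(x_t) + v_t.
```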
Shrikanth S Narayanan - One of the best experts on this subject based on the ideXlab platform.
-
Speed-accuracy tradeoffs in human speech production
PLOS ONE, 2018
Co-Authors: Adam C Lammert, Shrikanth S Narayanan, Christine H Shadle, Thomas F Quatieri
Abstract: Speech motor actions are performed quickly, while simultaneously maintaining a high degree of accuracy. Are speed and accuracy in conflict during speech production? Speed-accuracy tradeoffs have been shown in many domains of human motor action, but have not been directly examined in the domain of speech production. The present work seeks evidence for Fitts' law, a rigorous formulation of this fundamental tradeoff, in speech articulation kinematics by analyzing USC-TIMIT, a real-time magnetic resonance imaging data set of speech production. A theoretical framework for considering Fitts' law with respect to models of speech motor control is elucidated. Methodological challenges in seeking relationships consistent with Fitts' law are addressed, including the operational definitions and measurement of key variables in real-time MRI data. Results suggest the presence of speed-accuracy tradeoffs for certain types of speech production actions, with wide variability across syllable position, and substantial variability also across subjects. Coda consonant targets immediately following the syllabic nucleus show the strongest evidence of this tradeoff, with correlations as high as 0.72 between speed and accuracy. A discussion is provided concerning the potentially limited applicability of Fitts' law in the context of speech production, as well as the theoretical context for interpreting the results.
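Fitts' law states that movement time grows linearly with the index of difficulty: MT = a + b log2(2D/W), where D is movement distance and W is target width. The sketch below, on synthetic stand-in data rather than the paper's real-time MRI measurements, shows how the intercept a, slope b, and a speed-accuracy correlation of the kind reported here can be fit by least squares.

```python
# Hedged sketch: fitting Fitts' law, MT = a + b * log2(2D/W), to movement
# data. D, W, and MT below are synthetic stand-ins for the articulatory
# kinematic measurements the paper extracts from real-time MRI.
import numpy as np

rng = np.random.default_rng(1)

D = rng.uniform(5.0, 30.0, size=200)    # movement distance (mm, assumed)
W = rng.uniform(2.0, 10.0, size=200)    # target width / accuracy demand (mm)
ID = np.log2(2 * D / W)                 # Fitts' index of difficulty (bits)
MT = 0.10 + 0.05 * ID + rng.normal(0, 0.01, size=200)  # synthetic times (s)

# Ordinary least squares for MT = a + b * ID.
A = np.column_stack([np.ones_like(ID), ID])
(a, b), *_ = np.linalg.lstsq(A, MT, rcond=None)

# Correlation between difficulty and movement time, analogous in spirit to
# the speed-accuracy correlations (up to 0.72) reported in the paper.
r = np.corrcoef(ID, MT)[0, 1]
print(f"a={a:.3f} s, b={b:.3f} s/bit, r={r:.2f}")
```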
-
Spatial and temporal alignment of multimodal human speech production data: real-time imaging, flesh-point tracking and audio
International Conference on Acoustics, Speech and Signal Processing, 2013
Co-Authors: Jangwon Kim, Adam C Lammert, Prasanta Kumar Ghosh, Shrikanth S Narayanan
Abstract: In speech production research, the integration of articulatory data derived from multiple measurement modalities can provide a rich description of vocal tract dynamics by overcoming the limited spatio-temporal representations offered by individual modalities. This paper presents a spatial and temporal alignment method between two promising modalities, using a corpus of TIMIT sentences obtained from the same speaker: flesh-point tracking from electromagnetic articulography (EMA), which offers high temporal resolution but sparse spatial information, and real-time magnetic resonance imaging (MRI), which offers good spatial detail but at lower temporal rates. Spatial alignment is done using palate tracking of EMA, but distortion in MRI audio and articulatory data variability make temporal alignment challenging. This paper proposes a novel alignment technique using joint acoustic-articulatory features, which combines dynamic time warping and automatic feature extraction from MRI images. Experimental results show that the temporal alignment obtained using this technique is better (12% relative) than that using acoustic features only.
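The dynamic-time-warping component of such an alignment can be sketched as follows; the joint acoustic-articulatory feature extraction is abstracted into the two input matrices, and the feature dimensionality and sequence lengths are illustrative assumptions.

```python
# Hedged sketch of the DTW step: align two feature streams sampled at
# different rates (here stand-ins for EMA- and MRI-derived features) by
# minimizing cumulative frame-to-frame distance under a monotonic warp.
import numpy as np

def dtw_path(X: np.ndarray, Y: np.ndarray):
    """Align feature sequences X (Tx, d) and Y (Ty, d); return the warping path."""
    Tx, Ty = len(X), len(Y)
    # Pairwise Euclidean distances between frames.
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal path from (Tx, Ty) to (1, 1).
    path, i, j = [(Tx - 1, Ty - 1)], Tx, Ty
    while (i, j) != (1, 1):
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: D[ij])
        path.append((i - 1, j - 1))
    return path[::-1]

rng = np.random.default_rng(2)
ema = rng.normal(size=(120, 12))   # higher-rate stream, 12-D features (assumed)
mri = rng.normal(size=(40, 12))    # lower-rate stream
print(dtw_path(ema, mri)[:5])      # first few (ema_frame, mri_frame) pairs
```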
-
Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures
Journal of the Acoustical Society of America, 2011
Co-Authors: Prasanta Kumar Ghosh, Louis Goldstein, Shrikanth S Narayanan
Abstract: Understanding how the human speech production system is related to the human auditory system has been a perennial subject of inquiry. To investigate the production-perception link, in this paper a computational analysis has been performed using the articulatory movement data obtained during speech production, with concurrently recorded acoustic speech signals, from multiple subjects in three different languages: English, Cantonese, and Georgian. The form of articulatory gestures during speech production varies across languages, and this variation is considered to be reflected in the articulatory position and kinematics. The auditory processing of the acoustic speech signal is modeled by a parametric representation of the cochlear filterbank, which allows for realizing various candidate filterbank structures by changing the parameter value. Using mathematical communication theory, it is found that the uncertainty about the articulatory gestures in each language is maximally reduced when the acoustic speech signal is represented using the output of a filterbank similar to the empirically established cochlear filterbank in the human auditory system. Possible interpretations of this finding are discussed.
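The paper's criterion, uncertainty reduction about gestures given an acoustic representation, is mutual information. A minimal histogram-based sketch of that quantity is given below, on synthetic stand-ins for the articulatory and filterbank variables; the authors' estimator and their parametric cochlear filterbank are more elaborate.

```python
# Hedged sketch: estimate I(articulatory variable; acoustic feature) with a
# simple 2-D histogram estimator, and compare two candidate representations.
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) in bits for two 1-D samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(4)
gesture = rng.normal(size=5000)                      # e.g., constriction degree
# Two candidate acoustic representations of the same underlying gesture:
feat_good = gesture + 0.3 * rng.normal(size=5000)    # more informative output
feat_poor = gesture + 2.0 * rng.normal(size=5000)    # less informative output

# A filterbank whose features maximize I(.;.) minimizes the remaining
# uncertainty (conditional entropy) about the gesture.
print(mutual_information(gesture, feat_good), mutual_information(gesture, feat_poor))
```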
-
Data-driven analysis of realtime vocal tract MRI using correlated image regions
Conference of the International Speech Communication Association, 2010
Co-Authors: Adam C Lammert, Michael Proctor, Shrikanth S Narayanan
Abstract: Realtime MRI provides useful data about the human vocal tract, but also introduces many of the challenges of processing high-dimensional image data. Intuitively, data reduction would proceed by finding the air-tissue boundaries in the images and tracing an outline of the vocal tract. This approach is anatomically well-founded. We explore an alternative approach which is data-driven and has a complementary set of advantages. Our method directly examines pixel intensities. By analyzing how the pixels co-vary over time, we segment the image into spatially localized regions in which the pixels are highly correlated with each other. Intensity variations in these correlated regions correspond to vocal tract constrictions, which are meaningful units of speech production. We show how these regions can be extracted entirely automatically, or with manual guidance. We present two examples and discuss the method's merits, including the opportunity to do direct data-driven time series modeling.
Index Terms: human speech production, phonetics, realtime MRI, vocal tract, data reduction
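A minimal sketch of the core idea, with an assumed synthetic "video" in place of real MRI data: z-score each pixel's intensity time series, then group pixels whose series correlate strongly with a seed pixel. The full method segments the entire image into such regions; only the per-seed step is shown here.

```python
# Hedged sketch of the correlated-region idea: treat each pixel's intensity
# time series as a signal, and group pixels highly correlated with a seed.
import numpy as np

rng = np.random.default_rng(5)
T, H, W = 200, 32, 32
video = rng.normal(size=(T, H, W))
# Inject a coherent "constriction" signal into a small patch of pixels.
drive = np.sin(np.linspace(0, 20, T))
video[:, 10:14, 10:14] += 3.0 * drive[:, None, None]

X = video.reshape(T, H * W)
X = (X - X.mean(0)) / X.std(0)            # z-score each pixel's time series

seed = 11 * W + 11                        # a pixel inside the coherent patch
corr = (X * X[:, [seed]]).mean(0)         # Pearson r of every pixel vs. seed
region = (corr > 0.7).reshape(H, W)       # threshold into a correlated region
print(region.sum(), "pixels in the region around the seed")  # ~the 4x4 patch
```

In such a scheme, the mean intensity of each extracted region over time serves directly as a low-dimensional constriction signal for time series modeling.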
-
Seeing speech: capturing vocal tract shaping using real-time magnetic resonance imaging
IEEE Signal Processing Magazine, 2008
Co-Authors: Erik Bresch, Yoon-Chul Kim, Krishna S Nayak, Dani Byrd, Shrikanth S Narayanan
Abstract: Understanding human speech production is of great interest from engineering, linguistic, and several other research points of view. While several types of data available to speech understanding studies lead to different avenues for research, in this article we focus on real-time (RT) magnetic resonance imaging (MRI) as an emerging technique for studying speech production. We discuss the details and challenges of RT magnetic resonance (MR) acquisition and analysis, and modeling approaches that make use of MRI data for studying speech production.

Motivation: From an engineer's point of view, detailed knowledge about speech production gives rise to refined models for the speech signal that can be exploited for the design of powerful speech recognition, coding, and synthesis systems. From a linguist's point of view, speech research may be conducted to address open questions in the areas of phonetics and phonology, including: 1) what articulatory mechanisms explain the inter- and intra-subject variability of speech, 2) what aspects of vocal tract shaping are critically controlled by the brain for conveying meaning and emotion, and 3) how prosody affects articulatory timing. From other research points of view, speech production is important for understanding language acquisition and language disorders. All of these efforts require intimate knowledge of the speech generation mechanisms.

Different types of data are available to the speech researcher, from audio and video recordings of speech production to muscle activity data produced by electromyography, respiratory data from subglottal or intraoral pressure transduction, and images of the larynx obtained through video laryngoscopy. While vocal tract posture and movement can be investigated using a host of techniques, including X-ray (microbeam), cinefluorography, ultrasound, palatography, and electromagnetometry (EMA), RT-MRI has a particular advantage in that it produces complete views of the entire vocal tract, including the pharyngeal structures, in a safe and noninvasive manner. With RT-MRI, a midsagittal image of the vocal tract from the glottis to the lips can be acquired, in which the air-tissue boundaries of the anatomical components of interest to the speech researcher can be traced. These components, also known as articulators, are controlled by the brain during speech production and are used to change the shape of the vocal tract tube. In doing so, they also change the filter function for the excitation signal generated at the glottis and elsewhere along the airway. Hence the motion of the articulators shapes the sounds of speech and other human vocalizations. The signal processing challenges in studying these phenomena with RT-MRI lie in the fast acquisition of high-quality RT-MRI images, including simultaneous noise-robust audio recording [1], the subsequent detection of the relevant features from each image, and the analysis and modeling of the time-varying vocal tract shape, for the purpose of gaining a deeper understanding of the underlying principles that govern the speech production process.
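As a toy illustration of the feature-detection step mentioned at the end, the following sketch locates an air-tissue boundary along each image column by simple intensity thresholding (tissue bright, airway dark). Published RT-MRI pipelines use far more robust gradient- and model-based segmentation; the image, threshold, and geometry here are assumptions.

```python
# Hedged sketch: crude air-tissue boundary detection on a synthetic RT-MRI
# frame, where tissue is bright and the airway is a dark band.
import numpy as np

rng = np.random.default_rng(6)
H, W = 64, 64
image = np.full((H, W), 0.8) + 0.05 * rng.normal(size=(H, W))  # "tissue"
image[20:30, :] = 0.1 + 0.05 * rng.normal(size=(10, W))        # dark "airway"

def first_air_pixel(column: np.ndarray, threshold: float = 0.4) -> int:
    """Index of the first below-threshold (air) pixel along one column,
    i.e., a crude tissue-to-air boundary estimate."""
    below = np.flatnonzero(column < threshold)
    return int(below[0]) if below.size else -1

boundary = np.array([first_air_pixel(image[:, x]) for x in range(W)])
print(boundary[:8])   # ~20 for each column in this synthetic image
```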
Roderick A Suthers - One of the best experts on this subject based on the ideXlab platform.
-
Vocal tract articulation revisited: the case of the monk parakeet
The Journal of Experimental Biology, 2012
Co-Authors: Verena R Ohms, Gabriel J L Beckers, Carel Ten Cate, Roderick A Suthers
Abstract: Birdsong and human speech share many features with respect to vocal learning and development. However, the vocal production mechanisms have long been considered to be distinct. The vocal organ of songbirds is more complex than the human larynx, leading to the hypothesis that vocal variation in birdsong originates mainly at the sound source, while in humans it is primarily due to vocal tract filtering. However, several recent studies have indicated the importance of vocal tract articulators such as the beak and the oropharyngeal-esophageal cavity. In contrast to most other bird groups, parrots have a prominent tongue, raising the possibility that tongue movements may also be of significant importance in vocal production in parrots, but evidence is rare and observations often anecdotal. In the current study we used X-ray cinematographic imaging of naturally vocalizing monk parakeets (Myiopsitta monachus) to assess which articulators are possibly involved in vocal tract filtering in this species. We observed prominent tongue height changes, beak opening movements and tracheal length changes, which suggests that all of these components play an important role in modulating vocal tract resonance. Moreover, the observation of tracheal shortening as a vocal articulator in live birds has, to our knowledge, not been described before. We also found strong positive correlations between beak opening and amplitude, as well as between changes in tongue height and amplitude, in several types of vocalization. Our results suggest considerable differences between parrot and songbird vocal production, while at the same time the parrot's vocal articulation may more closely resemble human speech production in the sense that both make extensive use of the tongue as a vocal articulator.
-
Vocal tract filtering by lingual articulation in a parrot
Current Biology, 2004
Co-Authors: Gabriel J L Beckers, Brian S Nelson, Roderick A Suthers
Abstract: Human speech and bird vocalization are complex communicative behaviors with notable similarities in development and underlying mechanisms. However, there is an important difference between humans and birds in the way vocal complexity is generally produced. Human speech originates from independent modulatory actions of a sound source, e.g., the vibrating vocal folds, and an acoustic filter, formed by the resonances of the vocal tract (formants). Modulation in bird vocalization, in contrast, is thought to originate predominantly from the sound source, whereas the role of the resonance filter is only subsidiary in emphasizing the complex time-frequency patterns of the source. However, it has been suggested that, analogous to human speech production, tongue movements observed in parrot vocalizations modulate formant characteristics independently from the vocal source. As yet, direct evidence of such a causal relationship has been lacking. In five monk parakeets, Myiopsitta monachus, we replaced the vocal source, the syrinx, with a small speaker that generated a broadband sound, and we measured the effects of tongue placement on the sound emitted from the beak. The results show that tongue movements cause significant frequency changes in two formants and cause amplitude changes in all four formants present between 0.5 and 10 kHz. We suggest that lingual articulation may thus in part explain the well-known ability of parrots to mimic human speech and, even more intriguingly, may also underlie a speech-like formant system in natural parrot vocalizations.
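Formant frequencies and amplitudes like those measured here are commonly estimated by linear predictive coding (LPC); the abstract does not state the authors' exact measurement procedure, so the sketch below is a generic illustration on a synthetic signal with two known resonances.

```python
# Hedged sketch: estimate vocal-tract resonances (formants) by LPC, on a
# synthetic signal made by driving noise through two known resonators.
import numpy as np
from scipy.signal import lfilter

fs = 20000
rng = np.random.default_rng(7)

def resonator(f0, bw, fs):
    """Coefficients of a 2-pole resonator at f0 Hz with bandwidth bw Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f0 / fs
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

x = rng.normal(size=fs)                        # 1 s of broadband source noise
for f0 in (1200.0, 2800.0):                    # two assumed "formants"
    b, a = resonator(f0, 150.0, fs)
    x = lfilter(b, a, x)

# Autocorrelation-method LPC of order 8.
order = 8
ac = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]   # lags 0..8
R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
a_lpc = np.concatenate([[1.0], -np.linalg.solve(R, ac[1:order + 1])])

# Formants = angles of the complex pole pairs of the LPC polynomial.
roots = [z for z in np.roots(a_lpc) if z.imag > 0]
formants = sorted(np.angle(roots) * fs / (2 * np.pi))
print([round(f) for f in formants])   # ~1200 and ~2800 Hz, plus spurious poles
```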
-
Pure-tone birdsong by resonance filtering of harmonic overtones
Proceedings of the National Academy of Sciences of the United States of America, 2003
Co-Authors: Gabriel J L Beckers, Roderick A Suthers, Carel Ten Cate
Abstract: Pure-tone song is a common and widespread phenomenon in birds. The mechanistic origin of this type of phonation has been the subject of long-standing discussion. Currently, there are three hypotheses: (i) a vibrating valve in the avian vocal organ, the syrinx, generates a multifrequency harmonic source sound, which is filtered to a pure tone by a vocal tract filter (the "source-filter" model, analogous to human speech production); (ii) vocal tract resonances couple with a vibrating valve source, suppressing the normal production of harmonic overtones at this source (the "soprano" model, analogous to human soprano singing); (iii) pure-tone sound is produced as such by a sound-generating mechanism that is fundamentally different from a vibrating valve. Here we present direct evidence of a source-filter mechanism in the production of pure-tone birdsong. Using tracheal thermistors and air sac pressure cannulae, we recorded sound signals close to the syringeal sound source during spontaneous, pure-tone vocalizations of two species of turtledove. The results show that pure-tone dove vocalizations originate through filtering of a multifrequency harmonic sound source.
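Hypothesis (i) is easy to demonstrate numerically: passing a harmonic-rich source through a single narrow resonance leaves essentially one harmonic, i.e., a near pure tone. All parameter values below (fundamental, bandwidth, sampling rate) are illustrative assumptions, not measurements from the study.

```python
# Hedged sketch of the source-filter hypothesis: a multifrequency harmonic
# source, filtered by a narrow vocal-tract resonance, yields a near pure tone.
import numpy as np
from scipy.signal import lfilter

fs = 20000
t = np.arange(fs) / fs                           # 1 s of signal
f_source = 500.0                                 # fundamental of the "syrinx"
# Harmonic-rich source: sum of the first 10 harmonics, equal amplitudes.
source = sum(np.sin(2 * np.pi * f_source * k * t) for k in range(1, 11))

# Narrow resonator tuned to the fundamental (the "vocal tract filter").
r = np.exp(-np.pi * 30.0 / fs)                   # ~30 Hz bandwidth
theta = 2 * np.pi * f_source / fs
y = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], source)

# Compare the first two harmonics before and after filtering (1 Hz bins).
spec_in = np.abs(np.fft.rfft(source))
spec_out = np.abs(np.fft.rfft(y))
h1 = int(round(f_source))
for spec, name in [(spec_in, "source"), (spec_out, "filtered")]:
    print(f"{name}: H1/H2 amplitude ratio = {spec[h1] / spec[2 * h1]:.1f}")
```

The filtered signal's overtone energy is suppressed by orders of magnitude, which is the acoustic signature the authors sought by recording close to the syringeal source.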
Gabriel J L Beckers - One of the best experts on this subject based on the ideXlab platform.
-
Vocal tract articulation revisited: the case of the monk parakeet
The Journal of Experimental Biology, 2012
Co-Authors: Verena R Ohms, Carel Ten Cate, Roderick A Suthers (abstract given above, under Roderick A Suthers)
-
Vocal tract filtering by lingual articulation in a parrot
Current Biology, 2004
Co-Authors: Brian S Nelson, Roderick A Suthers (abstract given above, under Roderick A Suthers)
-
Pure-tone birdsong by resonance filtering of harmonic overtones
Proceedings of the National Academy of Sciences of the United States of America, 2003
Co-Authors: Roderick A Suthers, Carel Ten Cate (abstract given above, under Roderick A Suthers)
Adam C Lammert - One of the best experts on this subject based on the ideXlab platform.
-
Speed-accuracy tradeoffs in human speech production
PLOS ONE, 2018
Co-Authors: Shrikanth S Narayanan, Christine H Shadle, Thomas F Quatieri (abstract given above, under Shrikanth S Narayanan)
-
Spatial and temporal alignment of multimodal human speech production data: real-time imaging, flesh-point tracking and audio
International Conference on Acoustics, Speech and Signal Processing, 2013
Co-Authors: Jangwon Kim, Prasanta Kumar Ghosh, Shrikanth S Narayanan (abstract given above, under Shrikanth S Narayanan)
-
Data-driven analysis of realtime vocal tract MRI using correlated image regions
Conference of the International Speech Communication Association, 2010
Co-Authors: Michael Proctor, Shrikanth S Narayanan (abstract given above, under Shrikanth S Narayanan)
Conference of the International Speech Communication Association, 2010Co-Authors: Adam C Lammert, Michael Proctor, Shrikanth S NarayananAbstract:Realtime MRI provides useful data about the Human vocal tract, but also introduces many of the challenges of processing highdimensional image data. Intuitively, data reduction would proceed by finding the air-tissue boundaries in the images, and tracing an outline of the vocal tract. This approach is anatomically well-founded. We explore an alternative approach which is data-driven and has a complementary set of advantages. Our method directly examines pixel intensities. By analyzing how the pixels co-vary over time, we segment the image into spatially localized regions, in which the pixels are highly correlated with each other. Intensity variations in these correlated regions correspond to vocal tract constrictions, which are meaningful units of Speech Production. We show how these regions can be extracted entirely automatically, or with manual guidance. We present two examples and discuss its merits, including the opportunity to do direct data-driven time series modeling. Index Terms: Human Speech Production, phonetics, realtime mri, vocal tract, data reduction