Synthesized Speech

The Experts below are selected from a list of 6687 Experts worldwide ranked by the ideXlab platform

Keiichi Tokuda - One of the best experts on this subject based on the ideXlab platform.

  • the effect of neural networks in statistical parametric Speech synthesis
    International Conference on Acoustics Speech and Signal Processing, 2015
    Co-Authors: Kei Hashimoto, Yoshihiko Nankaku, Keiichiro Oura, Keiichi Tokuda
    Abstract:

    This paper investigates how to use neural networks in statistical parametric Speech synthesis. Recently, deep neural networks (DNNs) have been used for statistical parametric Speech synthesis. However, how DNNs should be used in statistical parametric Speech synthesis has not been studied thoroughly. The generation process of statistical parametric Speech synthesis based on generative models can be divided into several components, and those components can be represented by DNNs. In this paper, the effect of DNNs on each component is investigated by comparing DNNs with generative models. Experimental results show that the use of a DNN as the acoustic model is effective, and that parameter generation combined with a DNN improves the naturalness of Synthesized Speech.
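
    The acoustic-model component singled out in the experiments can be pictured as a feed-forward mapping from frame-level linguistic features to acoustic parameters. The following is a minimal, hypothetical PyTorch sketch of that idea; the layer sizes, feature dimensions, and random training data are placeholders and do not reflect the authors' configuration.

        # Hypothetical sketch: a feed-forward DNN standing in for the acoustic-model
        # component, mapping linguistic feature vectors to acoustic parameter vectors
        # (e.g. mel-cepstrum plus excitation parameters), trained with MSE.
        import torch
        import torch.nn as nn

        LINGUISTIC_DIM = 300   # assumed size of the frame-level linguistic feature vector
        ACOUSTIC_DIM = 60      # assumed size of the acoustic parameter vector per frame

        model = nn.Sequential(
            nn.Linear(LINGUISTIC_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, ACOUSTIC_DIM),
        )
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        # One training step on random tensors standing in for aligned
        # (linguistic, acoustic) frame pairs.
        x = torch.randn(32, LINGUISTIC_DIM)
        y = torch.randn(32, ACOUSTIC_DIM)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()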

  • minimum generation error training with direct log spectral distortion on lsps for hmm based Speech synthesis
    Conference of the International Speech Communication Association, 2008
    Co-Authors: Keiichi Tokuda
    Abstract:

    A minimum generation error (MGE) criterion has been proposed to solve the issues related to maximum likelihood (ML) based HMM training in HMM-based Speech synthesis. In this paper, we improve the MGE criterion by using a log spectral distortion (LSD), instead of the Euclidean distance, to define the generation error between the original and generated line spectral pair (LSP) coefficients. Moreover, we investigate the effect of different sampling strategies for calculating the integral of the LSD function. The experimental results show that using LSDs calculated by sampling at the LSPs achieved the best performance, and the quality of Synthesized Speech after MGE-LSD training was improved over the original MGE training.
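
    As a point of reference for the distortion measure discussed above, the sketch below computes a log spectral distortion between two magnitude spectra evaluated only at a chosen set of sampling frequencies, in contrast to a plain Euclidean distance between coefficient vectors. It is an illustrative stand-in, not the paper's exact formulation; the spectra and sampling bins are synthetic.

        # Illustrative only: LSD in dB between two magnitude spectra, sampled at
        # selected frequency bins (e.g. bins near the LSP frequencies).
        import numpy as np

        def log_spectral_distortion(spec_ref, spec_gen, sample_bins):
            ref = spec_ref[sample_bins]
            gen = spec_gen[sample_bins]
            diff_db = 20.0 * np.log10(ref / gen)
            return np.sqrt(np.mean(diff_db ** 2))

        rng = np.random.default_rng(0)
        spec_a = np.abs(rng.standard_normal(257)) + 1e-3      # toy magnitude spectrum
        spec_b = np.abs(rng.standard_normal(257)) + 1e-3
        lsp_bins = np.array([10, 35, 60, 90, 130, 180, 220])  # assumed sampling points
        print(log_spectral_distortion(spec_a, spec_b, lsp_bins))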

  • an excitation model for hmm based Speech synthesis based on residual modeling
    SSW, 2007
    Co-Authors: Ranniery Maia, Heiga Zen, Tomoki Toda, Yoshihiko Nankaku, Keiichi Tokuda
    Abstract:

    This paper describes a trainable excitation approach to eliminate the unnaturalness of HMM-based Speech synthesizers. During waveform generation, mixed excitation is constructed by state-dependent filtering of pulse trains and white noise sequences. In the training part, filters and pulse trains are jointly optimized through a procedure that resembles analysis-by-synthesis Speech coding algorithms, in which likelihood maximization of residual signals (derived from the same database used to train the HMM-based synthesizer) is pursued. Preliminary results show that the proposed excitation model eliminates the unnaturalness of Synthesized Speech, being comparable in quality to the best approaches reported thus far for eradicating the buzziness of HMM-based synthesizers.
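
    The construction step described above (state-dependent filtering of a pulse train and a white noise sequence, summed into a mixed excitation signal) can be sketched in a few lines of numpy/scipy. The FIR filter coefficients below are toy values standing in for the jointly trained state-dependent filters; this illustrates only the signal flow, not the paper's training procedure.

        # Minimal sketch: mixed excitation = filtered pulse train + filtered white noise.
        import numpy as np
        from scipy.signal import lfilter

        def mixed_excitation(f0, fs, n_samples, voiced_filt, noise_filt):
            pulses = np.zeros(n_samples)
            pulses[::int(fs / f0)] = 1.0                       # periodic pulse train at f0
            noise = np.random.default_rng(0).standard_normal(n_samples)
            return lfilter(voiced_filt, [1.0], pulses) + lfilter(noise_filt, [1.0], noise)

        # toy FIR filters standing in for the trained state-dependent filters
        exc = mixed_excitation(120.0, 16000, 16000,
                               voiced_filt=np.array([1.0, 0.5, 0.2]),
                               noise_filt=np.array([0.1, 0.05]))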

  • a hidden semi markov model based Speech synthesis system
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, Tadashi Kitamura
    Abstract:

    A statistical Speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of Speech are modeled simultaneously by context-dependent HMMs, and Speech parameter vector sequences are generated from the HMMs themselves. This system defines a Speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the Synthesized Speech sound less natural. In this paper, we propose a statistical Speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of Synthesized Speech.
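
    As background on the explicit state duration PDFs referred to above: with single Gaussian duration PDFs, the state durations used at synthesis time can be chosen in closed form under a total-length constraint. The sketch below uses the commonly cited rule d_k = m_k + rho * sigma_k^2, with rho set by the target number of frames; the per-state means, variances, and target length are toy values.

        # Sketch of closed-form duration determination from Gaussian duration PDFs
        # under a total-frame constraint (toy values, illustrative only).
        import numpy as np

        def determine_durations(means, variances, total_frames):
            rho = (total_frames - means.sum()) / variances.sum()
            return means + rho * variances

        means = np.array([12.0, 8.0, 20.0])     # per-state duration means (frames)
        variances = np.array([4.0, 2.0, 9.0])   # per-state duration variances
        print(determine_durations(means, variances, total_frames=45))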

  • hidden semi markov model based Speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based Speech synthesis system is proposed. In the hidden Markov model (HMM) based Speech synthesis system that we have previously proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesize Speech, the system constructs a sentence HMM corresponding to an arbitrarily given text, determines the state durations that maximize their probabilities, and then generates a Speech parameter vector sequence for the resulting state sequence. However, there is an inconsistency: although the Speech is Synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the Synthesized Speech.
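
    The inconsistency pointed out in this abstract can be made concrete with a toy comparison: an HMM's self-transition probability implies a geometric state duration distribution (monotonically decreasing, mode at one frame), while synthesis uses an explicit, e.g. Gaussian, duration PDF with a mode near its mean. The numbers below are illustrative placeholders.

        # Toy comparison of the duration PDF implied by an HMM self-transition
        # probability (geometric) versus an explicit Gaussian duration PDF.
        import numpy as np

        a_ii = 0.9                                           # assumed self-transition probability
        d = np.arange(1, 60)
        geometric = (a_ii ** (d - 1)) * (1 - a_ii)           # implicit HMM duration PDF
        mu, sigma = 10.0, 3.0                                # explicit Gaussian duration PDF
        gaussian = np.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        print("geometric mode:", d[np.argmax(geometric)])    # always 1 frame
        print("gaussian mode: ", d[np.argmax(gaussian)])     # around mu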

Honggoo Kang - One of the best experts on this subject based on the ideXlab platform.

  • excitnet vocoder a neural excitation model for parametric Speech synthesis systems
    European Signal Processing Conference, 2019
    Co-Authors: Eunwoo Song, Kyungguen Byun, Honggoo Kang
    Abstract:

    This paper proposes a WaveNet-based neural excitation model (ExcitNet) for statistical parametric Speech synthesis systems. Conventional WaveNet-based neural vocoding systems significantly improve the perceptual quality of Synthesized Speech by statistically generating a time sequence of Speech waveforms through an auto-regressive framework. However, they often suffer from noisy outputs because of the difficulties in capturing the complicated time-varying nature of Speech signals. To improve modeling efficiency, the proposed ExcitNet vocoder employs an adaptive inverse filter to decouple spectral components from the Speech signal. The residual component, i.e. excitation signal, is then trained and generated within the WaveNet framework. In this way, the quality of the Synthesized Speech signal can be further improved since the spectral component is well represented by a deep learning framework and, moreover, the residual component is efficiently generated by the WaveNet framework. Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.
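
    The decoupling step the abstract relies on, removing the spectral envelope with an adaptive inverse filter so that only the residual (excitation) remains to be modeled, can be illustrated with linear-prediction filtering. The second-order coefficients below are placeholders; in practice they would be estimated frame by frame.

        # Illustrative LP inverse-filtering / synthesis-filtering round trip:
        # speech -> residual (excitation) -> speech.
        import numpy as np
        from scipy.signal import lfilter

        def to_residual(speech, lpc):
            """Inverse-filter with A(z) = 1 + a1*z^-1 + ... to obtain the residual."""
            return lfilter(np.concatenate(([1.0], lpc)), [1.0], speech)

        def from_residual(residual, lpc):
            """Apply the synthesis filter 1/A(z) to reconstruct the signal."""
            return lfilter([1.0], np.concatenate(([1.0], lpc)), residual)

        rng = np.random.default_rng(0)
        speech = rng.standard_normal(16000)      # stand-in for a speech waveform
        lpc = np.array([-1.2, 0.5])              # toy predictor coefficients
        residual = to_residual(speech, lpc)
        print(np.allclose(speech, from_residual(residual, lpc)))   # True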

  • Emotional Speech Synthesis Based on Style Embedded Tacotron2 Framework
    2019 34th International Technical Conference on Circuits Systems Computers and Communications (ITC-CSCC), 2019
    Co-Authors: Ohsung Kwon, Inseon Jang, Honggoo Kang
    Abstract:

    In this paper, we propose a Speech synthesis system that effectively generates multiple types of emotional Speech using the concept of the global style token (GST), where the emotion-related style information is represented by an additional style embedding vector. Although the GST itself is not a new idea, it has not previously been utilized for an emotional Speech synthesis task. We explicitly combine the GST idea with the Tacotron2 framework to implement an emotional text-to-Speech system. The analysis results demonstrate that the proposed GST structure successfully transfers various types of emotional information to the Synthesized Speech. Subjective listening tests evaluating the naturalness and emotional expression of the Synthesized Speech are conducted to verify the superiority of the proposed algorithm.
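
    The style embedding mechanism described above can be sketched as a small bank of learned style tokens combined by attention into a single conditioning vector. The PyTorch module below is a hedged illustration of that idea only; the token count, dimensions, and dot-product attention are assumptions rather than the authors' exact configuration.

        # Hypothetical GST-style layer: attention over a learned token bank yields
        # one style embedding vector per utterance.
        import torch
        import torch.nn as nn

        class StyleTokenLayer(nn.Module):
            def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
                super().__init__()
                self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim))
                self.query = nn.Linear(ref_dim, token_dim)

            def forward(self, ref_embedding):
                # ref_embedding: (batch, ref_dim) summary of a reference utterance
                q = self.query(ref_embedding)                          # (batch, token_dim)
                weights = torch.softmax(q @ self.tokens.t(), dim=-1)   # (batch, n_tokens)
                return weights @ self.tokens                           # (batch, token_dim)

        style_vec = StyleTokenLayer()(torch.randn(4, 128))
        print(style_vec.shape)   # torch.Size([4, 256])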

  • improved time frequency trajectory excitation modeling for a statistical parametric Speech synthesis system
    International Conference on Acoustics Speech and Signal Processing, 2015
    Co-Authors: Eunwoo Song, Honggoo Kang
    Abstract:

    This paper proposes an improved time-frequency trajectory excitation (TFTE) modeling method for a statistical parametric Speech synthesis system. The proposed approach overcomes the dimensional variation problem of the training process caused by the inherent nature of the pitch-dependent analysis paradigm. By reducing the redundancies of the parameters using predicted average block coefficients (PABC), the proposed algorithm efficiently models excitation, even if its dimension is varied. Objective and subjective test results verify that the proposed algorithm provides not only robustness to the training process but also naturalness to the Synthesized Speech.
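
    The dimensional variation problem mentioned above arises because a pitch-dependent analysis yields one excitation segment per pitch period, whose length fs/F0 changes from frame to frame. The numpy sketch below illustrates only that problem and one naive fixed-dimension workaround (resampling each segment to a common length before averaging over a block); it is not the paper's PABC algorithm.

        # Naive stand-in for a fixed-dimension block representation of
        # variable-length, pitch-dependent excitation segments.
        import numpy as np

        def to_fixed_dim(segment, dim=64):
            xp = np.linspace(0.0, 1.0, len(segment))
            xq = np.linspace(0.0, 1.0, dim)
            return np.interp(xq, xp, segment)

        fs = 16000
        rng = np.random.default_rng(0)
        segments = [rng.standard_normal(int(fs / f0)) for f0 in (95.0, 120.0, 180.0)]
        block = np.mean([to_fixed_dim(s) for s in segments], axis=0)
        print(block.shape)   # (64,) fixed dimension despite varying F0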

  • waveform interpolation based Speech analysis synthesis for hmm based tts systems
    IEEE Signal Processing Letters, 2012
    Co-Authors: Chisang Jung, Youngsun Joo, Honggoo Kang
    Abstract:

    This letter proposes an HMM-based Text-to-Speech (TTS) system using waveform interpolation (WI) based Speech analysis and synthesis. The Synthesized Speech quality of the proposed system is significantly improved by adopting an enhanced excitation modeling technique. The decomposition of the characteristic waveform (CW) into a slowly evolving waveform (SEW) and a rapidly evolving waveform (REW) is efficient not only for excitation modeling but also for the training process of the HMMs. Objective and subjective test results verify the superiority of the proposed approach over conventional ones.
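
    The SEW/REW decomposition named above can be illustrated as low-pass filtering of the characteristic waveform surface along its evolution (frame) axis: the smoothed part is the SEW and the remainder is the REW, so the two always sum back to the CW. The moving-average filter and array sizes below are assumptions for illustration only.

        # Toy SEW/REW split of a (n_frames, n_phase) characteristic waveform surface.
        import numpy as np

        def decompose_cw(cw, win=5):
            kernel = np.ones(win) / win
            sew = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, cw)
            rew = cw - sew
            return sew, rew

        cw = np.random.default_rng(0).standard_normal((100, 80))
        sew, rew = decompose_cw(cw)
        print(np.allclose(cw, sew + rew))   # True: CW = SEW + REW by construction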

Takashi Masuko - One of the best experts on this subject based on the ideXlab platform.

  • a hidden semi markov model based Speech synthesis system
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, Tadashi Kitamura
    Abstract:

    A statistical Speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of Speech are modeled simultaneously by context-dependent HMMs, and Speech parameter vector sequences are generated from the HMMs themselves. This system defines a Speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the Synthesized Speech sound less natural. In this paper, we propose a statistical Speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of Synthesized Speech.

  • Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing
    IEICE Transactions on Information and Systems, 2005
    Co-Authors: Makoto Tachibana, Junichi Yamagishi, Takashi Masuko, Takao Kobayashi
    Abstract:

    This paper describes an approach to generating Speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based Speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based Speech synthesis framework. Then, to generate synthetic Speech with an intermediate style from the representative ones, we synthesize Speech from a model obtained by interpolating the representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough, in read Speech, and Synthesized Speech from models obtained by interpolating the models for all combinations of two styles. The results show that Speech Synthesized from an interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in Synthesized Speech by changing the interpolation ratio between the neutral and other representative styles. We also show that we can achieve style morphing in Speech synthesis, namely, changing the style smoothly from one representative style to another by gradually changing the interpolation ratio.
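
    The interpolation at the heart of the approach can be illustrated on Gaussian mean vectors: mixing the parameters of two representative style models with a ratio alpha gives an intermediate style, and sweeping alpha gives style morphing. Real systems also interpolate covariances and weights; the vectors below are toy values.

        # Toy mean-vector interpolation between two representative style models.
        import numpy as np

        def interpolate_means(mu_a, mu_b, alpha):
            """alpha = 0 -> style A, alpha = 1 -> style B, in between -> intermediate."""
            return (1.0 - alpha) * mu_a + alpha * mu_b

        mu_neutral = np.array([0.0, 1.0, -0.5])
        mu_joyful = np.array([0.8, 1.6, 0.2])
        for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
            print(alpha, interpolate_means(mu_neutral, mu_joyful, alpha))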

  • hidden semi markov model based Speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based Speech synthesis system is proposed. In the hidden Markov model (HMM) based Speech synthesis system that we have previously proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesize Speech, the system constructs a sentence HMM corresponding to an arbitrarily given text, determines the state durations that maximize their probabilities, and then generates a Speech parameter vector sequence for the resulting state sequence. However, there is an inconsistency: although the Speech is Synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the Synthesized Speech.

  • mixed excitation for hmm based Speech synthesis
    Conference of the International Speech Communication Association, 2001
    Co-Authors: Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    This paper describes improvements to the excitation model of an HMM-based text-to-Speech system. In our previous work, natural-sounding Speech could be Synthesized from trained HMMs; however, it had the typical quality of “vocoded Speech”, since the system used a traditional excitation model with either a periodic impulse train or white noise. In this paper, in order to reduce this synthetic quality, a mixed excitation model used in MELP is incorporated into the system. Excitation parameters used in mixed excitation are modeled by HMMs and generated from the HMMs by a parameter generation algorithm in the synthesis phase. The result of a listening test shows that the mixed excitation model significantly improves the quality of Synthesized Speech compared with the traditional excitation model.
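
    A MELP-style mixed excitation of the kind described above blends a pulse train and white noise per frequency band according to band voicing strengths. The sketch below shows only that signal construction with fixed toy voicing values; in the paper these excitation parameters are themselves modeled by HMMs and generated at synthesis time.

        # Illustrative band-wise mixing of pulse train and noise (toy parameters).
        import numpy as np
        from scipy.signal import butter, lfilter

        fs = 16000
        bands = [(100, 1000), (1000, 2000), (2000, 4000), (4000, 6000)]   # Hz
        voicing = [0.9, 0.7, 0.4, 0.1]       # toy per-band voicing strengths

        n = fs
        pulses = np.zeros(n)
        pulses[::int(fs / 120.0)] = 1.0      # 120 Hz pulse train
        noise = np.random.default_rng(0).standard_normal(n)

        excitation = np.zeros(n)
        for (lo, hi), v in zip(bands, voicing):
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            excitation += v * lfilter(b, a, pulses) + (1.0 - v) * lfilter(b, a, noise)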

  • voice characteristics conversion for hmm based Speech synthesis system
    International Conference on Acoustics Speech and Signal Processing, 1997
    Co-Authors: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai
    Abstract:

    We describe an approach to voice characteristics conversion for an HMM-based text-to-Speech synthesis system. Since this Speech synthesis system uses phoneme HMMs as Speech units, voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of Synthesized Speech to the target speaker, we applied the maximum a posteriori estimation and vector field smoothing (MAP/VFS) algorithm to the phoneme HMMs. Using 5 or 8 sentences as adaptation data, Speech samples Synthesized from a set of adapted tied triphone HMMs, which have approximately 2,000 distributions, are judged to be closer to the target speaker by 79.7% or 90.6%, respectively, in an ABX listening test.
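
    The MAP side of the MAP/VFS adaptation used above can be illustrated with the standard MAP update of a Gaussian mean: the speaker-independent (prior) mean is interpolated with the target speaker's data mean, weighted by the amount of adaptation data and a prior weight tau. The VFS smoothing step is omitted here, and all values are toy placeholders.

        # Toy MAP update of a Gaussian mean from a small amount of adaptation data.
        import numpy as np

        def map_adapt_mean(prior_mean, data, tau=10.0):
            n = len(data)
            return (tau * prior_mean + data.sum(axis=0)) / (tau + n)

        prior = np.array([0.0, 0.0])                         # speaker-independent mean
        frames = np.random.default_rng(0).normal(loc=[1.0, -0.5], scale=0.3, size=(20, 2))
        print(map_adapt_mean(prior, frames))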

Tadashi Kitamura - One of the best experts on this subject based on the ideXlab platform.

  • a hidden semi markov model based Speech synthesis system
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, Tadashi Kitamura
    Abstract:

    A statistical Speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of Speech are modeled simultaneously by context-dependent HMMs, and Speech parameter vector sequences are generated from the HMMs themselves. This system defines a Speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the Synthesized Speech sound less natural. In this paper, we propose a statistical Speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of Synthesized Speech.

  • hidden semi markov model based Speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based Speech synthesis system is proposed. In the hidden Markov model (HMM) based Speech synthesis system that we have previously proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesize Speech, the system constructs a sentence HMM corresponding to an arbitrarily given text, determines the state durations that maximize their probabilities, and then generates a Speech parameter vector sequence for the resulting state sequence. However, there is an inconsistency: although the Speech is Synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the Synthesized Speech.

  • mixed excitation for hmm based Speech synthesis
    Conference of the International Speech Communication Association, 2001
    Co-Authors: Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    This paper describes improvements to the excitation model of an HMM-based text-to-Speech system. In our previous work, natural-sounding Speech could be Synthesized from trained HMMs; however, it had the typical quality of “vocoded Speech”, since the system used a traditional excitation model with either a periodic impulse train or white noise. In this paper, in order to reduce this synthetic quality, a mixed excitation model used in MELP is incorporated into the system. Excitation parameters used in mixed excitation are modeled by HMMs and generated from the HMMs by a parameter generation algorithm in the synthesis phase. The result of a listening test shows that the mixed excitation model significantly improves the quality of Synthesized Speech compared with the traditional excitation model.

Takao Kobayashi - One of the best experts on this subject based on the ideXlab platform.

  • Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing
    IEICE Transactions on Information and Systems, 2005
    Co-Authors: Makoto Tachibana, Junichi Yamagishi, Takashi Masuko, Takao Kobayashi
    Abstract:

    This paper describes an approach to generating Speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based Speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based Speech synthesis framework. Then, to generate synthetic Speech with an intermediate style from the representative ones, we synthesize Speech from a model obtained by interpolating the representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough, in read Speech, and Synthesized Speech from models obtained by interpolating the models for all combinations of two styles. The results show that Speech Synthesized from an interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in Synthesized Speech by changing the interpolation ratio between the neutral and other representative styles. We also show that we can achieve style morphing in Speech synthesis, namely, changing the style smoothly from one representative style to another by gradually changing the interpolation ratio.

  • hidden semi markov model based Speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based Speech synthesis system is proposed. In the hidden Markov model (HMM) based Speech synthesis system that we have previously proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesize Speech, the system constructs a sentence HMM corresponding to an arbitrarily given text, determines the state durations that maximize their probabilities, and then generates a Speech parameter vector sequence for the resulting state sequence. However, there is an inconsistency: although the Speech is Synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the Synthesized Speech.

  • mixed excitation for hmm based Speech synthesis
    Conference of the International Speech Communication Association, 2001
    Co-Authors: Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    This paper describes improvements to the excitation model of an HMM-based text-to-Speech system. In our previous work, natural-sounding Speech could be Synthesized from trained HMMs; however, it had the typical quality of “vocoded Speech”, since the system used a traditional excitation model with either a periodic impulse train or white noise. In this paper, in order to reduce this synthetic quality, a mixed excitation model used in MELP is incorporated into the system. Excitation parameters used in mixed excitation are modeled by HMMs and generated from the HMMs by a parameter generation algorithm in the synthesis phase. The result of a listening test shows that the mixed excitation model significantly improves the quality of Synthesized Speech compared with the traditional excitation model.

  • voice characteristics conversion for hmm based Speech synthesis system
    International Conference on Acoustics Speech and Signal Processing, 1997
    Co-Authors: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai
    Abstract:

    We describe an approach to voice characteristics conversion for an HMM-based text-to-Speech synthesis system. Since this Speech synthesis system uses phoneme HMMs as Speech units, voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of Synthesized Speech to the target speaker, we applied the maximum a posteriori estimation and vector field smoothing (MAP/VFS) algorithm to the phoneme HMMs. Using 5 or 8 sentences as adaptation data, Speech samples Synthesized from a set of adapted tied triphone HMMs, which have approximately 2,000 distributions, are judged to be closer to the target speaker by 79.7% or 90.6%, respectively, in an ABX listening test.