State Duration

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 125,274 Experts worldwide, ranked by the ideXlab platform

Keiichi Tokuda - One of the best experts on this subject based on the ideXlab platform.

  • ICASSP - Separable lattice 2-D HMMS introducing State Duration control for recognition of images with various variations
    2013 IEEE International Conference on Acoustics Speech and Signal Processing, 2013
    Co-Authors: Takaya Makino, Yoshihiko Nankaku, Shinji Takaki, Kei Hashimoto, Keiichi Tokuda
    Abstract:

    In this paper, an extension of separable lattice HMMs (SL-HMMs) is described that introduces State Duration control for dealing with images with various variations. SL-HMMs are generative models that have size and location invariances based on the State transitions of HMMs. An extended model with the structure of hidden semi-Markov models (HSMMs), in which the State Duration probability is explicitly modeled by parametric distributions, is also proposed. However, in this model the State Durations of the Markov chains are assumed to be independent, whereas they should in fact be correlated. Therefore, in this paper we propose a novel model that solves this problem by introducing variables representing the correlation among the State Durations. Face recognition experiments show that the proposed model improved the recognition performance for images with size, locational, and rotational variations.

  • Face recognition based on separable lattice 2-D HMM with State Duration modeling
    International Conference on Acoustics Speech and Signal Processing, 2010
    Co-Authors: Yoshiaki Takahashi, Akira Tamamori, Yoshihiko Nankaku, Keiichi Tokuda
    Abstract:

    This paper describes an extension of separable lattice 2-D HMMs (SL-HMMs) using State Duration models for image recognition. SL-HMMs are generative models which have size and location invariances based on the State transitions of HMMs. However, the State Duration probability of HMMs decreases exponentially with increasing Duration, and therefore may not be appropriate for modeling image variations accurately. To overcome this problem, we employ the structure of hidden semi-Markov models (HSMMs), in which the State Duration probability is explicitly modeled by parametric distributions. Face recognition experiments show that the proposed model improved the performance for images with size and location variations.
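
The exponential-decay problem the abstract describes can be made concrete: a standard HMM's implicit State Duration distribution is geometric and always peaks at one frame, while an HSMM replaces it with an explicit parametric distribution. A minimal sketch (a discretized Gaussian is one common choice; the parameter values are illustrative, not from the paper):

```python
import math

def hmm_duration_pmf(a, d):
    """Probability of staying exactly d frames in a state with self-loop prob a.

    This is the implicit geometric duration PMF of a standard HMM:
    p(d) = a**(d-1) * (1-a), which is maximal at d = 1 and decays exponentially.
    """
    return a ** (d - 1) * (1.0 - a)

def gaussian_duration_pmf(mu, sigma, d, d_max=200):
    """Explicit HSMM-style duration PMF: Gaussian renormalized over 1..d_max."""
    dens = lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    z = sum(dens(x) for x in range(1, d_max + 1))
    return dens(d) / z

# The geometric PMF can never peak at a typical duration like d = 10;
# the explicit Gaussian PMF peaks near its mean mu.
g1, g10 = hmm_duration_pmf(0.9, 1), hmm_duration_pmf(0.9, 10)
h10 = gaussian_duration_pmf(10.0, 3.0, 10)
```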

  • ICASSP - Face recognition based on separable lattice 2-D HMM with State Duration modeling
    2010 IEEE International Conference on Acoustics Speech and Signal Processing, 2010
    Co-Authors: Yoshiaki Takahashi, Akira Tamamori, Yoshihiko Nankaku, Keiichi Tokuda
    Abstract:

    This paper describes an extension of separable lattice 2-D HMMs (SL-HMMs) using State Duration models for image recognition. SL-HMMs are generative models which have size and location invariances based on the State transitions of HMMs. However, the State Duration probability of HMMs decreases exponentially with increasing Duration, and therefore may not be appropriate for modeling image variations accurately. To overcome this problem, we employ the structure of hidden semi-Markov models (HSMMs), in which the State Duration probability is explicitly modeled by parametric distributions. Face recognition experiments show that the proposed model improved the performance for images with size and location variations.

  • ICASSP - Full covariance State Duration modeling for HMM-based speech synthesis
    2009 IEEE International Conference on Acoustics Speech and Signal Processing, 2009
    Co-Authors: Keiichi Tokuda, Li-rong Dai, Ren-hua Wang
    Abstract:

    This paper proposes a State Duration modeling method using a full covariance matrix for HMM-based speech synthesis. In this method, a full covariance matrix, instead of the conventional diagonal covariance matrix, is adopted in the multi-dimensional Gaussian distribution that models the State Durations of each context-dependent phoneme. At the synthesis stage, the State Durations are predicted using the clustered context-dependent distributions with full covariance matrices. Experimental results show that speech synthesized using full-covariance State Duration models is more natural than that of the conventional method when the speaking rate of the synthesized speech is changed.
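
The prediction step can be sketched as follows. Under the conventional rule each State Duration is set to d_k = m_k + ρσ_k², with ρ chosen to meet a target total length; with a full covariance matrix Σ this generalizes to d = μ + ρΣ1, a standard constrained-maximization result (this is an illustrative sketch with made-up numbers, not code or values from the paper):

```python
import numpy as np

def predict_durations(mu, Sigma, T):
    """Most likely state durations with mean mu, covariance Sigma, total length T.

    Maximizing the Gaussian density subject to sum(d) = T gives
    d = mu + rho * (Sigma @ 1), with rho solved from the length constraint.
    A diagonal Sigma recovers the conventional d_k = mu_k + rho * sigma_k**2 rule.
    """
    s = Sigma.sum(axis=1)            # Sigma @ 1: row sums of the covariance
    rho = (T - mu.sum()) / s.sum()   # solve sum(mu + rho * s) = T for rho
    return mu + rho * s

# Three states of a phoneme, stretched to a 25-frame target length.
mu = np.array([5.0, 8.0, 6.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 3.0, 0.4],
                  [0.0, 0.4, 1.5]])
d = predict_durations(mu, Sigma, T=25.0)
```

The off-diagonal terms let a stretch in one state pull correlated neighboring states along with it, which a diagonal model cannot express.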

  • A Fully Consistent Hidden Semi-Markov Model-Based Speech Recognition System
    IEICE Transactions on Information and Systems, 2008
    Co-Authors: Keiichiro Oura, Yoshihiko Nankaku, Heiga Zen, Akinobu Lee, Keiichi Tokuda
    Abstract:

    In a hidden Markov model (HMM), State Duration probabilities decrease exponentially with time, which fails to adequately represent the temporal structure of speech. One solution to this problem is to integrate State Duration probability distributions explicitly into the HMM. This form is known as a hidden semi-Markov model (HSMM). However, although a number of HSMM-based speech recognition systems have been proposed, they are not consistent, because various approximations were used in both training and decoding. By avoiding these approximations using a generalized forward-backward algorithm, a context-dependent Duration modeling technique, and weighted finite-State transducers (WFSTs), we construct a fully consistent HSMM-based speech recognition system. In a speaker-dependent continuous speech recognition experiment, our system achieved about 9.1% relative error reduction over the corresponding HMM-based system.
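
The generalized forward computation mentioned above sums over explicit segment durations instead of per-frame self-loops. A minimal toy sketch (a plain dynamic-programming HSMM forward pass, not the paper's WFST-based system):

```python
import numpy as np

def hsmm_forward(pi, A, dur_pmf, obs_lik):
    """Total likelihood of an observation sequence under a simple HSMM.

    pi: (N,) initial probs; A: (N, N) transition probs with zero diagonal
    (no self-loops: durations are explicit); dur_pmf: (N, D) duration PMFs;
    obs_lik: (T, N) per-frame observation likelihoods.
    Recursion: alpha[t][j] = sum_d inflow(t-d, j) * p_j(d) * prod of b_j
    over the last d frames, where inflow comes from pi or from other states.
    """
    T, N = obs_lik.shape
    D = dur_pmf.shape[1]
    alpha = np.zeros((T + 1, N))  # alpha[t][j]: prob of a segment in j ending at frame t
    for t in range(1, T + 1):
        for j in range(N):
            total = 0.0
            for d in range(1, min(D, t) + 1):
                seg = obs_lik[t - d:t, j].prod()        # d frames emitted in state j
                inflow = pi[j] if t - d == 0 else alpha[t - d] @ A[:, j]
                total += inflow * dur_pmf[j, d - 1] * seg
            alpha[t, j] = total
    return alpha[T].sum()

# Degenerate check: one state whose duration is exactly 3 frames, with
# unit frame likelihoods, must give total likelihood 1.
lik = hsmm_forward(np.array([1.0]), np.array([[0.0]]),
                   np.array([[0.0, 0.0, 1.0]]), np.ones((3, 1)))
```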

Tadashi Kitamura - One of the best experts on this subject based on the ideXlab platform.

  • A hidden semi-Markov model based speech synthesis system
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, Tadashi Kitamura
    Abstract:

    A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and Duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines a speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although State Duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit State Duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the State Duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of synthesized speech.

  • State Duration Modeling for HMM-Based Speech Synthesis
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Heiga Zen, Takao Kobayashi, Keiichi Tokuda, Takashi Masuko, Takayoshi Yoshimura, Tadashi Kitamura
    Abstract:

    This paper describes the explicit modeling of a State Duration's probability density function in HMM-based speech synthesis. We redefine, in a statistically correct manner, the probability of staying in a State for a time interval used to obtain the State Duration PDF and demonstrate improvements in the Duration of synthesized speech.

  • Hidden semi-Markov model based speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based speech synthesis system is proposed. In the hidden Markov model (HMM) based speech synthesis system we have previously proposed, rhythm and tempo are controlled by State Duration probability distributions modeled by single Gaussian distributions. To synthesize speech, the system constructs a sentence HMM corresponding to an arbitrarily given text and determines the State Durations maximizing their probabilities; a speech parameter vector sequence is then generated for the given State sequence. However, there is an inconsistency: although the speech is synthesized from HMMs with explicit State Duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit State Duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.

  • INTERSPEECH - Hidden semi-Markov model based speech synthesis.
    2004
    Co-Authors: Heiga Zen, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based speech synthesis system is proposed. In the hidden Markov model (HMM) based speech synthesis system we have previously proposed, rhythm and tempo are controlled by State Duration probability distributions modeled by single Gaussian distributions. To synthesize speech, the system constructs a sentence HMM corresponding to an arbitrarily given text and determines the State Durations maximizing their probabilities; a speech parameter vector sequence is then generated for the given State sequence. However, there is an inconsistency: although the speech is synthesized from HMMs with explicit State Duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit State Duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.

  • Simultaneous modeling of spectrum, pitch and Duration in HMM-based speech synthesis
    Conference of the International Speech Communication Association, 1999
    Co-Authors: Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In this paper, we describe an HMM-based speech synthesis system in which spectrum, pitch and State Duration are modeled simultaneously in a unified HMM framework. In the system, pitch and State Duration are modeled by multi-space probability distribution HMMs and multi-dimensional Gaussian distributions, respectively. The distributions for the spectral parameters, pitch parameters and State Durations are clustered independently using a decision-tree based context clustering technique. Synthetic speech is generated using a speech parameter generation algorithm from HMMs and a mel-cepstrum based vocoding technique. Through informal listening tests, we have confirmed that the proposed system successfully synthesizes natural-sounding speech which resembles the speaker in the training database.

Néstor Becerra Yoma - One of the best experts on this subject based on the ideXlab platform.

  • Estimating tonal prosodic discontinuities in Spanish using HMM
    Speech Communication, 2006
    Co-Authors: Alejandro Bassi, Néstor Becerra Yoma, Patricio Loncomilla
    Abstract:

    The tonal prosodic discontinuity estimation in Spanish is exhaustively modelled using HMM. Due to the high morphological complexity of Spanish, a relatively coarse grammatical categorization is tested on two sorts of texts (sentences from newspapers and a theatre play). The estimation of the type of discontinuity (falling or rising tones) at the boundary of intonation groups is assessed. The HMM approach is tested with: (a) modelling the observation probability with monograms, bigrams and the full-window probability; (b) State Duration modelling; (c) discriminative analysis of intermediate and final observation vectors; and (d) a penalization scheme in Viterbi decoding. The optimal configurations led to reductions of 3% or 5% in the detection error. Estimating the observation probability with monograms and bigrams leads to worse results than the ordinary full-window probability, although they provide better generalization. Nevertheless, the performance of the monogram and bigram approximations can be enhanced if applied in combination with State Duration constraints.

  • Packet-loss modelling in IP networks with State-Duration constraints
    IEE Proceedings - Communications, 2005
    Co-Authors: Néstor Becerra Yoma, Carlos Busso, Ismael Soto
    Abstract:

    A Gilbert–gamma topology is proposed to model packet-loss processes in UDP connections. The proposed topology introduces State Duration modelling with gamma distributions. When compared with the ordinary Gilbert model the proposed topology substantially improves the likelihood of observed packet-loss processes, and gives reductions as high as 70% in the subjective estimation of speech quality transmitted over IP networks. The results presented can be easily applied to other real-time applications such as audio and video streaming.
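
The modelling idea above can be sketched as a simulation: a two-state Gilbert-style chain whose sojourn times ("good" = packets received, "bad" = packets lost) are drawn from gamma distributions rather than the geometric durations implied by a plain Gilbert model. The parameter values below are illustrative, not fitted values from the paper:

```python
import random

def simulate_loss_trace(n_packets, shape, scale, seed=0):
    """Return a 0/1 loss indicator per packet (1 = lost).

    shape[s], scale[s]: gamma duration parameters for state s (0 = good, 1 = bad).
    Each sojourn length is gamma-distributed, rounded to at least one packet.
    """
    rng = random.Random(seed)
    state, trace = 0, []
    while len(trace) < n_packets:
        d = max(1, round(rng.gammavariate(shape[state], scale[state])))
        trace.extend([state] * d)
        state = 1 - state              # alternate good <-> bad
    return trace[:n_packets]

# Mean good sojourn = 4*10 = 40 packets, mean bad sojourn = 2*1.5 = 3 packets,
# so the long-run loss rate should be roughly 3 / 43.
trace = simulate_loss_trace(10000, shape=(4.0, 2.0), scale=(10.0, 1.5))
loss_rate = sum(trace) / len(trace)
```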

  • Robust speaker verification with State Duration modeling
    Speech Communication, 2002
    Co-Authors: Néstor Becerra Yoma, Tarciano Facco Pegoraro
    Abstract:

    This paper addresses the problem of State Duration modeling in the Viterbi algorithm in a text-dependent speaker verification task. The results presented in this paper suggest that temporal constraints can lead to reductions of 10% and 20% in the error rates with signals corrupted by noise at SNRs of 6 and 0 dB, respectively, and that accurate statistical modeling of the State Duration (e.g. with a gamma probability distribution) does not seem to be very relevant if maximal and minimal State Duration restrictions are imposed. In contrast, temporal restrictions do not seem to give any improvement in a speaker verification task with clean speech or a high SNR. It is also shown that State Duration constraints can easily be applied with the likelihood normalization metrics based on speaker-dependent temporal parameters. Finally, the results presented here show that word position-dependent State Duration parameters give no significant improvement over the word position-independent approach if the coarticulation effect between contiguous words is low.
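
The kind of hard minimal/maximal Duration restriction found to matter here can be sketched as a segment-based Viterbi search. This is an illustrative left-to-right toy formulation (states visited strictly in order, durations clipped to [d_min, d_max]), not the paper's verification system:

```python
import math
import numpy as np

def viterbi_min_max(log_obs, d_min, d_max):
    """Best left-to-right segmentation of T frames into N states in order.

    log_obs: (T, N) per-frame log-likelihoods. Each state must be occupied
    between d_min and d_max frames. Returns the best total log-likelihood.
    """
    T, N = log_obs.shape
    NEG = -math.inf
    # delta[t][j]: best score with state j ending exactly at frame t (exclusive).
    delta = np.full((T + 1, N), NEG)
    # Prefix sums let us score a whole segment in O(1).
    cum = np.vstack([np.zeros((1, N)), np.cumsum(log_obs, axis=0)])
    for j in range(N):
        for t in range(1, T + 1):
            for d in range(d_min, d_max + 1):
                if d > t:
                    break
                seg = cum[t, j] - cum[t - d, j]       # d frames spent in state j
                if j == 0:
                    if t - d != 0:                    # state 0 must start at frame 0
                        continue
                    prev = 0.0
                else:
                    prev = delta[t - d, j - 1]        # previous state ended here
                delta[t, j] = max(delta[t, j], prev + seg)
    return delta[T, N - 1]

# Two states over four frames: the best segmentation spends two frames in each.
log_obs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
best = viterbi_min_max(log_obs, d_min=1, d_max=3)
```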

  • MAP speaker adaptation of State Duration distributions for speech recognition
    IEEE Transactions on Speech and Audio Processing, 2002
    Co-Authors: Néstor Becerra Yoma, Jorge Silva Sánchez
    Abstract:

    This paper presents a framework for maximum a posteriori (MAP) speaker adaptation of State Duration distributions in hidden Markov models (HMM). Four key issues of MAP estimation, namely analysis and modeling of State Duration distributions, the choice of prior distribution, the specification of the parameters of the prior density and the evaluation of the MAP estimates, are tackled. Moreover, a comparison with an adaptation procedure based on maximum likelihood (ML) estimation is presented, and the problem of truncation of the State Duration distribution is addressed from the statistical point of view. The results shown in this paper suggest that the speaker adaptation of temporal restrictions substantially improves the accuracy of speaker-independent (SI) HMM with clean and noisy speech. The method requires a low computational load and a small number of adapting utterances, and can be useful to follow the dynamics of the speaking rate in speech recognition.
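
The MAP estimation idea can be illustrated with standard conjugate-prior algebra for a Gaussian Duration mean (a generic sketch, not the paper's exact formulation): the estimate shrinks the speaker's sample mean toward the speaker-independent prior mean, with a prior weight controlling how quickly it moves as adaptation data accumulates.

```python
def map_adapt_mean(mu0, tau, durations):
    """MAP estimate of a Gaussian duration mean from adaptation durations.

    mu0: speaker-independent prior mean; tau: prior weight (pseudo-counts);
    durations: observed state durations from the new speaker.
    """
    n = len(durations)
    xbar = sum(durations) / n
    # Convex combination: little data stays near mu0, much data tracks xbar.
    return (tau * mu0 + n * xbar) / (tau + n)

# Prior mean 10 frames, four adaptation tokens averaging 14.5 frames:
# the MAP estimate lands between the two.
adapted = map_adapt_mean(mu0=10.0, tau=5.0, durations=[14, 16, 15, 13])
```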

  • EUROSPEECH - Temporal constraints in viterbi alignment for speech recognition in noise.
    1999
    Co-Authors: Néstor Becerra Yoma, Lee Luan Ling, Sandra Dotto Stump
    Abstract:

    This paper addresses the problem of temporal constraints in the Viterbi algorithm using conditional transition probabilities. The results presented here suggest that, in a speaker-dependent, small-vocabulary task, the statistical modelling of State Durations is not relevant if maximum and minimum State Duration restrictions are imposed, and that truncated probability densities give better results than a metric previously proposed [1]. Finally, context-dependent and context-independent temporal restrictions are compared in a connected-word speech recognition task, and it is shown that the former leads to better results with the same computational load.

Takao Kobayashi - One of the best experts on this subject based on the ideXlab platform.

  • Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training
    IEICE Transactions on Information and Systems, 2007
    Co-Authors: Junichi Yamagishi, Takao Kobayashi
    Abstract:

    In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone Duration. For simultaneous adaptation of spectrum, F0 and phone Duration within the HMM framework, we need to transform not only the State output distributions corresponding to spectrum and F0 but also the Duration distributions corresponding to phone Duration. However, it is not straightforward to adapt the State Duration because the original HMM does not have explicit Duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit State Duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the State output and State Duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the State output and State Duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.

  • Hidden semi-Markov model based speech synthesis
    Conference of the International Speech Communication Association, 2004
    Co-Authors: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based speech synthesis system is proposed. In the hidden Markov model (HMM) based speech synthesis system we have previously proposed, rhythm and tempo are controlled by State Duration probability distributions modeled by single Gaussian distributions. To synthesize speech, the system constructs a sentence HMM corresponding to an arbitrarily given text and determines the State Durations maximizing their probabilities; a speech parameter vector sequence is then generated for the given State sequence. However, there is an inconsistency: although the speech is synthesized from HMMs with explicit State Duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit State Duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.

  • INTERSPEECH - MLLR adaptation for hidden semi-Markov model based speech synthesis.
    2004
    Co-Authors: Junichi Yamagishi, Takashi Masuko, Takao Kobayashi
    Abstract:

    This paper describes an extension of maximum likelihood linear regression (MLLR) to hidden semi-Markov models (HSMMs) and presents a phoneme/State Duration adaptation technique for an HMM-based speech synthesis system using HSMMs. The HSMM-based MLLR technique can realize the simultaneous adaptation of output distributions and State Duration distributions. We focus on the mathematical aspects of the technique and derive an algorithm for MLLR adaptation of HSMMs.
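
The core MLLR operation can be sketched as follows: an affine regression transform is applied to the Gaussian means (in HSMM-based adaptation, to both output means and Duration means), written mu' = W [1, mu] in extended-vector notation. The regression matrix below is a hypothetical example, not a value estimated from any data:

```python
import numpy as np

def mllr_transform(W, mu):
    """Apply an MLLR regression matrix W of shape (d, d+1) to a mean vector mu (d,).

    The extended vector xi = [1, mu] folds the bias column of W into a
    single matrix-vector product: mu' = W @ xi.
    """
    xi = np.concatenate(([1.0], mu))
    return W @ xi

# Identity rotation plus a bias of 0.5 shifts every mean component by 0.5,
# e.g. uniformly lengthening three state-duration means.
d = 3
W = np.hstack([np.full((d, 1), 0.5), np.eye(d)])
mu_adapted = mllr_transform(W, np.array([4.0, 7.0, 5.0]))
```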

  • INTERSPEECH - Hidden semi-Markov model based speech synthesis.
    2004
    Co-Authors: Heiga Zen, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In the present paper, a hidden semi-Markov model (HSMM) based speech synthesis system is proposed. In the hidden Markov model (HMM) based speech synthesis system we have previously proposed, rhythm and tempo are controlled by State Duration probability distributions modeled by single Gaussian distributions. To synthesize speech, the system constructs a sentence HMM corresponding to an arbitrarily given text and determines the State Durations maximizing their probabilities; a speech parameter vector sequence is then generated for the given State sequence. However, there is an inconsistency: although the speech is synthesized from HMMs with explicit State Duration probability distributions, the HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit State Duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.

  • Simultaneous modeling of spectrum, pitch and Duration in HMM-based speech synthesis
    Conference of the International Speech Communication Association, 1999
    Co-Authors: Takayoshi Yoshimura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Tadashi Kitamura
    Abstract:

    In this paper, we describe an HMM-based speech synthesis system in which spectrum, pitch and State Duration are modeled simultaneously in a unified HMM framework. In the system, pitch and State Duration are modeled by multi-space probability distribution HMMs and multi-dimensional Gaussian distributions, respectively. The distributions for the spectral parameters, pitch parameters and State Durations are clustered independently using a decision-tree based context clustering technique. Synthetic speech is generated using a speech parameter generation algorithm from HMMs and a mel-cepstrum based vocoding technique. Through informal listening tests, we have confirmed that the proposed system successfully synthesizes natural-sounding speech which resembles the speaker in the training database.

Mervyn Jack - One of the best experts on this subject based on the ideXlab platform.

  • Weighted Viterbi algorithm and State Duration modelling for speech recognition in noise
    International Conference on Acoustics Speech and Signal Processing, 1998
    Co-Authors: Néstor Becerra Yoma, Fergus Mcinnes, Mervyn Jack
    Abstract:

    A weighted Viterbi algorithm for HMMs is proposed and applied in combination with spectral subtraction and cepstral mean normalization to cancel both additive and convolutional noise in speech recognition. The weighted Viterbi approach is compared with, and used in combination with, State Duration modelling. The results presented show that a proper weight on the information provided by the static parameters can substantially reduce the error rate, and that the weighting procedure improves the robustness of the Viterbi algorithm more than the introduction of temporal constraints does, at a low computational load. Finally, it is shown that the weighted Viterbi algorithm in combination with temporal constraints leads to high recognition accuracy at moderate SNRs without the need for an accurate noise model.

  • ICASSP - Weighted Viterbi algorithm and State Duration modelling for speech recognition in noise
    Proceedings of the 1998 IEEE International Conference on Acoustics Speech and Signal Processing ICASSP '98 (Cat. No.98CH36181), 1998
    Co-Authors: Néstor Becerra Yoma, Fergus Mcinnes, Mervyn Jack
    Abstract:

    A weighted Viterbi algorithm for HMMs is proposed and applied in combination with spectral subtraction and cepstral mean normalization to cancel both additive and convolutional noise in speech recognition. The weighted Viterbi approach is compared with, and used in combination with, State Duration modelling. The results presented show that a proper weight on the information provided by the static parameters can substantially reduce the error rate, and that the weighting procedure improves the robustness of the Viterbi algorithm more than the introduction of temporal constraints does, at a low computational load. Finally, it is shown that the weighted Viterbi algorithm in combination with temporal constraints leads to high recognition accuracy at moderate SNRs without the need for an accurate noise model.
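
One simple way to realize the weighting idea above is to scale each frame's observation log-likelihood by a reliability weight in [0, 1], so that frames judged noisy contribute less to the path score; all weights equal to 1 recovers the ordinary Viterbi recursion. This is an illustrative sketch, not the paper's exact weighting scheme:

```python
import numpy as np

def weighted_viterbi(log_pi, log_A, log_obs, weights):
    """Best-path log score with per-frame observation weights.

    log_pi: (N,) initial log-probs; log_A: (N, N) transition log-probs;
    log_obs: (T, N) per-frame observation log-likelihoods;
    weights: (T,) reliability weights, one per frame.
    """
    T, N = log_obs.shape
    delta = log_pi + weights[0] * log_obs[0]
    for t in range(1, T):
        # Max over the previous state, then add the weighted frame score.
        delta = (delta[:, None] + log_A).max(axis=0) + weights[t] * log_obs[t]
    return delta.max()

# With unit weights this is plain Viterbi: the best path stays in state 0.
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_obs = np.log(np.array([[0.8, 0.2], [0.8, 0.2]]))
score = weighted_viterbi(log_pi, log_A, log_obs, np.array([1.0, 1.0]))
```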