Text-to-Speech

Jiri Kruta - One of the best experts on this subject based on the ideXlab platform.

  • Design of Speech Corpus for Text-to-Speech Synthesis
    Conference of the International Speech Communication Association, 2001
    Co-Authors: Jindrich Matousek, Josef Psutka, Jiri Kruta
    Abstract:

    This paper deals with the design of a speech corpus for concatenation-based Text-to-Speech (TTS) synthesis. Several aspects of the design process are discussed. We propose a sentence selection algorithm that chooses, from a large text corpus, the sentences to be read and stored in the speech corpus; the selected sentences should cover all possible triphones with a sufficient number of occurrences. Notes on recording the speech are also given to ensure a high-quality speech corpus. Since some popular speech synthesis techniques require knowing the moments of principal excitation of the vocal tract during speech, pitch-mark detection also receives attention: several automatic pitch-mark detection methods are discussed and a comparison test is performed to determine the best one.
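
The sentence selection step described above is essentially a greedy coverage problem over triphones: repeatedly pick the sentence that adds the most still-missing triphone occurrences. The following is a minimal sketch of that idea, not the authors' actual algorithm; the phone sequences, the tokenization into triphones, and the target occurrence count are illustrative assumptions.

```python
from collections import Counter

def triphones(phones):
    """Return the triphones (sliding windows of 3 phones) in a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_sentences(candidates, min_occurrences=1):
    """Greedily pick sentences until every triphone seen in the candidate pool
    is covered at least `min_occurrences` times, or no sentence still helps.

    `candidates` maps a sentence id to its phone sequence, e.g. produced by a
    grapheme-to-phoneme front end (not shown here).
    """
    needed = Counter()
    for phones in candidates.values():
        for tri in set(triphones(phones)):
            needed[tri] = min_occurrences

    covered = Counter()
    selected = []
    remaining = dict(candidates)

    while remaining:
        # Score each sentence by how many still-missing triphone occurrences it adds.
        def gain(phones):
            return sum(min(needed[t] - covered[t], c)
                       for t, c in Counter(triphones(phones)).items()
                       if covered[t] < needed[t])

        best_id = max(remaining, key=lambda s: gain(remaining[s]))
        if gain(remaining[best_id]) == 0:
            break  # nothing left to gain; all reachable triphones are covered
        covered.update(triphones(remaining[best_id]))
        selected.append(best_id)
        del remaining[best_id]

    return selected

# Toy usage with made-up phone strings
pool = {
    "s1": "sil dh ax k ae t sil".split(),
    "s2": "sil k ae t s sil".split(),
    "s3": "sil dh ax d aa g sil".split(),
}
print(select_sentences(pool))
```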

Douglas O'Shaughnessy - One of the best experts on this subject based on the ideXlab platform.

  • Interacting with computers by voice: Automatic speech recognition and synthesis
    Proceedings of the IEEE, 2003
    Co-Authors: Douglas O'Shaughnessy
    Abstract:

    This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or Text-to-Speech (TTS)] performs the reverse task. ASR has been largely developed based on speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, with frequencies warped to the auditory Bark scale and the spectral representation simplified by decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic-semantic analysis. Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in small units in the database, for concatenation in the proper sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to generate the phonetic transcription. Intonation must be properly specified to approximate the naturalness of human speech. Modern synthesizers using large databases of stored spectral patterns or waveforms output highly intelligible synthetic speech, but naturalness remains to be improved.
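
As a rough illustration of the ASR front end described above (a short-time Fourier transform, a perceptual frequency warping, and decorrelation of the log spectrum into cepstral coefficients), here is a minimal NumPy sketch. It uses a mel-style warping as a stand-in for the Bark scale, and the frame length, filter count, and number of coefficients are arbitrary choices rather than values from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Mel warping, used here as a simple stand-in for the Bark scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def cepstral_features(signal, sr=16000, frame_len=400, hop=160,
                      n_filters=24, n_ceps=13):
    """Frame the signal, take a magnitude FFT, pool energies with a triangular
    filterbank on a warped frequency axis, then decorrelate the log energies
    with a DCT to obtain cepstral coefficients."""
    # Frame and window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Magnitude spectrum
    n_fft = 512
    spec = np.abs(np.fft.rfft(frames, n_fft))            # (n_frames, n_fft//2 + 1)

    # Triangular filterbank on the warped frequency axis
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(spec @ fbank.T + 1e-10)           # (n_frames, n_filters)

    # DCT-II decorrelates the filterbank channels into cepstral coefficients
    k = np.arange(n_filters)
    dct_mat = np.cos(np.pi * (k[:, None] + 0.5) * np.arange(n_ceps)[None, :] / n_filters)
    return log_energy @ dct_mat                            # (n_frames, n_ceps)

# Toy usage on white noise
feats = cepstral_features(np.random.randn(16000))
print(feats.shape)   # (n_frames, 13)
```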

Satoshi Nakamura - One of the best experts on this subject based on the ideXlab platform.

  • Machine Speech Chain
    Institute of Electrical and Electronics Engineers (IEEE), 2020
    Co-Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
    Abstract:

    Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and Text-to-Speech synthesis (TTS) has progressed more or less independently, without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop machine speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved performance over separate systems trained only with labeled data.
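
A minimal sketch of the closed-loop training idea follows. Real speech-chain systems use sequence-to-sequence ASR and TTS models; the single-layer modules, tensor shapes, and losses below are toy placeholders chosen only to make the loop runnable and to show how labeled pairs and the two unlabeled directions contribute to one objective.

```python
import torch
import torch.nn as nn

FEAT_DIM, VOCAB, HID = 80, 32, 64   # toy sizes, not from the paper

class ToyASR(nn.Module):
    """Maps speech features to per-frame token logits (stand-in for seq2seq ASR)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, VOCAB)
    def forward(self, speech):
        return self.net(speech)            # (batch, time, vocab)

class ToyTTS(nn.Module):
    """Maps token sequences back to speech features (stand-in for seq2seq TTS)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.out = nn.Linear(HID, FEAT_DIM)
    def forward(self, tokens):
        return self.out(self.emb(tokens))  # (batch, time, feat)

asr, tts = ToyASR(), ToyTTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)
ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

# Labeled batch: paired speech and text (ordinary supervised training)
speech_lab = torch.randn(2, 50, FEAT_DIM)
text_lab = torch.randint(0, VOCAB, (2, 50))

# Unlabeled batches: speech without text, and text without speech
speech_unlab = torch.randn(2, 50, FEAT_DIM)
text_unlab = torch.randint(0, VOCAB, (2, 50))

for step in range(3):
    opt.zero_grad()

    # Supervised losses on the paired data
    loss = ce(asr(speech_lab).transpose(1, 2), text_lab)
    loss = loss + l1(tts(text_lab), speech_lab)

    # Chain 1: ASR transcribes unlabeled speech, TTS tries to reconstruct it
    pseudo_text = asr(speech_unlab).argmax(-1)             # no gradient through argmax
    loss = loss + l1(tts(pseudo_text), speech_unlab)

    # Chain 2: TTS synthesizes speech from unlabeled text, ASR tries to recover the text
    pseudo_speech = tts(text_unlab)
    loss = loss + ce(asr(pseudo_speech).transpose(1, 2), text_unlab)

    loss.backward()
    opt.step()
    print(f"step {step}: loss {loss.item():.3f}")
```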

  • Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS
    Spoken Language Technology Workshop, 2018
    Co-Authors: Sahoko Nakayama, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
    Abstract:

    Code-switching (CS) speech, in which speakers alternate between two or more languages in the same utterance, often occurs in multilingual communities. This phenomenon poses challenges for spoken language technologies such as automatic speech recognition (ASR) and Text-to-Speech synthesis (TTS), since these systems must handle multilingual input. Code-switching text or code-switching speech can be found in social media, but parallel speech and transcriptions of code-switching data, which are suitable for training ASR and TTS, are generally unavailable. In this paper, we utilize a speech chain framework based on deep learning to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first train the ASR and TTS systems separately on parallel monolingual speech-text data (supervised learning) and then perform a speech chain with only code-switching text or only code-switching speech (unsupervised learning). Experimental results reveal that such a closed-loop architecture allows ASR and TTS to learn from each other and improve performance even without any parallel code-switching data.
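
The training recipe in this abstract can be read as a two-stage schedule: supervised learning on parallel monolingual data, then speech-chain (unsupervised) learning on code-switching data for which only one side is available. The stub functions and batch names below are hypothetical placeholders, not the authors' code; they only outline that schedule (the loss-level mechanics are sketched in the "Machine Speech Chain" entry above).

```python
# Hypothetical stand-ins illustrating the two-stage semi-supervised schedule.

def supervised_step(batch):
    """Ordinary supervised update: ASR on (speech, text) and TTS on (text, speech)."""
    print("supervised update on a parallel monolingual batch:", batch)

def chain_step_from_text(text_batch):
    """Unsupervised: TTS synthesizes speech from code-switching text; ASR must recover the text."""
    print("speech-chain update from code-switching text:", text_batch)

def chain_step_from_speech(speech_batch):
    """Unsupervised: ASR transcribes code-switching speech; TTS must reconstruct the audio."""
    print("speech-chain update from code-switching speech:", speech_batch)

# Stage 1: supervised learning on parallel monolingual Japanese and English data
monolingual_pairs = ["ja_batch_1", "en_batch_1"]          # placeholder batch names
for batch in monolingual_pairs:
    supervised_step(batch)

# Stage 2: speech-chain learning on code-switching data that has either
# text only or speech only, never both sides of a parallel pair
cs_text_only = ["cs_text_batch_1"]                        # placeholder batch names
cs_speech_only = ["cs_speech_batch_1"]
for text_batch in cs_text_only:
    chain_step_from_text(text_batch)
for speech_batch in cs_speech_only:
    chain_step_from_speech(speech_batch)
```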

  • Toward Multi-Features Emphasis Speech Translation: Assessment of Human Emphasis Production and Perception with Speech and Text Clues
    Spoken Language Technology Workshop, 2018
    Co-Authors: Sakriani Sakti, Satoshi Nakamura
    Abstract:

    Emphasis is an important factor of human speech that helps convey emotion and the focused information of an utterance. Recently, studies have been conducted on speech-to-speech translation that preserves the emphasis information from the source language to the target language. However, since different cultures express emphasis in different ways, considering only acoustic-to-acoustic emphasis translation may not always reflect users' experience. Emphasis can also be expressed at various levels in both text and speech, yet it remains unclear how we communicate emphasis in different forms (acoustic/linguistic) at different levels, and whether we can perceive the difference between different emphasis levels, or recognize the same emphasis level, across text and speech. In this paper, we analyze human perception of emphasis with both speech and text clues through crowd-sourced evaluations. The results indicate that although participants can distinguish among emphasis levels and perceive the same emphasis level between speech and text, many ambiguities still exist at certain emphasis levels. Our results thus provide insight into what needs to be handled during the emphasis translation process.

Jindrich Matousek - One of the best experts on this subject based on the ideXlab platform.

  • Design of Speech Corpus for Text-to-Speech Synthesis
    Conference of the International Speech Communication Association, 2001
    Co-Authors: Jindrich Matousek, Josef Psutka, Jiri Kruta
    Abstract:

    This paper deals with the design of a speech corpus for concatenation-based Text-to-Speech (TTS) synthesis. Several aspects of the design process are discussed. We propose a sentence selection algorithm that chooses, from a large text corpus, the sentences to be read and stored in the speech corpus; the selected sentences should cover all possible triphones with a sufficient number of occurrences. Notes on recording the speech are also given to ensure a high-quality speech corpus. Since some popular speech synthesis techniques require knowing the moments of principal excitation of the vocal tract during speech, pitch-mark detection also receives attention: several automatic pitch-mark detection methods are discussed and a comparison test is performed to determine the best one.

Shinji Watanabe - One of the best experts on this subject based on the ideXlab platform.

  • Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models
    International Conference on Acoustics Speech and Signal Processing, 2020
    Co-Authors: Katsuki Inoue, Sunao Hara, Masanobu Abe, Tomoki Hayashi, Ryuichi Yamamoto, Shinji Watanabe
    Abstract:

    Recently, end-to-end Text-to-Speech (TTS) models have achieved remarkable performance; however, they require a large amount of paired text and speech data for training. On the other hand, dozens of minutes of unpaired speech recordings for a target speaker can easily be collected without corresponding text data. To make use of such accessible data, the proposed method leverages the recent success of state-of-the-art end-to-end automatic speech recognition (ASR) systems and obtains corresponding transcriptions from pretrained ASR models. Although these models provide only text output rather than intermediate linguistic features such as phonemes, end-to-end TTS can be trained well on such raw text directly. Thus, the proposed method can greatly simplify the speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems. The experimental results show that our proposed method achieved performance comparable to a paired-data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.
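
The adaptation recipe can be read as a simple pseudo-labeling pipeline: transcribe the unpaired target-speaker recordings with a pretrained ASR model, then fine-tune a pretrained end-to-end TTS model on the resulting (text, speech) pairs. The sketch below only outlines that flow; the function names, file names, and placeholder objects are hypothetical and stand in for real pretrained models.

```python
def transcribe(pretrained_asr, wav):
    """Stand-in for running a pretrained end-to-end ASR model on one recording
    and returning its raw text hypothesis (a real system would decode here)."""
    return "placeholder transcription for " + wav

def fine_tune_tts(pretrained_tts, paired_data, steps=1000):
    """Stand-in for fine-tuning a pretrained end-to-end TTS model on the
    (text, speech) pairs produced below."""
    print(f"fine-tuning on {len(paired_data)} pseudo-paired utterances for {steps} steps")
    return pretrained_tts

# Unpaired recordings of the target speaker (no transcriptions available)
target_speaker_wavs = ["utt_001.wav", "utt_002.wav", "utt_003.wav"]

pretrained_asr = object()   # placeholder for a trained end-to-end ASR model
pretrained_tts = object()   # placeholder for a trained end-to-end TTS model

# 1) Pseudo-label: the pretrained ASR produces raw text for each recording.
pseudo_pairs = [(transcribe(pretrained_asr, wav), wav) for wav in target_speaker_wavs]

# 2) Adapt: fine-tune the TTS model on the pseudo-paired data. Because the TTS
#    is end-to-end, the raw ASR text output can be used directly, without a
#    separate phoneme or linguistic-feature annotation step.
adapted_tts = fine_tune_tts(pretrained_tts, pseudo_pairs)
```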