The experts below are selected from a list of 2415 experts worldwide, ranked by the ideXlab platform.

Adriana Stan - One of the best experts on this subject based on the ideXlab platform.

  • Lightly supervised GMM VAD to use audiobook for speech synthesiser
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
    Co-Authors: Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A. J. Clark, Simon King, Adriana Stan
    Abstract:

    Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, they usually lack a correspondence between audio and text, and they are usually divided only into chapter-level units. In practice, the audio and text must be aligned before they can be used to build TTS synthesisers, but this alignment is time-consuming, involves manual labor and requires people skilled in speech processing. We have previously proposed using graphemes to align speech and text data automatically. This paper further integrates a lightly supervised voice activity detection (VAD) technique that detects sentence boundaries as a pre-processing step before the grapheme approach. The lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining these techniques, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Using subjective evaluations, we analyse how the grapheme-based aligner and/or the proposed VAD technique impact the quality of HMM-based speech synthesisers trained on audiobooks.
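
    A minimal sketch of the lightly supervised GMM VAD idea, assuming frame-level log-energy as the feature: fit a two-component Gaussian mixture, then use the hand-marked speech/silence stamps from the bootstrap portion (the first fifty sentences) to decide which component is speech. The function name, the feature choice and the agreement rule below are illustrative assumptions, not the authors' implementation.

    ```python
    # Sketch of a lightly supervised GMM VAD (illustrative assumptions:
    # log-energy features and a bootstrap-agreement rule; not the paper's
    # exact implementation).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_vad(log_energy, bootstrap_labels=None):
        """Label each frame as speech (1) or silence (0).

        log_energy      : per-frame log-energy values, shape (n_frames,)
        bootstrap_labels: optional 0/1 labels for the first frames, derived
                          from the hand-marked time stamps of the first
                          fifty sentences; used only to identify which
                          mixture component corresponds to speech.
        """
        x = np.asarray(log_energy, dtype=float).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
        comp = gmm.predict(x)

        if bootstrap_labels is not None:
            # Lightly supervised step: pick the component that agrees best
            # with the known speech frames in the bootstrap portion.
            bl = np.asarray(bootstrap_labels)
            agree = [np.mean(comp[: len(bl)][bl == 1] == c) for c in (0, 1)]
            speech_comp = int(np.argmax(agree))
        else:
            # Unsupervised fallback: the higher-mean (louder) component.
            speech_comp = int(np.argmax(gmm.means_.ravel()))

        return (comp == speech_comp).astype(int)
    ```

    Runs of silence frames longer than some duration threshold would then be hypothesised as sentence boundaries before the grapheme-based alignment step; the paper's actual features and thresholds may differ.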

Julie Carson-Berndsen - One of the best experts on this subject based on the ideXlab platform.

  • Synthesizing expressive speech from amateur audiobook recordings
    2012 IEEE Spoken Language Technology Workshop (SLT), 2012
    Co-Authors: Éva Székely, Tamás Gábor Csapó, Bálint Tóth, Péter Mihajlik, Julie Carson-Berndsen
    Abstract:

    Freely available audiobooks are a rich resource of expressive speech recordings that can be used for speech synthesis. Natural-sounding, expressive synthetic voices have previously been built from audiobooks containing large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter, and contain less expressive (less emphatic, less emotional, etc.) speech in terms of both quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges with a method consisting of minimally supervised techniques that align the text with the recorded speech, select groups of expressive speech segments, and build expressive voices for hidden Markov model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment in order to show that the method is generally applicable to most typical audiobooks widely available online.
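
    The abstract does not detail how the groups of expressive segments are selected; one minimally supervised way to realise such a grouping, sketched below, is to cluster utterances by coarse prosodic statistics with k-means. The feature set (mean and spread of f0 and energy) and the cluster count are assumed stand-ins, not the paper's actual method.

    ```python
    # Sketch: group utterances by expressivity via k-means on simple
    # prosodic statistics (assumed features; not the paper's exact method).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def group_by_expressivity(f0_tracks, energy_tracks, n_groups=3):
        """Assign each utterance to one of n_groups prosodic clusters.

        f0_tracks, energy_tracks: lists of per-frame f0 and energy arrays,
        one pair per utterance; f0 is assumed to be 0 in unvoiced frames.
        """
        feats = []
        for f0, en in zip(f0_tracks, energy_tracks):
            voiced = f0[f0 > 0]  # keep voiced frames only
            feats.append([voiced.mean(), voiced.std(), en.mean(), en.std()])
        feats = StandardScaler().fit_transform(np.array(feats))
        return KMeans(n_clusters=n_groups, n_init=10,
                      random_state=0).fit_predict(feats)
    ```

    Each resulting group could then serve as adaptation data for an HMM-based voice, with the expressiveness of the synthesised output judged in listening tests, as the abstract describes.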

  • Detecting a targeted voice style in an audiobook using voice quality features
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
    Co-Authors: Éva Székely, John Kane, Stefan Scherer, Christer Gobl, Julie Carson-Berndsen
    Abstract:

    Audiobooks are known to contain a variety of expressive speaking styles that arise when the narrator mimics a character in a story or expresses affect. Accurate modeling of this variety is essential for speech synthesis from an audiobook. Voice quality differences are important features characterizing these speaking styles; they are realized on a gradient and are often difficult to predict from the text. The present study uses a parameter characterizing breathy to tense voice qualities, derived from features of the wavelet transform, together with a measure for identifying creaky segments in an utterance. Based on these features, a combination of supervised and unsupervised classification is used to detect the regions in an audiobook where the speaker changes from their regular voice quality to a particular voice style. The target voice style candidates are selected based on the agreement of the supervised classifier ensemble output and evaluated in a listening test.
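
    The candidate-selection rule itself, keeping only segments on which the supervised classifier ensemble agrees, can be sketched as below. The specific classifiers and the stand-in feature vectors (e.g. per-segment means of a breathiness parameter and a creak measure) are assumptions; only the full-agreement rule comes from the abstract.

    ```python
    # Sketch: select target-voice-style candidates by unanimous agreement
    # of a supervised classifier ensemble (assumed classifiers/features).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    def agreed_candidates(X_train, y_train, X_segments):
        """Indices of segments every classifier labels as the target style.

        X_train, y_train: labelled voice-quality feature vectors
                          (0 = regular voice, 1 = target style)
        X_segments      : feature vectors for all segments in the audiobook
        """
        ensemble = [
            SVC(kernel="rbf"),
            LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100, random_state=0),
        ]
        votes = np.stack([clf.fit(X_train, y_train).predict(X_segments)
                          for clf in ensemble])
        # A segment is a candidate only when all classifiers predict 1.
        return np.where(votes.min(axis=0) == 1)[0]
    ```

    Requiring unanimity trades recall for precision, which fits the paper's setting: the surviving candidates are then screened in a listening test.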
