The Experts below are selected from a list of 73,620 Experts worldwide ranked by the ideXlab platform

Sebastien Marcell - One of the best experts on this subject based on the ideXlab platform.

  • Towards directly modeling raw Speech Signal for speaker verification using CNNs
    International Conference on Acoustics Speech and Signal Processing, 2018
    Co-Authors: Hannah Muckenhirn, Mathew Magimai Doss, Sebastien Marcell
    Abstract:

    Speaker verification systems traditionally extract and model cepstral features or filter-bank energies from the Speech Signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw Speech Signal for applications such as Speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach in which speaker-discriminative information is learned directly from the Speech Signal by: (a) first training a CNN-based speaker identification system that takes the raw Speech Signal as input and learns to classify speakers (unknown to the speaker verification system); and then (b) building a speaker detector for each speaker in the speaker verification system by replacing the output layer of the speaker identification system with two outputs (genuine, impostor) and adapting the system in a discriminative manner with enrollment Speech of the speaker and impostor Speech data. Our investigations on the Voxforge database show that this approach can yield systems competitive with state-of-the-art systems. An analysis of the filters in the first convolution layer shows that they emphasize information in low-frequency regions (below 1000 Hz) and implicitly learn to model fundamental frequency information in the Speech Signal for speaker discrimination.
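The two-stage recipe in the abstract (a speaker-identification CNN over the raw waveform, then a per-speaker genuine/impostor head) can be sketched as a toy forward pass. Everything below is an illustrative assumption, not the authors' architecture: the filter count, filter length, stride, random weights and the synthetic input signal are all made up for demonstration.

```python
# Minimal sketch of the two-stage raw-waveform approach (illustrative sizes).
import math
import random

random.seed(0)

def conv1d(signal, filters, stride=1):
    """Valid 1-D convolution of one signal with a bank of filters."""
    out = []
    for f in filters:
        flen = len(f)
        row = [sum(signal[i + j] * f[j] for j in range(flen))
               for i in range(0, len(signal) - flen + 1, stride)]
        out.append(row)
    return out

def relu_maxpool(feature_maps):
    """ReLU followed by global max pooling: one value per filter."""
    return [max(0.0, max(row)) for row in feature_maps]

def linear(x, weights):
    """Fully connected output layer: one score per class."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]

# (a) Speaker identification: classify among background speakers
#     that are unknown to the verification system.
n_filters, flen, n_id_speakers = 4, 8, 10
filters = [[random.uniform(-1, 1) for _ in range(flen)] for _ in range(n_filters)]
id_head = [[random.uniform(-1, 1) for _ in range(n_filters)]
           for _ in range(n_id_speakers)]

# A 120 Hz tone at 8 kHz stands in for a raw Speech Signal frame.
signal = [math.sin(2 * math.pi * 120 * t / 8000) for t in range(400)]
embedding = relu_maxpool(conv1d(signal, filters, stride=4))
id_scores = linear(embedding, id_head)  # one score per background speaker

# (b) Speaker detector: replace the output layer with two outputs
#     (genuine, impostor); only this head would be adapted per speaker
#     using enrollment and impostor Speech data.
det_head = [[random.uniform(-1, 1) for _ in range(n_filters)] for _ in range(2)]
genuine_score, impostor_score = linear(embedding, det_head)
decision = "genuine" if genuine_score > impostor_score else "impostor"
```

In the real system both the convolutional filters and the new head would be adapted discriminatively; the sketch only shows the layer-swapping structure.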

Yang Zhen - One of the best experts on this subject based on the ideXlab platform.

  • PCA-Based Compressed Speech Signal Sensing
    Signal Processing, 2011
    Co-Authors: Ji Yun-yun, Yang Zhen
    Abstract:

    Compressed Sensing theory is a research focus that has risen in recent years. Before Compressed Sensing theory can be applied to the Speech Signal processing field, a suitable sparse representation for Speech Signals must be found. Based on principal component analysis and a large number of block Signals, features of the Speech Signal are extracted. Moreover, according to Compressed Sensing theory, the method of constructing the dictionary and the characteristics of the Speech Signal, this paper presents a redundant dictionary, the concatenation of several orthogonal bases, for the sparse representation of the Speech Signal. To describe the advantages of such a sparse representation more objectively, the average Gini index is applied to compare the sparsity of Speech Signals in the DCT basis, the Gabor basis and this redundant dictionary, and male and female as well as voiced and unvoiced Speech Signals are analysed. Simulation results show that, for both male and female and both voiced and unvoiced Speech Signals, the sparsity of the Speech Signal in this redundant dictionary is substantially better than in the DCT basis and close to that in the Gabor basis. However, since it has far fewer atoms than the Gabor basis and low computational complexity and storage requirements, this redundant dictionary is more applicable to the Speech Signal than the Gabor basis.
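The Gini-index comparison in the abstract can be illustrated on a single frame: the Gini sparsity measure (1 means maximally sparse, 0 means flat) is computed for the raw samples and for their DCT-II coefficients. This is a minimal sketch, not the paper's experiment: the toy harmonic frame, its length, and the choice to compare only against the DCT are assumptions made here for demonstration.

```python
# Gini-index sparsity of a frame in the time domain vs. the DCT domain.
import math

def dct2(x):
    """Orthonormal DCT-II of a real sequence (direct O(N^2) form)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def gini_index(c):
    """Gini sparsity index of a coefficient vector (higher = sparser)."""
    a = sorted(abs(v) for v in c)   # ascending by magnitude
    n, l1 = len(a), sum(a)
    return 1.0 - 2.0 * sum((a[k] / l1) * ((n - (k + 1) + 0.5) / n)
                           for k in range(n))

# A voiced-like frame: a few harmonics, hence sparse in the DCT domain.
n = 64
frame = [math.cos(2 * math.pi * 4 * t / n)
         + 0.5 * math.cos(2 * math.pi * 8 * t / n)
         for t in range(n)]

g_time = gini_index(frame)        # sparsity of the raw samples
g_dct = gini_index(dct2(frame))   # sparsity of the DCT coefficients
```

Averaging such per-frame Gini indices over many frames, and repeating the computation for the Gabor basis and the proposed redundant dictionary, gives the kind of comparison the abstract reports.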

Sunil Kumar - One of the best experts on this subject based on the ideXlab platform.

  • Voice/non-voice detection using phase of zero frequency filtered Speech Signal
    Speech Communication, 2016
    Co-Authors: Sunil Kumar
    Abstract:

    Voice/non-voice detection refers to the task of detecting the presence or absence of vocal fold activity in the Speech Signal. Most existing state-of-the-art methods depend exclusively on the amplitude of the Signal in either the time or the frequency domain, and their performance degrades significantly for weakly voiced segments, laryngeal transitions and noisy segments of Speech. In this paper, we propose a robust method for detecting voice/non-voice regions in the Speech Signal based on the harmonics of the phase of the source Signal. Here, the source Signal is derived by removing the effect of vocal tract resonances from the Speech Signal using zero frequency filtering (ZFF). The experimental results demonstrate the robustness of the proposed method for accurate detection of voiced/non-voiced regions in the Speech Signal under adverse conditions. The performance of the proposed method is compared with a state-of-the-art method based on the sum of residual harmonics, and with three well-known standard voice activity detection (VAD) algorithms: G.729B, adaptive multi-rate VAD option 1 (AMR1) and adaptive multi-rate VAD option 2 (AMR2).
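The zero frequency filtering step that derives the source Signal can be sketched as follows: the differenced signal is passed twice through a zero-frequency resonator, and the resulting polynomial trend is removed by repeated local-mean subtraction. The sampling rate, window length, number of trend-removal passes and the synthetic 100 Hz input are illustrative assumptions; the paper's phase-harmonics analysis on top of this signal is not shown.

```python
# Sketch of zero frequency filtering (ZFF) of a Speech Signal.
import math

def remove_trend(x, w):
    """Subtract the local mean over a window of +/- w samples."""
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - w), min(len(x), n + w + 1)
        out.append(x[n] - sum(x[lo:hi]) / (hi - lo))
    return out

def zff(signal, fs, window_ms=10.0, trend_passes=3):
    # First difference removes any DC offset before filtering.
    x = [signal[0]] + [signal[n] - signal[n - 1]
                       for n in range(1, len(signal))]
    # Two cascaded zero-frequency resonators:
    #   y[n] = x[n] + 2*y[n-1] - y[n-2]
    for _ in range(2):
        y = [0.0, 0.0]
        for n in range(len(x)):
            y.append(x[n] + 2.0 * y[-1] - y[-2])
        x = y[2:]
    # The resonator output grows polynomially; repeated local-mean
    # subtraction over roughly a pitch period removes that trend.
    w = int(fs * window_ms / 1000.0)
    for _ in range(trend_passes):
        x = remove_trend(x, w)
    return x

fs = 8000
# A 100 Hz sinusoid stands in for a voiced segment (0.1 s long).
voiced = [math.sin(2 * math.pi * 100 * t / fs) for t in range(fs // 10)]
z = zff(voiced, fs)

# Positive-going zero crossings of the ZFF signal indicate instants of
# significant excitation; their spacing should roughly track the pitch
# period in voiced regions.
epochs = [n for n in range(1, len(z)) if z[n - 1] < 0 <= z[n]]
```

Voiced regions show strong, regular oscillation in the ZFF signal, while non-voice regions do not, which is what makes it a useful source representation for the detector described above.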

Hannah Muckenhirn - One of the best experts on this subject based on the ideXlab platform.

  • Towards directly modeling raw Speech Signal for speaker verification using CNNs
    International Conference on Acoustics Speech and Signal Processing, 2018
    Co-Authors: Hannah Muckenhirn, Mathew Magimai Doss, Sebastien Marcell
    Abstract:

    Speaker verification systems traditionally extract and model cepstral features or filter-bank energies from the Speech Signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw Speech Signal for applications such as Speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach in which speaker-discriminative information is learned directly from the Speech Signal by: (a) first training a CNN-based speaker identification system that takes the raw Speech Signal as input and learns to classify speakers (unknown to the speaker verification system); and then (b) building a speaker detector for each speaker in the speaker verification system by replacing the output layer of the speaker identification system with two outputs (genuine, impostor) and adapting the system in a discriminative manner with enrollment Speech of the speaker and impostor Speech data. Our investigations on the Voxforge database show that this approach can yield systems competitive with state-of-the-art systems. An analysis of the filters in the first convolution layer shows that they emphasize information in low-frequency regions (below 1000 Hz) and implicitly learn to model fundamental frequency information in the Speech Signal for speaker discrimination.

Ji Yun-yun - One of the best experts on this subject based on the ideXlab platform.

  • PCA-Based Compressed Speech Signal Sensing
    Signal Processing, 2011
    Co-Authors: Ji Yun-yun, Yang Zhen
    Abstract:

    Compressed Sensing theory is a research focus that has risen in recent years. Before Compressed Sensing theory can be applied to the Speech Signal processing field, a suitable sparse representation for Speech Signals must be found. Based on principal component analysis and a large number of block Signals, features of the Speech Signal are extracted. Moreover, according to Compressed Sensing theory, the method of constructing the dictionary and the characteristics of the Speech Signal, this paper presents a redundant dictionary, the concatenation of several orthogonal bases, for the sparse representation of the Speech Signal. To describe the advantages of such a sparse representation more objectively, the average Gini index is applied to compare the sparsity of Speech Signals in the DCT basis, the Gabor basis and this redundant dictionary, and male and female as well as voiced and unvoiced Speech Signals are analysed. Simulation results show that, for both male and female and both voiced and unvoiced Speech Signals, the sparsity of the Speech Signal in this redundant dictionary is substantially better than in the DCT basis and close to that in the Gabor basis. However, since it has far fewer atoms than the Gabor basis and low computational complexity and storage requirements, this redundant dictionary is more applicable to the Speech Signal than the Gabor basis.