Sentence Boundary


The experts below are selected from a list of 3,096 experts worldwide, ranked by the ideXlab platform.

Yang Liu - One of the best experts on this subject based on the ideXlab platform.

  • A Non-DNN Feature Engineering Approach to Dependency Parsing: FBAML at CoNLL 2017 Shared Task
    Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, August 3-4, 2017
    Co-Authors: Xian Qian, Yang Liu
    Abstract:

    For this year’s multilingual dependency parsing shared task, we developed a pipeline system that uses a variety of features for each of its components. Unlike the recently popular deep learning approaches, which learn low-dimensional dense features with non-linear classifiers, our system uses structured linear classifiers to learn millions of sparse features. Specifically, we trained a linear classifier for Sentence Boundary prediction and linear-chain conditional random fields (CRFs) for tokenization, part-of-speech tagging, and morphological analysis. A second-order graph-based parser learns the tree structure (without relations), and a linear tree CRF then assigns relations to the dependencies in the tree. Our system achieves reasonable performance: a 67.87% official macro-averaged F1 score.
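
    To make the sparse-feature, structured-linear-classifier idea concrete, the sketch below trains a linear-chain CRF for Sentence Boundary prediction over hand-crafted indicator features. The library choice (sklearn-crfsuite), the feature set, and the B/O label scheme are illustrative assumptions, not the authors' actual toolkit.

```python
# Minimal sketch: linear-chain CRF over sparse, hand-engineered features for
# Sentence Boundary prediction. Library (sklearn-crfsuite), features, and
# labels are illustrative assumptions, not the shared-task system itself.
import sklearn_crfsuite

def token_features(tokens, i):
    """Sparse indicator features for the i-th token of a sequence."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:],
        "is_capitalized": w[:1].isupper(),
        "ends_with_period": w.endswith("."),
        "next_is_capitalized": tokens[i + 1][:1].isupper() if i + 1 < len(tokens) else False,
    }

def featurize(sequences):
    """List of token lists -> list of per-token feature-dict lists."""
    return [[token_features(toks, i) for i in range(len(toks))] for toks in sequences]

# Toy training data: label "B" marks a token that ends a sentence, "O" otherwise.
train_tokens = [["Mr.", "Smith", "arrived", ".", "He", "sat", "down", "."]]
train_labels = [["O", "O", "O", "B", "O", "O", "O", "B"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(featurize(train_tokens), train_labels)
print(crf.predict(featurize([["She", "left", ".", "Then", "rain", "fell", "."]])))
```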

  • A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
    Computer Speech & Language, 2006
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Nitesh V Chawla, Mary P Harper
    Abstract:

    Enriching speech recognition output with Sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect Sentence boundaries that uses both prosodic and textual information. Since there are more non-Sentence boundaries than Sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST Sentence Boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, while an ensemble of multiple classifiers trained on different downsampled training sets achieves slightly poorer performance but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristic (ROC) curves or the area under the curve (AUC), the sampling approaches outperform the original training set. This observation is important if the Sentence Boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the Sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may depend on the task, the classifiers, and the knowledge combination approach.
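
    As a rough illustration of the downsampling-plus-bagging idea described above (not the paper's implementation), the sketch below trains an ensemble of decision trees, each on a class-balanced subset of an imbalanced dataset, and averages their posterior estimates. The synthetic features, class sizes, and ensemble size are made up for the example.

```python
# Illustrative sketch: bagging decision trees over downsampled, class-balanced
# subsets of imbalanced data and averaging the resulting boundary posteriors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

rng = np.random.RandomState(0)
# Synthetic stand-in for prosodic features: 2000 non-boundaries (0), 200 boundaries (1).
X = rng.randn(2200, 5)
y = np.array([0] * 2000 + [1] * 200)

def downsampled_bag(X, y, n_models=10, seed=0):
    """Train one tree per balanced subset: all minority samples plus an
    equal-size random draw from the majority class."""
    pos, neg = X[y == 1], X[y == 0]
    models = []
    for m in range(n_models):
        neg_sub = resample(neg, n_samples=len(pos), random_state=seed + m)
        Xb = np.vstack([pos, neg_sub])
        yb = np.array([1] * len(pos) + [0] * len(neg_sub))
        models.append(DecisionTreeClassifier(random_state=seed + m).fit(Xb, yb))
    return models

def ensemble_posterior(models, X):
    """Average the per-tree posterior P(boundary | features) over the bag."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

bag = downsampled_bag(X, y)
print(ensemble_posterior(bag, X[:5]))
```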

  • Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech
    Empirical Methods in Natural Language Processing, 2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Mary P Harper
    Abstract:

    We compare and contrast two different models for detecting Sentence-like units in continuous speech. The first approach uses hidden Markov sequence models based on N-grams and maximum likelihood estimation, and employs model interpolation to combine different representations of the data. The second approach models the posterior probabilities of the target classes; it is discriminative and integrates multiple knowledge sources in the maximum entropy (maxent) framework. Both models combine lexical, syntactic, and prosodic information. We develop a technique for integrating pretrained probability models into the maxent framework, and show that this approach can improve on an HMM-based state-of-the-art system for the Sentence Boundary detection task. An even more substantial improvement is obtained by combining the posterior probabilities of the two systems.
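
    A minimal sketch of the final combination step, assuming a simple linear interpolation of the two systems' per-boundary posteriors; the interpolation weight and decision threshold are illustrative, and the paper's actual combination scheme may differ.

```python
# Hedged sketch: combine per-boundary posteriors from two systems (e.g. an
# HMM-based and a maxent-based model) by linear interpolation. The weight
# lam is a tunable assumption, not a value taken from the paper.
import numpy as np

def combine_posteriors(p_hmm, p_maxent, lam=0.5):
    """Return lam * P_hmm(boundary) + (1 - lam) * P_maxent(boundary) per position."""
    p_hmm, p_maxent = np.asarray(p_hmm), np.asarray(p_maxent)
    return lam * p_hmm + (1.0 - lam) * p_maxent

# Example: posteriors for five candidate boundary positions.
p_hmm = [0.10, 0.85, 0.40, 0.05, 0.90]
p_maxent = [0.20, 0.70, 0.60, 0.10, 0.95]
combined = combine_posteriors(p_hmm, p_maxent, lam=0.5)
decisions = combined >= 0.5   # declare a boundary where the combined posterior passes the threshold
print(combined, decisions)
```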

  • The ICSI-SRI-UW RT-04 Structural Metadata Extraction System
    2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Barbara Peskin, Mary Harper
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe the ICSI-SRI-UW metadata detection system in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of textual knowledge sources (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. In addition to our previous HMM approach, we investigate using a maximum entropy (Maxent) and a conditional random field (CRF) approach for various tasks. Results using these techniques are presented for the 2004 NIST Rich Transcription metadata tasks.

  • The ICSI-SRI-UW Metadata Extraction System
    Conference of the International Speech Communication Association, 2004
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Mary P Harper, Dustin Hillard, Mari Ostendorf, B. Peskin
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of “metadata” (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the Sentence Boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.

A. Stolcke - One of the best experts on this subject based on the ideXlab platform.

  • A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
    Computer Speech & Language, 2006
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Nitesh V Chawla, Mary P Harper
    Abstract:

    Enriching speech recognition output with Sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect Sentence boundaries that uses both prosodic and textual information. Since there are more non-Sentence boundaries than Sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST Sentence Boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, while an ensemble of multiple classifiers trained on different downsampled training sets achieves slightly poorer performance but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristic (ROC) curves or the area under the curve (AUC), the sampling approaches outperform the original training set. This observation is important if the Sentence Boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the Sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may depend on the task, the classifiers, and the knowledge combination approach.

  • Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech
    Empirical Methods in Natural Language Processing, 2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Mary P Harper
    Abstract:

    We compare and contrast two different models for detecting Sentence-like units in continuous speech. The first approach uses hidden Markov sequence models based on N-grams and maximum likelihood estimation, and employs model interpolation to combine different representations of the data. The second approach models the posterior probabilities of the target classes; it is discriminative and integrates multiple knowledge sources in the maximum entropy (maxent) framework. Both models combine lexical, syntactic, and prosodic information. We develop a technique for integrating pretrained probability models into the maxent framework, and show that this approach can improve on an HMM-based state-of-the-art system for the Sentence Boundary detection task. An even more substantial improvement is obtained by combining the posterior probabilities of the two systems.

  • The ICSI-SRI-UW RT-04 Structural Metadata Extraction System
    2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Barbara Peskin, Mary Harper
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe the ICSI-SRI-UW metadata detection system in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of textual knowledge sources (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. In addition to our previous HMM approach, we investigate using a maximum entropy (Maxent) and a conditional random field (CRF) approach for various tasks. Results using these techniques are presented for the 2004 NIST Rich Transcription metadata tasks.

  • The ICSI-SRI-UW Metadata Extraction System
    Conference of the International Speech Communication Association, 2004
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Mary P Harper, Dustin Hillard, Mari Ostendorf, B. Peskin
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of “metadata” (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the Sentence Boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.

E. Shriberg - One of the best experts on this subject based on the ideXlab platform.

  • A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
    Computer Speech & Language, 2006
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Nitesh V Chawla, Mary P Harper
    Abstract:

    Enriching speech recognition output with Sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect Sentence boundaries that uses both prosodic and textual information. Since there are more non-Sentence boundaries than Sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST Sentence Boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, while an ensemble of multiple classifiers trained on different downsampled training sets achieves slightly poorer performance but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristic (ROC) curves or the area under the curve (AUC), the sampling approaches outperform the original training set. This observation is important if the Sentence Boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the Sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may depend on the task, the classifiers, and the knowledge combination approach.

  • Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech
    Empirical Methods in Natural Language Processing, 2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Mary P Harper
    Abstract:

    We compare and contrast two different models for detecting Sentence-like units in continuous speech. The first approach uses hidden Markov sequence models based on N-grams and maximum likelihood estimation, and employs model interpolation to combine different representations of the data. The second approach models the posterior probabilities of the target classes; it is discriminative and integrates multiple knowledge sources in the maximum entropy (maxent) framework. Both models combine lexical, syntactic, and prosodic information. We develop a technique for integrating pretrained probability models into the maxent framework, and show that this approach can improve on an HMM-based state-of-the-art system for the Sentence Boundary detection task. An even more substantial improvement is obtained by combining the posterior probabilities of the two systems.

  • The ICSI-SRI-UW RT-04 Structural Metadata Extraction System
    2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Barbara Peskin, Mary Harper
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe the ICSI-SRI-UW metadata detection system in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of textual knowledge sources (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. In addition to our previous HMM approach, we investigate using a maximum entropy (Maxent) and a conditional random field (CRF) approach for various tasks. Results using these techniques are presented for the 2004 NIST Rich Transcription metadata tasks.

  • The ICSI-SRI-UW Metadata Extraction System
    Conference of the International Speech Communication Association, 2004
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Mary P Harper, Dustin Hillard, Mari Ostendorf, B. Peskin
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of “metadata” (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the Sentence Boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.

Mary P Harper - One of the best experts on this subject based on the ideXlab platform.

  • A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech
    Computer Speech & Language, 2006
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Nitesh V Chawla, Mary P Harper
    Abstract:

    Enriching speech recognition output with Sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect Sentence boundaries that uses both prosodic and textual information. Since there are more non-Sentence boundaries than Sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST Sentence Boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, while an ensemble of multiple classifiers trained on different downsampled training sets achieves slightly poorer performance but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristic (ROC) curves or the area under the curve (AUC), the sampling approaches outperform the original training set. This observation is important if the Sentence Boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the Sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may depend on the task, the classifiers, and the knowledge combination approach.

  • Comparing and Combining Generative and Posterior Probability Models: Some Advances in Sentence Boundary Detection in Speech
    Empirical Methods in Natural Language Processing, 2004
    Co-Authors: Yang Liu, E. Shriberg, A. Stolcke, Mary P Harper
    Abstract:

    We compare and contrast two different models for detecting Sentence-like units in continuous speech. The first approach uses hidden Markov sequence models based on N-grams and maximum likelihood estimation, and employs model interpolation to combine different representations of the data. The second approach models the posterior probabilities of the target classes; it is discriminative and integrates multiple knowledge sources in the maximum entropy (maxent) framework. Both models combine lexical, syntactic, and prosodic information. We develop a technique for integrating pretrained probability models into the maxent framework, and show that this approach can improve on an HMM-based state-of-the-art system for the Sentence Boundary detection task. An even more substantial improvement is obtained by combining the posterior probabilities of the two systems.

  • The ICSI-SRI-UW Metadata Extraction System
    Conference of the International Speech Communication Association, 2004
    Co-Authors: E. Shriberg, A. Stolcke, Yang Liu, Mary P Harper, Dustin Hillard, Mari Ostendorf, B. Peskin
    Abstract:

    Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of “metadata” (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include Sentence Boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the Sentence Boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.

Sandra Maria Aluisio - One of the best experts on this subject based on the ideXlab platform.

  • Sentence Segmentation and Disfluency Detection in Narrative Transcripts from Neuropsychological Tests
    Processing of the Portuguese Language, 2018
    Co-Authors: Marcos Vinicius Treviso, Sandra Maria Aluisio
    Abstract:

    Natural Language Processing (NLP) tools aiming at the diagnosis of language-impairing dementias generally extract several textual metrics from narrative transcripts. However, the absence of Sentence Boundary segmentation in transcripts prevents the direct application of NLP methods which rely on these marks to work properly, such as taggers and parsers. We present a method to segment the transcripts into Sentences and another to detect the disfluencies present in them, to serve as a preprocessing step for the application of subsequent NLP tools. Our methods use recurrent convolutional neural networks with prosodic and morphosyntactic features and word embeddings. We evaluated both tasks intrinsically, analyzing the most important features, comparing the proposed methods to simpler ones, and identifying the main hits and misses. In addition, a final method was created to combine all tasks, and it was evaluated extrinsically using nine syntactic metrics from Coh-Metrix-Dementia. In the intrinsic evaluations, we showed that our method achieved (i) state-of-the-art results for the Sentence segmentation task on impaired speech, and (ii) results that are similar to related work for the English language on disfluency detection tasks. Regarding the extrinsic evaluation, only three metrics showed a statistically significant difference between manual MCI transcripts and those generated by our method, suggesting that our method can preprocess transcriptions for further analysis by NLP tools.
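
    The sketch below shows one way such a recurrent convolutional tagger could be wired up: word embeddings concatenated with prosodic features, a 1-D convolution, a bidirectional recurrent layer, and a per-token boundary/no-boundary softmax. Keras is an assumed framework choice and all layer sizes are illustrative; the model described in the paper differs in its details.

```python
# Rough sketch of a recurrent convolutional sentence-segmentation tagger.
# Framework (Keras), layer sizes, and feature dimensions are assumptions
# made for illustration, not the authors' exact architecture.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMB_DIM, N_PROSODIC = 10000, 100, 4

word_ids = layers.Input(shape=(None,), dtype="int32", name="word_ids")
prosody = layers.Input(shape=(None, N_PROSODIC), name="prosodic_features")

emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(word_ids)
x = layers.Concatenate()([emb, prosody])               # (batch, time, EMB_DIM + N_PROSODIC)
x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
boundary = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(x)  # boundary vs. no boundary

model = Model(inputs=[word_ids, prosody], outputs=boundary)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```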

  • Detecting mild cognitive impairment in narratives in Brazilian Portuguese: first steps towards a fully automated system
    2018
    Co-Authors: Marcos Vinicius Treviso, Christopher Shulby, Leandro Borges Dos Santos, Lilian Cristine Hübner, Letícia Lessa Mansur, Sandra Maria Aluisio
    Abstract:

    In recent years, Mild Cognitive Impairment (MCI) has received a great deal of attention, as it may represent a pre-clinical state of Alzheimer's disease (AD). In the distinction between healthy elderly (CTL) and MCI patients, automated discourse analysis tools have been applied to narrative transcripts in English and in Brazilian Portuguese. However, the absence of Sentence Boundary segmentation in transcripts prevents the direct application of methods that rely on these marks for the correct use of tools such as taggers and parsers. To our knowledge, there are only a few studies evaluating automatic Sentence segmentation in transcripts of neuropsychological tests. The purpose of this study is to investigate the impact of the automatic Sentence segmentation method DeepBond on nine syntactic complexity metrics extracted from transcripts of CTL and MCI patients.

  • Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests Using Recurrent Convolutional Neural Networks
    Conference of the European Chapter of the Association for Computational Linguistics, 2017
    Co-Authors: Marcos Vinicius Treviso, Christopher Shulby, Sandra Maria Aluisio
    Abstract:

    Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of Sentence Boundary segmentation in the transcripts prevents the direct application of NLP methods which rely on these marks in order to function properly, such as taggers and parsers. We present the first steps taken towards automatic neuropsychological evaluation based on narrative discourse analysis, presenting a new automatic Sentence segmentation method for impaired speech. Our model uses recurrent convolutional neural networks with prosodic, Part of Speech (PoS) features, and word embeddings. It was evaluated intrinsically on impaired, spontaneous speech as well as normal, prepared speech and presents better results for healthy elderly (CTL) (F1 = 0.74) and Mild Cognitive Impairment (MCI) patients (F1 = 0.70) than the Conditional Random Fields method (F1 = 0.55 and 0.53, respectively) used in the same context of our study. The results suggest that our model is robust for impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by MCI and CTL.

  • Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests Using Recurrent Convolutional Neural Networks
    arXiv: Computation and Language, 2016
    Co-Authors: Marcos Vinicius Treviso, Christopher Shulby, Sandra Maria Aluisio
    Abstract:

    Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of Sentence Boundary segmentation in the transcripts prevents the direct application of NLP methods which rely on these marks to function properly, such as taggers and parsers. We present the first steps taken towards automatic neuropsychological evaluation based on narrative discourse analysis, presenting a new automatic Sentence segmentation method for impaired speech. Our model uses recurrent convolutional neural networks with prosodic, Part of Speech (PoS) features, and word embeddings. It was evaluated intrinsically on impaired, spontaneous speech as well as normal, prepared speech, and presents better results for healthy elderly (CTL) (F1 = 0.74) and Mild Cognitive Impairment (MCI) patients (F1 = 0.70) than the Conditional Random Fields method (F1 = 0.55 and 0.53, respectively) used in the same context of our study. The results suggest that our model is robust for impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by MCI and CTL.