Extractive Summarization

The Experts below are selected from a list of 360 Experts worldwide, ranked by the ideXlab platform.

Pushpak Bhattacharyya - One of the best experts on this subject based on the ideXlab platform.

  • WikiSent: Weakly Supervised Sentiment Analysis through Extractive Summarization with Wikipedia
    European Conference on Machine Learning, 2012
    Co-Authors: Subhabrata Mukherjee, Pushpak Bhattacharyya
    Abstract:

    This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based only on those sentences bearing opinion on the movie, leaving out other irrelevant text. Wikipedia supplies the system with world knowledge of movie-specific features, which is used to obtain an extractive summary of the review consisting of the reviewer's opinions about specific aspects of the movie. This filters out concepts that are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves accuracy better than or comparable to existing semi-supervised and unsupervised systems in the domain on the same dataset. We also perform a general movie review trend analysis using WikiSent.
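
    As a rough illustration of the idea behind WikiSent (not its actual pipeline, which uses Wikipedia article structure, a POS tagger, WordNet and full sentiment lexicons), the sketch below keeps only review sentences that mention a movie aspect and then scores them with a tiny sentiment lexicon; the aspect list, lexicon and function names here are hypothetical stand-ins.

```python
# Illustrative sketch only: WikiSent's real pipeline uses Wikipedia structure,
# a POS tagger, WordNet and full sentiment lexicons; everything here is toy data.
import re

# Hypothetical aspect terms, as might be mined from a movie's Wikipedia article.
MOVIE_ASPECTS = {"plot", "acting", "direction", "screenplay", "soundtrack", "cast"}

# Tiny stand-in for a sentiment lexicon.
POSITIVE = {"brilliant", "moving", "great", "superb"}
NEGATIVE = {"dull", "weak", "boring", "poor"}

def extractive_summary(review: str) -> list[str]:
    """Keep only sentences that comment on some movie aspect."""
    sentences = re.split(r"(?<=[.!?])\s+", review)
    return [s for s in sentences if any(a in s.lower() for a in MOVIE_ASPECTS)]

def classify(review: str) -> str:
    """Aggregate lexicon polarity over the extractive summary of the review."""
    score = 0
    for sentence in extractive_summary(review):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        score += len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative"

print(classify("The acting was superb and moving. I had popcorn. The plot felt a little dull."))
```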

  • WikiSent: Weakly Supervised Sentiment Analysis through Extractive Summarization with Wikipedia
    arXiv: Information Retrieval, 2012
    Co-Authors: Subhabrata Mukherjee, Pushpak Bhattacharyya
    Abstract:

    This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based only on those sentences bearing opinion on the movie. Irrelevant text, not directly related to the reviewer's opinion of the movie, is left out of the analysis. Wikipedia supplies the system with world knowledge of movie-specific features, which is used to obtain an extractive summary of the review consisting of the reviewer's opinions about specific aspects of the movie. This filters out concepts that are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. The only weak supervision arises from the use of resources such as WordNet, a part-of-speech tagger and sentiment lexicons, by virtue of their construction. WikiSent achieves a considerable accuracy improvement over the baseline, and its accuracy is better than or comparable to existing semi-supervised and unsupervised systems in the domain on the same dataset. We also perform a general movie review trend analysis using WikiSent to find trends in movie-making and public acceptance in terms of movie genre, year of release and polarity.

Ming Zhou - One of the best experts on this subject based on the ideXlab platform.

  • At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization
    International Conference on Computational Linguistics, 2020
    Co-Authors: Qingyu Zhou, Furu Wei, Ming Zhou
    Abstract:

    Extractive methods have proven effective in automatic document summarization. Previous works perform this task by identifying informative content at the sentence level. However, it is unclear whether extraction at the sentence level is the best solution. In this work, we show that extracting full sentences raises issues of unnecessary and redundant content, and that extracting sub-sentential units is a promising alternative. Specifically, we propose extracting sub-sentential units based on the constituency parse tree. We present a neural extractive model that leverages this sub-sentential information and extracts such units. Extensive experiments and analyses show that extracting sub-sentential units performs competitively with full-sentence extraction under both automatic and human evaluation. We hope our work provides some inspiration for future research on the basic extraction units in extractive summarization.
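
    A minimal sketch of what extracting sub-sentential units from a constituency parse can look like, using NLTK trees; the hand-written parse, the chosen constituent labels and the length filter are illustrative assumptions, and the paper's neural scoring and selection are not modeled here.

```python
# Minimal sketch: pull candidate sub-sentential units (clauses / phrases) out of
# a constituency parse; the paper's neural model scores and selects such units.
from nltk.tree import Tree

# A hand-written parse standing in for the output of a constituency parser.
parse = Tree.fromstring(
    "(S (NP (DT The) (NN committee)) "
    "(VP (VBD approved) (NP (DT the) (NN plan)) "
    "(SBAR (IN although) (S (NP (NNS critics)) (VP (VBD objected))))))"
)

# Constituent labels treated (arbitrarily, for illustration) as extraction units.
UNIT_LABELS = {"S", "SBAR", "VP", "NP"}

def sub_sentential_units(tree: Tree, min_words: int = 2) -> list[str]:
    """Return every constituent with a unit label and at least min_words leaves."""
    units = []
    for subtree in tree.subtrees(lambda t: t.label() in UNIT_LABELS):
        words = subtree.leaves()
        if len(words) >= min_words:
            units.append(" ".join(words))
    return units

for unit in sub_sentential_units(parse):
    print(unit)
```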

  • Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers
    Empirical Methods in Natural Language Processing, 2020
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based, with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for unsupervised extractive summarization. Specifically, we first pre-train a hierarchical transformer model using unlabeled documents only. Then we propose a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on the CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization. We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results.
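
    The following sketch shows one way sentence-level self-attention could be turned into an unsupervised ranking: score each sentence by the attention it receives from the other sentences and extract the top-k. The attention matrix here is random stand-in data, and the exact scoring used in the paper may differ.

```python
# Sketch of attention-based sentence ranking for unsupervised extraction.
# In the paper the attention comes from a pre-trained hierarchical transformer;
# here it is faked with a random, row-normalised matrix.
import numpy as np

rng = np.random.default_rng(0)
num_sentences, k = 6, 2

# attn[i, j] ~ how much sentence i attends to sentence j.
attn = rng.random((num_sentences, num_sentences))
attn = attn / attn.sum(axis=1, keepdims=True)

# Score a sentence by the total attention it receives from the other sentences.
received = attn.sum(axis=0) - attn.diagonal()
summary_ids = sorted(np.argsort(-received)[:k])
print("selected sentence indices:", summary_ids)
```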

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    arXiv: Computation and Language, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.
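
    As a structural sketch of a hierarchical document encoder (assuming PyTorch; the sizes, pooling and layer counts below are illustrative, not HIBERT's actual configuration), a first transformer encodes the tokens of each sentence, a sentence vector is pooled, and a second transformer runs over the sequence of sentence vectors.

```python
# Sketch of a hierarchical document encoder in PyTorch (illustrative sizes only).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers)  # over tokens
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers)    # over sentences

    def forward(self, token_ids):
        # token_ids: (num_sentences, tokens_per_sentence) for a single document.
        tokens = self.sent_encoder(self.embed(token_ids))  # (sents, toks, d_model)
        sent_vecs = tokens.mean(dim=1)                     # crude pooling per sentence
        return self.doc_encoder(sent_vecs.unsqueeze(0))    # (1, sents, d_model)

doc = torch.randint(0, 30522, (5, 12))   # 5 sentences, 12 token ids each
print(HierarchicalEncoder()(doc).shape)  # torch.Size([1, 5, 256])
```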

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    Meeting of the Association for Computational Linguistics, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

  • Neural Latent Extractive Document Summarization
    Empirical Methods in Natural Language Processing, 2018
    Co-Authors: Xingxing Zhang, Furu Wei, Mirella Lapata, Ming Zhou
    Abstract:

    Extractive summarization models need sentence-level labels, which are usually created with rule-based methods, since most summarization datasets have only document-summary pairs. These labels might be suboptimal. We propose a latent-variable extractive model in which sentences are viewed as latent variables, and sentences with activated variables are used to infer gold summaries. During training, the loss can come directly from gold summaries. Experiments on the CNN/Dailymail dataset show our latent extractive model outperforms a strong extractive baseline trained on rule-based labels and also performs competitively with several recent models.
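
    One common way to train binary sentence-selection variables directly against gold summaries is a REINFORCE-style surrogate loss with a summary-level reward; the toy sketch below uses unigram overlap as a stand-in for ROUGE and random scores in place of a sentence encoder, and it may differ from the paper's exact training objective.

```python
# Toy sketch of "the loss comes directly from gold summaries": sample binary
# sentence-selection variables and reward the sampled subset by its overlap with
# the gold summary (a crude stand-in for ROUGE), using a REINFORCE-style loss.
import torch

def overlap_reward(selected_sents, gold_summary):
    """Unigram recall of the gold summary by the selected sentences (toy metric)."""
    pred = set(" ".join(selected_sents).lower().split())
    gold = gold_summary.lower().split()
    return sum(w in pred for w in gold) / max(len(gold), 1)

sentences = ["the cat sat on the mat", "stocks fell sharply today", "the cat purred"]
gold = "the cat sat and purred"

# In the real model these scores would come from a sentence encoder.
logits = torch.randn(len(sentences), requires_grad=True)
dist = torch.distributions.Bernoulli(torch.sigmoid(logits))
z = dist.sample()                                   # latent selection variables
selected = [s for s, keep in zip(sentences, z.tolist()) if keep]
reward = overlap_reward(selected, gold)
loss = -(reward * dist.log_prob(z).sum())           # REINFORCE surrogate loss
loss.backward()                                     # gradient w.r.t. the logits
print(float(reward), logits.grad)
```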

Subhabrata Mukherjee - One of the best experts on this subject based on the ideXlab platform.

  • WikiSent: Weakly Supervised Sentiment Analysis through Extractive Summarization with Wikipedia
    European Conference on Machine Learning, 2012
    Co-Authors: Subhabrata Mukherjee, Pushpak Bhattacharyya
    Abstract:

    This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based only on those sentences bearing opinion on the movie, leaving out other irrelevant text. Wikipedia supplies the system with world knowledge of movie-specific features, which is used to obtain an extractive summary of the review consisting of the reviewer's opinions about specific aspects of the movie. This filters out concepts that are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves accuracy better than or comparable to existing semi-supervised and unsupervised systems in the domain on the same dataset. We also perform a general movie review trend analysis using WikiSent.

  • WikiSent: Weakly Supervised Sentiment Analysis through Extractive Summarization with Wikipedia
    arXiv: Information Retrieval, 2012
    Co-Authors: Subhabrata Mukherjee, Pushpak Bhattacharyya
    Abstract:

    This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based only on those sentences bearing opinion on the movie. Irrelevant text, not directly related to the reviewer's opinion of the movie, is left out of the analysis. Wikipedia supplies the system with world knowledge of movie-specific features, which is used to obtain an extractive summary of the review consisting of the reviewer's opinions about specific aspects of the movie. This filters out concepts that are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. The only weak supervision arises from the use of resources such as WordNet, a part-of-speech tagger and sentiment lexicons, by virtue of their construction. WikiSent achieves a considerable accuracy improvement over the baseline, and its accuracy is better than or comparable to existing semi-supervised and unsupervised systems in the domain on the same dataset. We also perform a general movie review trend analysis using WikiSent to find trends in movie-making and public acceptance in terms of movie genre, year of release and polarity.

Furu Wei - One of the best experts on this subject based on the ideXlab platform.

  • At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization
    International Conference on Computational Linguistics, 2020
    Co-Authors: Qingyu Zhou, Furu Wei, Ming Zhou
    Abstract:

    Extractive methods have proven effective in automatic document summarization. Previous works perform this task by identifying informative content at the sentence level. However, it is unclear whether extraction at the sentence level is the best solution. In this work, we show that extracting full sentences raises issues of unnecessary and redundant content, and that extracting sub-sentential units is a promising alternative. Specifically, we propose extracting sub-sentential units based on the constituency parse tree. We present a neural extractive model that leverages this sub-sentential information and extracts such units. Extensive experiments and analyses show that extracting sub-sentential units performs competitively with full-sentence extraction under both automatic and human evaluation. We hope our work provides some inspiration for future research on the basic extraction units in extractive summarization.

  • Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers
    Empirical Methods in Natural Language Processing, 2020
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based, with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for unsupervised extractive summarization. Specifically, we first pre-train a hierarchical transformer model using unlabeled documents only. Then we propose a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on the CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization. We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results.

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    arXiv: Computation and Language, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    Meeting of the Association for Computational Linguistics, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

  • Neural Latent Extractive Document Summarization
    Empirical Methods in Natural Language Processing, 2018
    Co-Authors: Xingxing Zhang, Furu Wei, Mirella Lapata, Ming Zhou
    Abstract:

    Extractive summarization models need sentence-level labels, which are usually created with rule-based methods, since most summarization datasets have only document-summary pairs. These labels might be suboptimal. We propose a latent-variable extractive model in which sentences are viewed as latent variables, and sentences with activated variables are used to infer gold summaries. During training, the loss can come directly from gold summaries. Experiments on the CNN/Dailymail dataset show our latent extractive model outperforms a strong extractive baseline trained on rule-based labels and also performs competitively with several recent models.

Xingxing Zhang - One of the best experts on this subject based on the ideXlab platform.

  • Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers
    Empirical Methods in Natural Language Processing, 2020
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Unsupervised extractive document summarization aims to select important sentences from a document without using labeled summaries during training. Existing methods are mostly graph-based, with sentences as nodes and edge weights measured by sentence similarities. In this work, we find that transformer attentions can be used to rank sentences for unsupervised extractive summarization. Specifically, we first pre-train a hierarchical transformer model using unlabeled documents only. Then we propose a method to rank sentences using sentence-level self-attentions and pre-training objectives. Experiments on the CNN/DailyMail and New York Times datasets show our model achieves state-of-the-art performance on unsupervised summarization. We also find in experiments that our model is less dependent on sentence positions. When using a linear combination of our model and a recent unsupervised model explicitly modeling sentence positions, we obtain even better results.

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    arXiv: Computation and Language, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

  • HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization
    Meeting of the Association for Computational Linguistics, 2019
    Co-Authors: Xingxing Zhang, Furu Wei, Ming Zhou
    Abstract:

    Neural extractive summarization models usually employ a hierarchical encoder for document encoding, and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these inaccurate labels is challenging. Inspired by recent work on pre-training transformer sentence encoders (Devlin et al., 2018), we propose HIBERT (as shorthand for HIerarchical Bidirectional Encoder Representations from Transformers) for document encoding, together with a method to pre-train it using unlabeled data. We apply the pre-trained HIBERT to our summarization model, and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

  • Neural Latent Extractive Document Summarization
    Empirical Methods in Natural Language Processing, 2018
    Co-Authors: Xingxing Zhang, Furu Wei, Mirella Lapata, Ming Zhou
    Abstract:

    Extractive summarization models need sentence-level labels, which are usually created with rule-based methods, since most summarization datasets have only document-summary pairs. These labels might be suboptimal. We propose a latent-variable extractive model in which sentences are viewed as latent variables, and sentences with activated variables are used to infer gold summaries. During training, the loss can come directly from gold summaries. Experiments on the CNN/Dailymail dataset show our latent extractive model outperforms a strong extractive baseline trained on rule-based labels and also performs competitively with several recent models.

  • Neural Latent Extractive Document Summarization
    arXiv: Computation and Language, 2018
    Co-Authors: Xingxing Zhang, Furu Wei, Mirella Lapata, Ming Zhou
    Abstract:

    Extractive summarization models require sentence-level labels, which are usually created heuristically (e.g., with rule-based methods), given that most summarization datasets have only document-summary pairs. Since these labels might be suboptimal, we propose a latent-variable extractive model in which sentences are viewed as latent variables, and sentences with activated variables are used to infer gold summaries. During training, the loss comes directly from gold summaries. Experiments on the CNN/Dailymail dataset show that our model improves over a strong extractive baseline trained on heuristically approximated labels and also performs competitively with several recent models.