Imbalanced Data

The Experts below are selected from a list of 14,136 Experts worldwide, ranked by the ideXlab platform.

Howard D. Bondell - One of the best experts on this subject based on the ideXlab platform.

  • Binormal Precision–Recall Curves for Optimal Classification of Imbalanced Data
    Statistics in Biosciences, 2019
    Co-Authors: Howard D. Bondell
    Abstract:

    Binary classification on Imbalanced Data, i.e., Data with a large skew in the class distribution, is a challenging problem. Classifiers are commonly evaluated via the receiver operating characteristic (ROC) curve, and techniques that optimize the area under the ROC curve have been proposed. However, for Imbalanced Data, the ROC curve tends to give an overly optimistic view. Given these disadvantages, we propose an approach based on the Precision–Recall (PR) curve under the binormal assumption: choose the classifier that maximizes the area under the binormal PR curve. The asymptotic distribution of the resulting estimator is derived. Simulations, as well as real-Data results, indicate that the binormal Precision–Recall method outperforms approaches based on the area under the ROC curve.
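Under the binormal assumption, the entire PR curve is determined by five quantities: the two class-conditional score means and standard deviations, plus the prevalence. The numerical sketch below (not the authors' estimator; all parameter names are illustrative) traces precision and recall over a threshold grid and integrates the area.

```python
import numpy as np
from math import erf, sqrt

# Standard normal CDF, vectorized over a threshold grid.
_phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def binormal_pr_auc(mu_pos, sd_pos, mu_neg, sd_neg, prevalence, n_grid=4000):
    """Area under the binormal PR curve: positive scores ~ N(mu_pos, sd_pos^2),
    negative scores ~ N(mu_neg, sd_neg^2), prevalence = P(y = 1)."""
    t = np.linspace(mu_neg - 6 * sd_neg, mu_pos + 6 * sd_pos, n_grid)
    recall = 1.0 - _phi((t - mu_pos) / sd_pos)   # TPR at threshold t
    fpr = 1.0 - _phi((t - mu_neg) / sd_neg)      # FPR at threshold t
    denom = prevalence * recall + (1.0 - prevalence) * fpr
    precision = np.where(denom > 0,
                         prevalence * recall / np.maximum(denom, 1e-300), 1.0)
    # Recall decreases along t, so integrate precision d(recall) accordingly.
    return float(np.sum(0.5 * (precision[1:] + precision[:-1])
                        * (recall[:-1] - recall[1:])))
```

Maximizing this quantity over the fitted score distributions of candidate classifiers is the selection rule the abstract describes.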

Chris Papachristou - One of the best experts on this subject based on the ideXlab platform.

  • IJCNN - Multi-label Imbalanced Data enrichment process in neural net classifier training
    2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008
    Co-Authors: Gorn Tepvorachai, Chris Papachristou
    Abstract:

    Semantic scene classification, robotic state recognition, and many other real-world applications involve multi-label classification with Imbalanced Data. In this paper, we address these problems by using an enrichment process in neural net training. The enrichment process can manage the Imbalanced Data and train the neural net with high classification accuracy. Experimental results on a robotic arm controller show that our method has better generalization performance than traditional neural net training in solving the multi-label and Imbalanced Data problems.
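The paper's enrichment algorithm is not reproduced here, but one plausible reading — replicating examples of under-represented labels until every label reaches the most frequent label's count — can be sketched as follows (function name and strategy are assumptions, not the authors' code):

```python
import numpy as np

def enrich_multilabel(X, Y, rng=None):
    """Replicate examples of under-represented labels until every column of the
    binary label matrix Y reaches the most frequent label's count. Because an
    example can carry several labels, replication for one label also raises the
    counts of co-occurring labels; this sketch ignores that interaction."""
    rng = np.random.default_rng(rng)
    counts = Y.sum(axis=0)
    target = int(counts.max())
    new_X, new_Y = [X], [Y]
    for j in range(Y.shape[1]):
        deficit = target - int(counts[j])
        pool = np.flatnonzero(Y[:, j] == 1)   # examples carrying label j
        if deficit > 0 and pool.size > 0:
            picks = rng.choice(pool, size=deficit, replace=True)
            new_X.append(X[picks])
            new_Y.append(Y[picks])
    return np.vstack(new_X), np.vstack(new_Y)
```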

Jason Van Hulse - One of the best experts on this subject based on the ideXlab platform.

  • IRI - Hybrid sampling for Imbalanced Data
    2008 IEEE International Conference on Information Reuse and Integration, 2008
    Co-Authors: C. Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse
    Abstract:

    Decision tree learning in the presence of Imbalanced Data is an issue of great practical importance, as such Data is ubiquitous in a wide variety of application domains. We propose hybrid Data sampling, which combines two sampling techniques, random oversampling and random undersampling, to create a balanced Dataset for use in constructing decision tree classification models. The results demonstrate that our methodology is often able to improve the performance of a C4.5 decision tree learner in the context of Imbalanced Data.
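As a rough illustration of combining the two techniques, one can undersample the majority class and oversample the minority class toward a common intermediate size; the geometric-mean target below is an arbitrary illustrative choice, not necessarily the paper's:

```python
import numpy as np

def hybrid_sample(X, y, rng=None):
    """Hybrid Data sampling sketch: undersample the majority class and
    oversample the minority class toward a common intermediate size
    (their geometric mean, which never exceeds the majority count)."""
    rng = np.random.default_rng(rng)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if pos.size < neg.size else (neg, pos)
    target = int(round(np.sqrt(minority.size * majority.size)))
    keep = rng.choice(majority, size=target, replace=False)   # undersample
    boost = rng.choice(minority, size=target, replace=True)   # oversample
    idx = np.concatenate([keep, boost])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

The balanced pair `(X[idx], y[idx])` would then be fed to the tree learner in place of the raw training set.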

  • Experimental perspectives on learning from Imbalanced Data
    International Conference on Machine Learning, 2007
    Co-Authors: Jason Van Hulse, Taghi M. Khoshgoftaar, Amri Napolitano
    Abstract:

    We present a comprehensive suite of experimentation on the subject of learning from Imbalanced Data. When classes are Imbalanced, many learning algorithms can suffer reduced performance. Can Data sampling be used to improve the performance of learners built from Imbalanced Data? Is the effectiveness of sampling related to the type of learner? Do the results change if the objective is to optimize different performance metrics? We address these and other issues in this work, showing that sampling in many cases will improve classifier performance.
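The first question — whether sampling helps — can be made concrete with a toy experiment. A classifier whose decision rule depends on the empirical class prior (a tiny Gaussian naive Bayes here, standing in for the study's learners) typically gains minority-class recall after random oversampling:

```python
import numpy as np

def fit_nb(X, y):
    """Tiny Gaussian naive Bayes; its log-prior term is what makes the
    decision rule sensitive to how the classes were sampled."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (np.log(len(Xc) / len(X)), Xc.mean(axis=0),
                     Xc.var(axis=0) + 1e-9)
    return params

def predict_nb(params, X):
    scores = []
    for c in (0, 1):
        lp, mu, var = params[c]
        scores.append(lp - 0.5 * np.sum(np.log(2 * np.pi * var)
                                        + (X - mu) ** 2 / var, axis=1))
    return (scores[1] > scores[0]).astype(int)

def oversample(X, y, rng=None):
    """Random oversampling of the minority class to parity with the majority."""
    rng = np.random.default_rng(rng)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if pos.size < neg.size else (neg, pos)
    extra = rng.choice(minority, size=majority.size - minority.size,
                       replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]
```

Training on the oversampled set shifts the fitted prior toward the minority class, so more borderline points are predicted positive.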

  • An empirical study of learning from Imbalanced Data using random forest
    Proceedings - International Conference on Tools with Artificial Intelligence ICTAI, 2007
    Co-Authors: Taghi M. Khoshgoftaar, Moiz Golawala, Jason Van Hulse
    Abstract:

    This paper discusses a comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka. RF is a relatively new learner, and to the best of our knowledge, only preliminary experimentation on the construction of random forest classifiers in the context of Imbalanced Data has been reported in previous work. Therefore, the contribution of this study is to provide an extensive empirical evaluation of RF learners built from Imbalanced Data. What should be the recommended default number of trees in the ensemble? What should the recommended value be for the number of attributes? How does the RF learner perform on Imbalanced Data when compared with other commonly-used learners? We address these and other related issues in this work.
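Weka's RF implementation is not reproduced here; the stump-based toy ensemble below serves only to make the two tuned knobs concrete — the number of trees and the number of attributes sampled per split (with the common sqrt(p) heuristic as a stand-in default):

```python
import numpy as np

def fit_forest(X, y, n_trees=100, n_attrs=None, rng=None):
    """Toy ensemble of bootstrapped decision stumps (NOT Weka's RF), written
    only to expose the study's two knobs: n_trees (ensemble size) and
    n_attrs (attributes considered per split)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    n_attrs = n_attrs or max(1, int(np.sqrt(p)))
    stumps = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                    # bootstrap sample
        feats = rng.choice(p, size=n_attrs, replace=False)   # attribute subset
        best = (-1.0, 0, 0.0, False)
        for f in feats:
            thr = float(np.median(X[boot, f]))
            hit = ((X[boot, f] > thr).astype(int) == y[boot]).mean()
            acc, flip = max(hit, 1.0 - hit), hit < 0.5
            if acc > best[0]:
                best = (acc, f, thr, flip)
        stumps.append(best[1:])
    return stumps

def predict_forest(stumps, X):
    """Majority vote over the stump ensemble."""
    votes = [np.logical_xor(X[:, f] > thr, flip) for f, thr, flip in stumps]
    return (np.mean(votes, axis=0) > 0.5).astype(int)
```

Sweeping `n_trees` and `n_attrs` over a grid and scoring each setting on held-out Imbalanced Data mirrors, in miniature, the tuning questions the paper investigates.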

Bing-yang Zhang - One of the best experts on this subject based on the ideXlab platform.

  • Stable variable selection of class-Imbalanced Data with precision-recall criterion
    Chemometrics and Intelligent Laboratory Systems, 2017
    Co-Authors: Bing-yang Zhang
    Abstract:

    Screening important variables in class-Imbalanced Data is still a challenging task. In this study, we propose an algorithm for stably selecting key variables in class-Imbalanced Data based on the precision-recall curve (PRC): the PRC is used as the assessment criterion in the model-building stage, and sparse regularized logistic regression combined with subsampling (SRLRS) is designed to perform stable variable selection. Considering the characteristics of class-Imbalanced Data, we also propose classification-based partitioning for cross-validation, as well as leaving half of the majority observations out and leaving one minority observation out (LHO-LOO) for subsampling. Simulation results and real Data show that our algorithm is highly suitable for handling class-Imbalanced Data, and that the PRC can serve as an alternative evaluation criterion for model selection on such Data.
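The SRLRS procedure itself is not reproduced here; the sketch below captures its spirit under stated assumptions: repeated class-aware subsamples (half the majority kept, one minority observation left out, echoing LHO-LOO), an L1-penalized logistic fit per subsample via proximal gradient, and selection of the variables that are non-zero in at least a threshold fraction of fits. All function names and defaults are illustrative.

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.1, iters=500):
    """L1-penalised logistic regression via proximal gradient (ISTA);
    no intercept, for brevity."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ w))) - y) / n
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

def stable_select(X, y, lam=0.1, n_sub=40, thresh=0.6, rng=None):
    """Stability-selection sketch in the spirit of SRLRS: each subsample keeps
    half of the majority class and all but one randomly chosen minority
    observation (echoing LHO-LOO); variables whose coefficient is non-zero in
    at least `thresh` of the fits are selected."""
    rng = np.random.default_rng(rng)
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    freq = np.zeros(X.shape[1])
    for _ in range(n_sub):
        keep_maj = rng.choice(maj, size=maj.size // 2, replace=False)
        keep_min = np.delete(mino, rng.integers(mino.size))  # leave one out
        idx = np.concatenate([keep_maj, keep_min])
        w = l1_logistic(X[idx], y[idx], lam=lam)
        freq += (np.abs(w) > 1e-8)
    return np.flatnonzero(freq / n_sub >= thresh)
```

In the full algorithm the PRC, rather than accuracy, would then be used to compare candidate models built on the selected variables.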

Edwardo A. Garcia - One of the best experts on this subject based on the ideXlab platform.

  • Learning from Imbalanced Data
    IEEE Transactions on Knowledge and Data Engineering, 2009
    Co-Authors: Haibo He, Edwardo A. Garcia
    Abstract:

    With the continuous expansion of Data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw Data to support decision-making processes. Although existing knowledge discovery and Data engineering techniques have shown great success in many real-world applications, the problem of learning from Imbalanced Data (the Imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The Imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented Data and severe class distribution skews. Due to the inherent complex characteristics of Imbalanced Data sets, learning from such Data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw Data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from Imbalanced Data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the Imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from Imbalanced Data.