Unlabeled Data

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies


The Experts below are selected from a list of 32,412 Experts worldwide ranked by the ideXlab platform

Masashi Sugiyama - One of the best experts on this subject based on the ideXlab platform.

  • Classification from Pairwise Similarities/Dissimilarities and Unlabeled Data via Empirical Risk Minimization
    Neural Computation, 2021
    Co-Authors: Takuya Shimada, Issei Sato, Han Bao, Masashi Sugiyama
    Abstract:

    Pairwise similarities and dissimilarities between data points are often obtained more easily than full labels of data in real-world classification problems. To make use of such pairwise information, an empirical risk minimization approach has been proposed, where an unbiased estimator of the classification risk is computed from only pairwise similarities and unlabeled data. However, this approach has not yet been able to handle pairwise dissimilarities. Semisupervised clustering methods can incorporate both similarities and dissimilarities into their framework; however, they typically require strong geometrical assumptions on the data distribution such as the manifold assumption, which may cause severe performance deterioration. In this letter, we derive an unbiased estimator of the classification risk based on all of similarities and dissimilarities and unlabeled data. We theoretically establish an estimation error bound and experimentally demonstrate the practical usefulness of our empirical risk minimization method.

  • Classification from Pairwise Similarities/Dissimilarities and Unlabeled Data via Empirical Risk Minimization
    arXiv: Learning, 2019
    Co-Authors: Takuya Shimada, Issei Sato, Han Bao, Masashi Sugiyama
    Abstract:

    Pairwise similarities and dissimilarities between data points might be easier to obtain than fully labeled data in real-world classification problems, e.g., in privacy-aware situations. To handle such pairwise information, an empirical risk minimization approach has been proposed, giving an unbiased estimator of the classification risk that can be computed only from pairwise similarities and unlabeled data. However, this approach has so far been unable to handle pairwise dissimilarities. Semi-supervised clustering, on the other hand, can use both similarities and dissimilarities; nevertheless, such methods typically require strong geometrical assumptions on the data distribution, such as the manifold assumption, which may deteriorate performance. In this paper, we derive an unbiased risk estimator that can handle all of similarities, dissimilarities, and unlabeled data. We theoretically establish estimation error bounds and experimentally demonstrate the practical usefulness of our empirical risk minimization method.

  • Classification from Pairwise Similarity and Unlabeled Data
    International Conference on Machine Learning, 2018
    Co-Authors: Han Bao, Gang Niu, Masashi Sugiyama
    Abstract:

    Supervised learning needs a large amount of labeled data, which can be a major bottleneck when privacy is a concern or the labeling cost is high. To overcome this problem, we propose a new weakly-supervised learning setting where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points are needed instead of fully labeled data, which we call SU classification. We show that an unbiased estimator of the classification risk can be obtained only from SU data, and the estimation error of its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments.
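
One concrete way to see that similar/unlabeled pairs carry label information: if two examples are drawn i.i.d. with class prior π = P(y = +1), a random pair is similar with probability π² + (1 − π)², which can be inverted for π on the π ≥ 1/2 branch. This is an illustrative closed form under an i.i.d.-pairs assumption, not necessarily the exact prior estimator used in the paper:

```python
import math

def prior_from_similar_fraction(p_s):
    """Recover the class prior pi >= 0.5 from the probability p_s that a
    random i.i.d. pair shares its class label.
    p_s = pi**2 + (1 - pi)**2  =>  pi = (1 + sqrt(2*p_s - 1)) / 2.
    """
    assert p_s >= 0.5, "a random i.i.d. pair agrees at least half the time"
    return (1 + math.sqrt(2 * p_s - 1)) / 2

# sanity check: prior 0.7 gives pair-agreement 0.49 + 0.09 = 0.58
print(prior_from_similar_fraction(0.58))  # close to 0.7
```

In practice p_s would be replaced by the observed fraction of similar pairs, so the recovered prior inherits that fraction's sampling error.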

  • Classification from Pairwise Similarity and Unlabeled Data
    arXiv: Learning, 2018
    Co-Authors: Masashi Sugiyama
    Abstract:

    One of the biggest bottlenecks in supervised learning is its high labeling cost. To overcome this problem, we propose a new weakly-supervised learning setting called SU classification, where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data are needed, instead of fully-supervised data. We show that an unbiased estimator of the classification risk can be obtained only from SU data, and its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments.

  • Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data
    International Conference on Machine Learning, 2017
    Co-Authors: Tomoya Sakai, Marthinus Christoffel Du Plessis, Gang Niu, Masashi Sugiyama
    Abstract:

    Most of the semi-supervised classification methods developed so far use unlabeled data for regularization purposes under particular distributional assumptions such as the cluster assumption. In contrast, recently developed methods of classification from positive and unlabeled data (PU classification) use unlabeled data for risk evaluation, i.e., label information is directly extracted from unlabeled data. In this paper, we extend PU classification to also incorporate negative data and propose a novel semi-supervised classification approach. We establish generalization error bounds for our novel methods and show that the bounds decrease with respect to the number of unlabeled data without the distributional assumptions that are required in existing semi-supervised classification methods. Through experiments, we demonstrate the usefulness of the proposed methods.
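
The "unlabeled data for risk evaluation" idea can be made concrete with the standard unbiased PU risk estimator, R(f) = π E_p[ℓ(f(x), +1)] + E_u[ℓ(f(x), −1)] − π E_p[ℓ(f(x), −1)], assuming a known class prior π. This is a minimal sketch of that textbook estimator, not the paper's extended PNU method:

```python
import numpy as np

def zero_one(scores, y):
    """Zero-one loss of sign(scores) against a label y in {-1, +1}."""
    return (np.sign(scores) != y).astype(float)

def pu_risk(scores_p, scores_u, prior, loss=zero_one):
    """Unbiased PU estimate of the classification risk:
    pi*E_p[l(f,+1)] + E_u[l(f,-1)] - pi*E_p[l(f,-1)]."""
    return (prior * loss(scores_p, +1).mean()
            + loss(scores_u, -1).mean()
            - prior * loss(scores_p, -1).mean())

# When "unlabeled" is the whole population and the prior matches its
# positive fraction, the PU estimate equals the supervised risk exactly.
rng = np.random.default_rng(0)
x_p = rng.normal(+1, 1, 70)   # 70 positives
x_n = rng.normal(-1, 1, 30)   # 30 negatives
f = lambda x: x               # a fixed toy scorer: predict sign(x)
sup = 0.7 * zero_one(f(x_p), +1).mean() + 0.3 * zero_one(f(x_n), -1).mean()
pu = pu_risk(f(x_p), f(np.concatenate([x_p, x_n])), prior=0.7)
print(abs(sup - pu))  # ≈ 0
```

The third term subtracts the positives "hiding" inside the unlabeled average, which is exactly what makes the estimator unbiased without any cluster-type assumption.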

Yuan Jiang - One of the best experts on this subject based on the ideXlab platform.

  • Spanning Attack: Reinforce Black-Box Attacks with Unlabeled Data
    Machine Learning, 2020
    Co-Authors: Lu Wang, Huan Zhang, Chojui Hsieh, Yuan Jiang
    Abstract:

    Adversarial black-box attacks aim to craft adversarial perturbations by querying input–output pairs of machine learning models. They are widely used to evaluate the robustness of pre-trained models. However, black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space, and therefore incur a false sense of model robustness. In this paper, we relax the conditions of the black-box threat model, and propose a novel technique called the spanning attack. By constraining adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset, the spanning attack significantly improves the query efficiency of a wide variety of existing black-box attacks. Extensive experiments show that the proposed method works favorably in both soft-label and hard-label black-box attacks.
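
The core trick, restricting perturbations to a subspace spanned by auxiliary unlabeled data, can be sketched with a PCA basis. This is an illustrative reconstruction under assumed details (SVD basis, random search direction); the paper's exact construction may differ:

```python
import numpy as np

def span_basis(x_unlabeled, k):
    """Orthonormal basis for a k-dim subspace spanned by the
    (centered) unlabeled data, obtained via SVD."""
    xc = x_unlabeled - x_unlabeled.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[:k]                      # shape (k, d)

def sample_perturbation(basis, eps, rng):
    """Random perturbation inside the subspace, scaled to L2 norm eps."""
    z = rng.normal(size=basis.shape[0])
    delta = z @ basis
    return eps * delta / np.linalg.norm(delta)

rng = np.random.default_rng(0)
x_u = rng.normal(size=(200, 50))       # auxiliary unlabeled data, d = 50
basis = span_basis(x_u, k=10)          # search in 10 dims instead of 50
delta = sample_perturbation(basis, eps=0.5, rng=rng)
# delta lies in the span: projecting onto the basis loses nothing
residual = delta - (delta @ basis.T) @ basis
print(np.linalg.norm(residual), np.linalg.norm(delta))
```

A query-based attack would then search over the k-dimensional coefficients z rather than all d pixel dimensions, which is where the query savings come from.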

  • Spanning Attack: Reinforce Black-Box Attacks with Unlabeled Data
    arXiv: Learning, 2020
    Co-Authors: Lu Wang, Huan Zhang, Chojui Hsieh, Yuan Jiang
    Abstract:

    Adversarial black-box attacks aim to craft adversarial perturbations by querying input-output pairs of machine learning models. They are widely used to evaluate the robustness of pre-trained models. However, black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space, and therefore incur a false sense of model robustness. In this paper, we relax the conditions of the black-box threat model, and propose a novel technique called the spanning attack. By constraining adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset, the spanning attack significantly improves the query efficiency of black-box attacks. Extensive experiments show that the proposed method works favorably in both soft-label and hard-label black-box attacks. Our code is available at this https URL.

  • Exploiting Unlabeled Data in Content-Based Image Retrieval
    Lecture Notes in Computer Science, 2004
    Co-Authors: Zhihua Zhou, Kejia Chen, Yuan Jiang
    Abstract:

    In this paper, the SSAIR (Semi-Supervised Active Image Retrieval) approach, which attempts to exploit unlabeled data to improve the performance of content-based image retrieval (CBIR), is proposed. This approach combines the merits of semi-supervised learning and active learning. In detail, in each round of relevance feedback, two simple learners are trained from the labeled data, i.e., images from the user query and user feedback. Each learner then classifies the unlabeled images in the database and passes the most relevant/irrelevant images to the other learner. After re-training with the additional labeled data, the learners classify the images in the database again and their classifications are merged. Images judged to be relevant with high confidence are returned as the retrieval result, while those judged with low confidence are put into the pool used in the next round of relevance feedback. Experiments show that semi-supervised learning and active learning mechanisms are both beneficial to CBIR.

Zhihua Zhou - One of the best experts on this subject based on the ideXlab platform.

  • Towards Making Unlabeled Data Never Hurt
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015
    Co-Authors: Yu-feng Li, Zhihua Zhou
    Abstract:

    It is usually expected that learning performance can be improved by exploiting unlabeled data, particularly when the amount of labeled data is limited. However, it has been reported that, in some cases, existing semi-supervised learning approaches perform even worse than supervised ones that use only labeled data. For this reason, it is desirable to develop safe semi-supervised learning approaches that will not significantly reduce learning performance when unlabeled data are used. This paper focuses on improving the safeness of semi-supervised support vector machines (S3VMs). First, the S3VM-us approach is proposed. It employs a conservative strategy and uses only the unlabeled instances that are very likely to be helpful, while avoiding the use of highly risky ones. This approach improves safeness, but its performance improvement from unlabeled data is often much smaller than that of S3VMs. In order to develop a safe and well-performing approach, we examine the fundamental assumption of S3VMs, i.e., low-density separation. Based on the observation that multiple good candidate low-density separators may be identified from training data, safe semi-supervised support vector machines (S4VMs) are here proposed. This approach uses multiple low-density separators to approximate the ground-truth decision boundary and maximizes the improvement in performance over inductive SVMs for any candidate separator. Under the assumption employed by S3VMs, it is here shown that S4VMs are provably safe and that the performance improvement using unlabeled data can be maximized. An out-of-sample extension of S4VMs is also presented. This extension allows S4VMs to make predictions on unseen instances. Our empirical study on a broad range of data shows that the overall performance of S4VMs is highly competitive with S3VMs, whereas in contrast to S3VMs, which hurt performance significantly in many cases, S4VMs rarely perform worse than inductive SVMs.
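
The safeness argument can be made concrete with a toy worst-case selection, a brute-force sketch of the idea rather than the paper's actual optimization: treat each candidate low-density labeling as a possible ground truth and pick the labeling of the unlabeled points that maximizes the worst-case accuracy gain over an inductive-SVM baseline. Since choosing the baseline labeling itself yields zero gain against every candidate, the optimal worst-case gain is never negative, which is the "never hurt" intuition. The candidate labelings and baseline below are invented for illustration:

```python
from itertools import product

def worst_case_gain(y, candidates, y_base):
    """Minimum, over candidate ground truths, of acc(y) - acc(y_base)."""
    n = len(y_base)
    return min(
        sum(a == c for a, c in zip(y, cand)) / n
        - sum(b == c for b, c in zip(y_base, cand)) / n
        for cand in candidates
    )

def s4vm_toy(candidates, y_base):
    """Brute-force the labeling that maximizes the worst-case gain."""
    n = len(y_base)
    best = max(product([-1, +1], repeat=n),
               key=lambda y: worst_case_gain(y, candidates, y_base))
    return best, worst_case_gain(best, candidates, y_base)

# two candidate low-density labelings that agree on the first three points
candidates = [(+1, +1, -1, -1, +1), (+1, +1, -1, +1, -1)]
y_base = (-1, +1, -1, -1, -1)       # hypothetical inductive-SVM labels
y_star, gain = s4vm_toy(candidates, y_base)
print(y_star[:3], gain)  # follows the candidates where they agree; gain >= 0
```

Where the candidates agree, the safe labeling follows them; where they conflict, it hedges so that no single candidate can make it look worse than the baseline.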

  • Learning with Augmented Class by Exploiting Unlabeled Data
    National Conference on Artificial Intelligence, 2014
    Co-Authors: Zhihua Zhou
    Abstract:

    In many real-world applications of learning, the environment is open and changes gradually, which requires the learning system to have the ability to detect and adapt to the changes. Class-incremental learning (C-IL) is an important and practical problem in which data from unseen augmented classes arrive, but it has not been well studied in the past. In C-IL, the system should beware of predicting instances from augmented classes as a seen class, and thus faces the challenge that no such instances were observed during the training stage. In this paper, we tackle the challenge by using unlabeled data, which can be cheaply collected in many real-world applications. We propose the LACU framework as well as the LACU-SVM approach to learn the concept of seen classes while incorporating the structure presented in the unlabeled data, so that the misclassification risks among the seen classes as well as between the augmented and the seen classes are minimized simultaneously. Experiments on diverse datasets show the effectiveness of the proposed approach.

  • Towards Making Unlabeled Data Never Hurt
    International Conference on Machine Learning, 2011
    Co-Authors: Zhihua Zhou
    Abstract:

    It is usually expected that, when labeled data are limited, learning performance can be improved by exploiting unlabeled data. In many cases, however, the performance of current semi-supervised learning approaches may be even worse than purely using the limited labeled data. It is therefore desirable to have safe semi-supervised learning approaches that never degrade learning performance by using unlabeled data. In this paper, we focus on semi-supervised support vector machines (S3VMs) and propose S4VMs, i.e., safe S3VMs. Unlike S3VMs, which typically aim at approaching an optimal low-density separator, S4VMs try to exploit the candidate low-density separators simultaneously to reduce the risk of identifying a poor separator with unlabeled data. We describe two implementations of S4VMs, and our comprehensive experiments show that the overall performance of S4VMs is highly competitive with S3VMs, while in contrast to S3VMs, which degrade performance in many cases, S4VMs are never significantly inferior to inductive SVMs.

  • Exploiting Unlabeled Data to Enhance Ensemble Diversity
    arXiv: Learning, 2009
    Co-Authors: Minling Zhang, Zhihua Zhou
    Abstract:

    Ensemble learning aims to improve generalization ability by using multiple base learners. It is well known that to construct a good ensemble, the base learners should be accurate as well as diverse. In this paper, unlabeled data is exploited to facilitate ensemble learning by helping augment the diversity among the base learners. Specifically, a semi-supervised ensemble method named UDEED is proposed. Unlike existing semi-supervised ensemble methods, where error-prone pseudo-labels are estimated for unlabeled data to enlarge the labeled data to improve accuracy, UDEED works by maximizing the accuracy of base learners on labeled data while maximizing the diversity among them on unlabeled data. Experiments show that UDEED can effectively utilize unlabeled data for ensemble learning and is highly competitive with well-established semi-supervised ensemble methods.
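
The accuracy-plus-diversity trade-off can be written down directly. The sketch below assumes a logistic loss on labeled data and a pairwise-product agreement penalty on unlabeled data; the signs follow the description above, but the exact loss and diversity terms in UDEED may differ:

```python
import numpy as np

def udeed_objective(scores_l, y_l, scores_u, lam):
    """scores_l: (m, n_l) real-valued outputs of m base learners on labeled data;
    scores_u: (m, n_u) outputs on unlabeled data.
    Labeled term: mean logistic loss. Diversity term: mean pairwise product
    of outputs on unlabeled data (smaller product = more diverse)."""
    m = scores_l.shape[0]
    labeled = np.mean(np.log1p(np.exp(-y_l * scores_l)))
    pair = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            pair += np.mean(scores_u[i] * scores_u[j])
    pair /= m * (m - 1) / 2
    return labeled + lam * pair

# two learners that agree everywhere vs. two that disagree on unlabeled data
y_l = np.array([+1.0, -1.0])
agree    = udeed_objective(np.array([[2., -2.], [2., -2.]]), y_l,
                           np.array([[1., 1.], [1., 1.]]), lam=0.5)
disagree = udeed_objective(np.array([[2., -2.], [2., -2.]]), y_l,
                           np.array([[1., 1.], [-1., -1.]]), lam=0.5)
print(disagree < agree)  # diversity on unlabeled data lowers the objective
```

With identical labeled fits, the objective prefers the pair of learners that disagree on the unlabeled data, which is exactly the diversity pressure the abstract describes.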

  • Enhancing Relevance Feedback in Image Retrieval Using Unlabeled Data
    ACM Transactions on Information Systems, 2006
    Co-Authors: Zhihua Zhou, Kejia Chen, Hongbin Dai
    Abstract:

    Relevance feedback is an effective scheme bridging the gap between high-level semantics and low-level features in content-based image retrieval (CBIR). In contrast to previous methods, which rely on labeled images provided by the user, this article attempts to enhance the performance of relevance feedback by exploiting unlabeled images existing in the database. Concretely, this article integrates the merits of semisupervised learning and active learning into the relevance feedback process. In detail, in each round of relevance feedback two simple learners are trained from the labeled data, that is, images from the user query and user feedback. Each learner then labels some unlabeled images in the database for the other learner. After retraining with the additional labeled data, the learners reclassify the images in the database and their classifications are merged. Images judged to be positive with high confidence are returned as the retrieval result, while those judged with low confidence are put into the pool used in the next round of relevance feedback. Experiments show that using semisupervised learning and active learning simultaneously in CBIR is beneficial, and the proposed method achieves better performance than some existing methods.

Wanlei Zhou - One of the best experts on this subject based on the ideXlab platform.

  • Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination
    IEEE Transactions on Knowledge and Data Engineering, 2020
    Co-Authors: Tao Zhang, Tianqing Zhu, Mengde Han, Wanlei Zhou
    Abstract:

    A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize a machine-learning concept of fairness and to design frameworks for building fair models, possibly at some sacrifice in accuracy, most are geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether semi-supervised learning might be useful in solving discrimination problems. First, previous studies showed that increasing the size of the training set may lead to a better trade-off between fairness and accuracy. Second, the most powerful models today require an enormous amount of data to train, which, in practical terms, is likely possible from a combination of labeled and unlabeled data. Hence, in this paper, we present a framework for fair semi-supervised learning in the pre-processing phase, including pseudo-labeling to predict labels for unlabeled data, a re-sampling method to obtain multiple fair datasets, and lastly, ensemble learning to improve accuracy and decrease discrimination. A theoretical decomposition analysis of bias, variance, and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning. A set of experiments on real-world and synthetic datasets shows that our method is able to use unlabeled data to achieve a better trade-off between accuracy and discrimination.
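
The pre-processing pipeline above, pseudo-label then re-sample into balanced datasets, can be sketched end to end. This is a simplified illustration: the nearest-centroid pseudo-labeler and the per-(group, label)-cell downsampling below are stand-ins for whatever base classifier and re-sampling scheme the paper actually uses:

```python
import numpy as np

def pseudo_label(x_l, y_l, x_u):
    """Nearest-class-centroid pseudo-labels for unlabeled data (a toy
    stand-in for the pipeline's base classifier)."""
    c0 = x_l[y_l == 0].mean(axis=0)
    c1 = x_l[y_l == 1].mean(axis=0)
    d0 = np.linalg.norm(x_u - c0, axis=1)
    d1 = np.linalg.norm(x_u - c1, axis=1)
    return (d1 < d0).astype(int)

def balanced_resample(x, y, group, rng):
    """Downsample every (group, label) cell to the smallest cell size,
    so each protected group contributes equally to each label."""
    cells = [(g, lab) for g in np.unique(group) for lab in np.unique(y)]
    size = min(((group == g) & (y == lab)).sum() for g, lab in cells)
    idx = np.concatenate([
        rng.choice(np.flatnonzero((group == g) & (y == lab)), size, replace=False)
        for g, lab in cells
    ])
    return x[idx], y[idx], group[idx]

rng = np.random.default_rng(0)
x_l = np.array([[0.0], [0.2], [1.0], [1.2]])   # tiny labeled set
y_l = np.array([0, 0, 1, 1])
x_u = rng.uniform(0, 1.2, size=(40, 1))        # unlabeled pool
y_u = pseudo_label(x_l, y_l, x_u)
group = rng.integers(0, 2, size=40)            # hypothetical protected attribute
xb, yb, gb = balanced_resample(x_u, y_u, group, rng)
counts = {(g, lab): ((gb == g) & (yb == lab)).sum()
          for g in (0, 1) for lab in (0, 1)}
print(counts)  # every (group, label) cell now has the same count
```

Repeating the re-sampling with different random draws yields the "multiple fair datasets" on which an ensemble is then trained.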

  • Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination
    IEEE Transactions on Knowledge and Data Engineering, 2024
    Co-Authors: Tao Zhang, Tianqing Zhu, Mengde Han, Wanlei Zhou
    Abstract:

    A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize a machine-learning concept of fairness and to design frameworks for building fair models, possibly at some sacrifice in accuracy, most are geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether semi-supervised learning might be useful in solving discrimination problems. First, previous studies showed that increasing the size of the training set may lead to a better trade-off between fairness and accuracy. Second, the most powerful models today require an enormous amount of data to train, which, in practical terms, is likely possible from a combination of labeled and unlabeled data. Hence, in this paper, we present a framework for fair semi-supervised learning in the pre-processing phase, including pseudo-labeling to predict labels for unlabeled data, a re-sampling method to obtain multiple fair datasets, and lastly, ensemble learning to improve accuracy and decrease discrimination. A theoretical decomposition analysis of bias, variance, and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning.

Tom M Mitchell - One of the best experts on this subject based on the ideXlab platform.

  • Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach
    Neural Information Processing Systems, 2017
    Co-Authors: Emmanouil Antonios Platanios, Tom M Mitchell, Hoifung Poon, Eric Horvitz
    Abstract:

    We propose an efficient method to estimate the accuracy of classifiers using only unlabeled data. We consider a setting with multiple classification problems where the target classes may be tied together through logical constraints. For example, a set of classes may be mutually exclusive, meaning that a data instance can belong to at most one of them. The proposed method is based on the intuition that: (i) when classifiers agree, they are more likely to be correct, and (ii) when the classifiers make a prediction that violates the constraints, at least one classifier must be making an error. Experiments on four real-world data sets produce accuracy estimates within a few percent of the true accuracy, using solely unlabeled data. Our models also outperform existing state-of-the-art solutions in both estimating accuracies and combining multiple classifier outputs. The results emphasize the utility of logical constraints in estimating accuracy, thus validating our intuition.

  • Estimating Accuracy from Unlabeled Data: A Bayesian Approach
    International Conference on Machine Learning, 2016
    Co-Authors: Emmanouil Antonios Platanios, Avinava Dubey, Tom M Mitchell
    Abstract:

    We consider the question of how unlabeled data can be used to estimate the true accuracy of learned classifiers, and the related question of how outputs from several classifiers performing the same task can be combined based on their estimated accuracies. To answer these questions, we first present a simple graphical model that performs well in practice. We then provide two nonparametric extensions to it that improve its performance. Experiments on two real-world data sets produce accuracy estimates within a few percent of the true accuracy, using solely unlabeled data. Our models also outperform existing state-of-the-art solutions in both estimating accuracies and combining multiple classifier outputs.

  • Estimating Accuracy from Unlabeled Data
    Uncertainty in Artificial Intelligence, 2014
    Co-Authors: Emmanouil Antonios Platanios, Avrim Blum, Tom M Mitchell
    Abstract:

    We consider the question of how unlabeled data can be used to estimate the true accuracy of learned classifiers. This is an important question for any autonomous learning system that must estimate its accuracy without supervision, and also when classifiers trained on one data distribution must be applied to a new distribution (e.g., document classifiers trained on one text corpus are to be applied to a second corpus). We first show how to estimate error rates exactly from unlabeled data when given a collection of competing classifiers that make independent errors, based on the agreement rates between subsets of these classifiers. We further show that even when the competing classifiers do not make independent errors, both their accuracies and error dependencies can be estimated by making certain relaxed assumptions. Experiments on two real-world data sets produce estimates within a few percent of the true accuracy, using solely unlabeled data. These results are of practical significance in situations where labeled data is scarce, and they shed light on the more general question of how the consistency among multiple functions is related to their true accuracies.
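
The exact-identification claim for independent errors has a closed form worth spelling out for the simplest case of three binary classifiers with error rates below 1/2 (the paper handles the general case): the agreement probability of classifiers i and j is (1−e_i)(1−e_j) + e_i e_j, so d_ij = 2·a_ij − 1 factors as (1−2e_i)(1−2e_j), giving (1−2e_1)² = d_12·d_13 / d_23:

```python
import numpy as np

def error_rates_from_agreement(p1, p2, p3):
    """Estimate three error rates from pairwise agreement alone,
    assuming independent errors and all error rates < 1/2."""
    d12 = 2 * np.mean(p1 == p2) - 1
    d13 = 2 * np.mean(p1 == p3) - 1
    d23 = 2 * np.mean(p2 == p3) - 1
    e1 = (1 - np.sqrt(d12 * d13 / d23)) / 2
    e2 = (1 - np.sqrt(d12 * d23 / d13)) / 2
    e3 = (1 - np.sqrt(d13 * d23 / d12)) / 2
    return e1, e2, e3

# simulate: hidden truth, three classifiers flipping labels independently
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 200_000)
flip = lambda e: np.where(rng.random(truth.size) < e, 1 - truth, truth)
preds = [flip(e) for e in (0.10, 0.20, 0.30)]
est = error_rates_from_agreement(*preds)
print(np.round(est, 3))  # close to (0.10, 0.20, 0.30); truth never used
```

Note that the estimator never touches `truth`: only the three prediction vectors enter, which is exactly the "accuracy from unlabeled data" claim.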

  • The Role of Unlabeled Data in Supervised Learning
    2004
    Co-Authors: Tom M Mitchell
    Abstract:

    Most computational models of supervised learning rely only on labeled training examples and ignore the possible role of unlabeled data. This is true both for cognitive science models of learning such as SOAR [Newell 1990] and ACT-R [Anderson, et al. 1995], and for machine learning and data mining algorithms such as decision tree learning and inductive logic programming (see, e.g., [Mitchell 1997]). In this paper we consider the potential role of unlabeled data in supervised learning. We present an algorithm and experimental results demonstrating that unlabeled data can significantly improve learning accuracy in certain practical problems. We then identify the abstract problem structure that enables the algorithm to successfully utilize this unlabeled data, and prove that unlabeled data will boost learning accuracy for problems in this class. The problem class we identify includes problems where the features describing the examples are redundantly sufficient for classifying the example, a notion we make precise in this paper. This problem class includes many natural learning problems faced by humans, such as learning a semantic lexicon over noun phrases in natural language, and learning to recognize objects from multiple sensor inputs. We argue that models of human and animal learning should consider more strongly the potential role of unlabeled data, and that many natural learning problems fit the class we identify.

  • Combining Labeled and Unlabeled Data with Co-Training
    Conference on Learning Theory, 1998
    Co-Authors: Avrim Blum, Tom M Mitchell
    Abstract:

    We consider the problem of using a large unlabeled sample to boost the performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice.
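
The two-view loop can be sketched end to end. Toy centroid learners stand in for the naive Bayes classifiers used in the original web-page experiments, and the views, confidence measure, and pool sizes below are illustrative assumptions:

```python
import numpy as np

def fit_centroids(x, y):
    """Per-class mean vectors (a toy stand-in for a real learner)."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def confidence_and_label(cent, x):
    """Predicted class and a distance-margin confidence for each row."""
    d = np.linalg.norm(x[:, None, :] - cent[None, :, :], axis=2)
    return np.abs(d[:, 0] - d[:, 1]), (d[:, 1] < d[:, 0]).astype(int)

def cotrain(v1, v2, y, seed, rounds=3, per_round=4):
    """Each round, each view labels its most confident unlabeled points
    and adds them to the shared labeled pool."""
    labeled = set(seed)
    y = y.copy()
    for _ in range(rounds):
        for view in (v1, v2):
            idx_l = np.fromiter(labeled, dtype=int)
            idx_u = np.array([i for i in range(len(y)) if i not in labeled])
            if len(idx_u) == 0:
                return y, labeled
            cent = fit_centroids(view[idx_l], y[idx_l])
            conf, lab = confidence_and_label(cent, view[idx_u])
            order = np.argsort(-conf)[:per_round]
            take = idx_u[order]
            y[take] = lab[order]          # pseudo-labels from this view
            labeled |= set(take.tolist())
    return y, labeled

n = 60
truth = np.array([0, 1] * (n // 2))
rng = np.random.default_rng(0)
v1 = truth[:, None] + rng.normal(0, 0.3, (n, 2))  # view 1: "words on page"
v2 = truth[:, None] + rng.normal(0, 0.3, (n, 2))  # view 2: "hyperlink words"
y = truth.copy()            # only the entries in `seed` are treated as known
seed = [0, 1, 2, 3]         # tiny labeled seed set
y_out, pool = cotrain(v1, v2, v1=v1 if False else v1, seed=seed) if False else cotrain(v1, v2, y, seed)
print(len(pool))  # the labeled pool grew beyond the 4 seed examples
```

The key structural assumption mirrors the abstract: each view alone is informative enough to produce trustworthy labels for its most confident points, so the two learners can teach each other.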