Large Scale Data

The experts below are selected from a list of 360 experts worldwide, ranked by the ideXlab platform.

Feiping Nie - One of the best experts on this subject based on the ideXlab platform.

  • Fast Spectral Clustering Learning with Hierarchical Bipartite Graph for Large-Scale Data
    Pattern Recognition Letters, 2020
    Co-Authors: Xiaojun Yang, Rong Wang, Guohao Zhang, Feiping Nie
    Abstract:

    Spectral clustering (SC) is drawing increasing attention due to its effectiveness in unsupervised learning. However, conventional SC methods still have two limitations. First, they are not suitable for large-scale problems because of their high computational complexity. Second, the neighborhood weighted graph is constructed with a Gaussian kernel, so extra work is required to tune the heat-kernel parameter. To overcome these issues, we propose a novel spectral clustering based on hierarchical bipartite graph (SCHBG) approach that explores multiple layers of anchors arranged in a pyramid-style structure. The algorithm first constructs a hierarchical bipartite graph and then performs spectral analysis on it, which greatly reduces the computational complexity. Furthermore, we adopt a parameter-free yet effective neighbor assignment strategy to construct the similarity matrix, which avoids the need to tune the heat-kernel parameter. Finally, the algorithm can handle the out-of-sample problem for large-scale data at significantly reduced computational cost. Experiments demonstrate the efficiency and effectiveness of the proposed SCHBG algorithm: it achieves good clustering accuracy (76%) on a dataset of 8 million samples, and, owing to the bipartite graph, it reduces the time cost in out-of-sample situations while retaining almost the same clustering accuracy as on the full data.
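
    As a rough illustration of the anchor-based bipartite-graph idea that SCHBG builds on, the sketch below implements a single anchor layer rather than the paper's full pyramid hierarchy: anchors are chosen as k-means centroids, each point is connected to its k nearest anchors with parameter-free closed-form weights, and the spectral embedding comes from an SVD of the degree-normalized bipartite matrix. Function names, the anchor count, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

      # Minimal single-layer sketch of bipartite-graph spectral clustering
      # (SCHBG uses a multi-layer anchor hierarchy; this shows the base idea).
      import numpy as np
      from sklearn.cluster import KMeans

      def bipartite_spectral_clustering(X, n_clusters, n_anchors=200, k=5):
          n = X.shape[0]
          # 1. Pick anchors, e.g. as k-means centroids of the data.
          anchors = KMeans(n_clusters=n_anchors, n_init=3).fit(X).cluster_centers_

          # 2. Parameter-free neighbor assignment: each point is linked to its
          #    k nearest anchors with closed-form simplex weights, so no
          #    heat-kernel parameter has to be tuned.
          d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n, m)
          idx = np.argsort(d2, axis=1)[:, :k + 1]
          Z = np.zeros((n, n_anchors))
          for i in range(n):
              dk = d2[i, idx[i]]                      # k+1 smallest sq. distances
              w = (dk[k] - dk[:k]) / (k * dk[k] - dk[:k].sum() + 1e-12)
              Z[i, idx[i, :k]] = w                    # rows of Z sum to 1

          # 3. Spectral analysis on the bipartite graph: left singular vectors
          #    of the degree-normalized Z give the embedding in O(n m^2) time.
          Z_hat = Z / np.sqrt(Z.sum(axis=0, keepdims=True) + 1e-12)
          U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)
          emb = U[:, :n_clusters]
          emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12

          # 4. k-means on the spectral embedding yields the cluster labels.
          return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)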

  • Fast Semisupervised Learning with Bipartite Graph for Large-Scale Data
    IEEE Transactions on Neural Networks and Learning Systems, 2020
    Co-Authors: Feiping Nie, Rong Wang, Weimin Jia
    Abstract:

    Because labeled information in the real world is scarce and labeling samples is time-consuming and expensive, semisupervised learning (SSL) has important applications in computer vision and machine learning. Among SSL approaches, graph-based SSL (GSSL) models have recently attracted much attention for their high accuracy. However, for most traditional GSSL methods, large-scale data bring high computational complexity, which demands a more powerful computing platform. To address these issues, we propose a novel approach, the bipartite GSSL normalized (BGSSL-normalized) method. The method consists of three parts. First, a bipartite graph between the original data and a set of anchor points is constructed; it is parameter-insensitive, scale-invariant, naturally sparse, and simple to build. Then, the labels of the original data and the anchors are inferred through the graph. Finally, we extend our algorithm to handle out-of-sample data for large-scale problems using the inferred anchor labels, which retains good classification performance while saving a large amount of time. The computational complexity of BGSSL-normalized is reduced to $O(ndm+nm^{2})$, a significant improvement over traditional GSSL methods that require $O(n^{2}d+n^{3})$, where $n$, $d$, and $m$ are the numbers of samples, features, and anchors, respectively. Experimental results on several publicly available data sets demonstrate that our approach achieves better classification accuracy at lower time cost.
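
    To make the complexity argument concrete, here is a minimal sketch of the anchor-label idea in the spirit of BGSSL-normalized: labels live on the $m \ll n$ anchors, the $n \times m$ bipartite graph propagates them, and out-of-sample points are classified from their anchor affinities alone. The ridge-style closed-form solve below is a simplification of the paper's normalized formulation, and all names are illustrative assumptions.

      # Sketch: semi-supervised labels via an n x m bipartite (anchor) graph.
      # The m x m solve replaces the n x n graph inversion of classical GSSL,
      # so the cost stays O(l m^2 + m^3) instead of O(n^3).
      import numpy as np

      def fit_anchor_labels(Z_labeled, Y_labeled, gamma=0.1):
          """Infer soft labels A on the anchors from the labeled points.
          Z_labeled: (l, m) bipartite weights; Y_labeled: (l, c) one-hot."""
          m = Z_labeled.shape[1]
          G = Z_labeled.T @ Z_labeled + gamma * np.eye(m)     # (m, m) system
          return np.linalg.solve(G, Z_labeled.T @ Y_labeled)  # anchor labels A

      def predict(Z, A):
          """Classify points from their bipartite weights. An out-of-sample
          point needs only its m anchor affinities, never the n training rows."""
          return np.argmax(Z @ A, axis=1)

    Building the bipartite graph costs $O(ndm)$ (distances from $n$ points to $m$ anchors) and the solve and propagation add at most $O(nm^{2})$-type terms, matching the overall $O(ndm+nm^{2})$ scaling quoted above; a new point is labeled in $O(dm)$ time.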

Sherif Sakr - One of the best experts on this subject based on the ideXlab platform.

  • The Family of MapReduce and Large Scale Data Processing Systems
    arXiv: Databases, 2013
    Co-Authors: Sherif Sakr, Anna Liu, Ayman G Fayoumi
    Abstract:

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been built on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of MapReduce but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
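
    Since the survey is organized around the programming model itself, a toy, single-process illustration may help: in MapReduce the user supplies only a map and a reduce function, while the framework owns partitioning, shuffling, and (on a real cluster) scheduling and fault tolerance. The sketch below is a hypothetical illustration of the model, not the API of Hadoop or any specific system.

      # Toy in-memory MapReduce: the mapper emits (key, value) pairs, the
      # framework groups them by key (the "shuffle"), the reducer folds each group.
      from collections import defaultdict

      def map_reduce(records, mapper, reducer):
          groups = defaultdict(list)
          for record in records:
              for key, value in mapper(record):
                  groups[key].append(value)          # shuffle: group by key
          return {k: reducer(k, vs) for k, vs in groups.items()}

      # Classic word count expressed in the model.
      docs = ["large scale data", "data processing at scale"]
      counts = map_reduce(
          docs,
          mapper=lambda line: [(w, 1) for w in line.split()],
          reducer=lambda key, values: sum(values),
      )
      print(counts)  # {'large': 1, 'scale': 2, 'data': 2, 'processing': 1, 'at': 1}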

  • A Survey of Large Scale Data Management Approaches in Cloud Environments
    IEEE Communications Surveys & Tutorials, 2011
    Co-Authors: Sherif Sakr, Anna Liu, Daniel Macedo Batista, Mohammad Alomari
    Abstract:

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Moreover, recent advances in Web technology have made it easy for any user to provide and consume content of any form. This has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. Cloud computing is associated with a new paradigm for the provision of computing infrastructure: it shifts the location of this infrastructure to the network in order to reduce the costs of managing hardware and software resources. This paper gives a comprehensive survey of numerous approaches and mechanisms for deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both the research and industrial communities. We analyze the various design decisions of each approach and its suitability for certain classes of applications and end-users. We also discuss open issues and future challenges pertaining to scalability, consistency, and the economical processing of large-scale data in the cloud, and we highlight the characteristics of the application classes best suited to cloud deployment.

Mohammad Alomari - One of the best experts on this subject based on the ideXlab platform.

  • A Survey of Large Scale Data Management Approaches in Cloud Environments
    IEEE Communications Surveys & Tutorials, 2011
    Co-Authors: Sherif Sakr, Anna Liu, Daniel Macedo Batista, Mohammad Alomari
    Abstract:

    (Abstract identical to the entry of the same title under Sherif Sakr above.)

Anna Liu - One of the best experts on this subject based on the ideXlab platform.

  • The Family of MapReduce and Large Scale Data Processing Systems
    arXiv: Databases, 2013
    Co-Authors: Sherif Sakr, Anna Liu, Ayman G Fayoumi
    Abstract:

    (Abstract identical to the entry of the same title under Sherif Sakr above.)

  • A Survey of Large Scale Data Management Approaches in Cloud Environments
    IEEE Communications Surveys & Tutorials, 2011
    Co-Authors: Sherif Sakr, Anna Liu, Daniel Macedo Batista, Mohammad Alomari
    Abstract:

    (Abstract identical to the entry of the same title under Sherif Sakr above.)

Rong Wang - One of the best experts on this subject based on the ideXlab platform.

  • Fast Spectral Clustering Learning with Hierarchical Bipartite Graph for Large-Scale Data
    Pattern Recognition Letters, 2020
    Co-Authors: Xiaojun Yang, Rong Wang, Guohao Zhang, Feiping Nie
    Abstract:

    (Abstract identical to the entry of the same title under Feiping Nie above.)

  • Fast Semisupervised Learning with Bipartite Graph for Large-Scale Data
    IEEE Transactions on Neural Networks and Learning Systems, 2020
    Co-Authors: Feiping Nie, Rong Wang, Weimin Jia
    Abstract:

    (Abstract identical to the entry of the same title under Feiping Nie above.)