Data Partitioning

The Experts below are selected from a list of 93,579 Experts worldwide, ranked by the ideXlab platform

Hojun Song - One of the best experts on this subject based on the ideXlab platform.

  • Searching for the optimal Data Partitioning strategy in mitochondrial phylogenomics: a phylogeny of Acridoidea (Insecta: Orthoptera: Caelifera) as a case study
    Molecular Phylogenetics and Evolution, 2013
    Co-Authors: James R Leavitt, Kevin D Hiatt, Michael F Whiting, Hojun Song
    Abstract:

    One of the main challenges in analyzing multi-locus phylogenomic Data is to find an optimal Data Partitioning strategy that accounts for the variable evolutionary histories of different loci in a given Dataset. Although a number of studies have addressed the issue of Data Partitioning in a Bayesian phylogenetic framework, such studies in a maximum likelihood framework are comparatively lacking. Furthermore, a rigorous statistical exploration of possible Data Partitioning schemes has not been applied to mitochondrial genome (mtgenome) Data, which provide a complex but manageable platform for addressing various challenges in analyzing phylogenomic Data. In this study, we investigate the issue of Data Partitioning in the maximum likelihood framework in the context of the mitochondrial phylogenomics of the orthopteran superfamily Acridoidea (Orthoptera: Caelifera). The present study analyzes 34 terminals representing all 8 superfamilies within Caelifera, including newly sequenced partial or complete mtgenomes for 11 families. Using a new partition-selection method implemented in the software PartitionFinder, we compare a large number of Data Partitioning schemes in an attempt to identify the most effective method of analyzing the mtgenome Data. We find that the best-fit Partitioning scheme selected by PartitionFinder is superior to any of the a priori schemes commonly utilized in mitochondrial phylogenomics. We also show that over-Partitioning is often detrimental to phylogenetic reconstruction. A comparative analysis of mtgenome structures finds that the tRNA gene rearrangement between cytochrome c oxidase subunit II and ATP synthase protein 8 does not occur in the most basal caeliferan lineage Tridactyloidea, suggesting that this gene rearrangement must have evolved at least in the common ancestor of Tetrigoidea and Acridomorpha. We find that mtgenome Data contain sufficient phylogenetic information to broadly resolve the relationships across Acridomorpha and Acridoidea.
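
    The kind of comparison a partition-selection tool performs can be illustrated with a toy information-criterion calculation. The scheme names, log-likelihoods, and parameter counts below are invented for illustration; PartitionFinder itself searches scheme space heuristically rather than scoring a fixed list.

```python
import math

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)

def score_scheme(subsets, n_sites):
    """Score a candidate Partitioning scheme by summing per-subset
    log-likelihoods and parameter counts into a single BIC value."""
    total_lnl = sum(lnl for lnl, _ in subsets)
    total_params = sum(p for _, p in subsets)
    return bic(total_lnl, total_params, n_sites)

# Toy schemes for a 10,000-site alignment: (log-likelihood, free parameters)
n_sites = 10_000
schemes = {
    "unpartitioned":    [(-52_000.0, 10)],
    "by_gene":          [(-25_600.0, 10), (-25_900.0, 10)],
    "over_partitioned": [(-2_572.5, 10)] * 20,  # tiny lnl gain, 200 params
}
best = min(schemes, key=lambda name: score_scheme(schemes[name], n_sites))
```

    Here the modest likelihood gain of the over-split scheme is outweighed by its parameter penalty, mirroring the finding above that over-Partitioning can be detrimental.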

Mihaela Cocea - One of the best experts on this subject based on the ideXlab platform.

  • Subclass-based semi-random Data Partitioning for improving sample representativeness
    Information Sciences, 2019
    Co-Authors: Han Liu, Shyi-ming Chen, Mihaela Cocea
    Abstract:

    In machine learning tasks, it is essential for a Data set to be partitioned into a training set and a test set in a specific ratio. In this context, the training set is used for learning a model that makes predictions on new instances, whereas the test set is used for evaluating the prediction accuracy of the model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, Data Partitioning has typically been done by randomly selecting 70% of the instances for training and the rest for testing. In this paper, we argue that random Data Partitioning is likely to result in the sample representativeness issue, i.e., training and test instances showing very dissimilar characteristics, leading to a situation similar to testing students on material that was not taught. To address this issue, we propose a subclass-based semi-random Data Partitioning approach. The experimental results show that the proposed Data Partitioning approach leads to significant advances in learning performance due to the improvement of sample representativeness.
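
    A minimal sketch of the semi-random idea, stratifying on given labels: instances are chosen randomly within each class, but every class contributes the same fixed proportion to the training set. The paper goes further and stratifies on automatically derived subclasses within each class; the function name and signature here are illustrative.

```python
import random
from collections import defaultdict

def semi_random_split(labels, train_ratio=0.7, seed=0):
    """Semi-random Data Partitioning: random *within* each class,
    fixed proportions *across* classes, which preserves sample
    representativeness better than a fully random split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_ratio)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

labels = ["a"] * 10 + ["b"] * 10
train, test = semi_random_split(labels)  # each class contributes 70%
```

    A fully random 70/30 split of the same Data could, by chance, place most of class "b" in the test set; the per-class cut rules that out.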

  • Multi-granularity Semi-random Data Partitioning
    Studies in Big Data, 2017
    Co-Authors: Han Liu, Mihaela Cocea
    Abstract:

    In this chapter, we introduce the concepts of semi-heuristic Data Partitioning, and present a proposed multi-granularity framework for semi-heuristic Data Partitioning. We also discuss the advantages of the proposed framework in terms of dealing with class imbalance and the sample representativeness issue, from granular computing perspectives.

Alexey Lastovetsky - One of the best experts on this subject based on the ideXlab platform.

  • Data Partitioning on Multicore and Multi-GPU Platforms Using Functional Performance Models
    IEEE Transactions on Computers, 2015
    Co-Authors: Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
    Abstract:

    Heterogeneous multiprocessor systems, which are composed of a mix of processing elements such as commodity multicore processors, graphics processing units (GPUs), and others, have been widely used in the scientific computing community. Software applications incorporate code designed and optimized for different types of processing elements in order to exploit the computing power of such heterogeneous computing systems. In this paper, we consider the problem of optimally distributing the workload of Data-parallel scientific applications between the processing elements of such heterogeneous computing systems. We present a solution that uses functional performance models (FPMs) of the processing elements and FPM-based Data Partitioning algorithms. The efficiency of this approach is demonstrated by experiments with parallel matrix multiplication and numerical simulation of lid-driven cavity flow on hybrid servers and clusters.
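
    A simplified sketch of the FPM idea: each device's performance is a function of problem size (here represented by a time function), and the workload is split so that the predicted execution times are equalized. The bisection-on-deadline formulation and the toy speed numbers are illustrative assumptions, not the authors' actual algorithm.

```python
def fpm_partition(total, time_funcs, iters=60):
    """Split `total` work units among devices so that the predicted
    times time_funcs[i](x_i) are (nearly) equal; each time function
    must be nondecreasing in the amount of work x."""
    def units_within(tf, deadline):
        # largest x in [0, total] with tf(x) <= deadline (binary search)
        lo, hi = 0, total
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if tf(mid) <= deadline:
                lo = mid
            else:
                hi = mid - 1
        return lo

    # bisect on a common deadline T until the devices absorb `total`
    t_lo, t_hi = 0.0, max(tf(total) for tf in time_funcs)
    for _ in range(iters):
        t_mid = (t_lo + t_hi) / 2
        if sum(units_within(tf, t_mid) for tf in time_funcs) >= total:
            t_hi = t_mid
        else:
            t_lo = t_mid
    parts = [units_within(tf, t_hi) for tf in time_funcs]
    # trim any rounding surplus so the parts sum exactly to `total`
    surplus = sum(parts) - total
    for i in range(len(parts)):
        take = min(surplus, parts[i])
        parts[i] -= take
        surplus -= take
    return parts

# Two devices: one processing 10 units/s and one processing 5 units/s
parts = fpm_partition(300, [lambda x: x / 10, lambda x: x / 5])
```

    With linear time functions this reduces to a simple speed-proportional split; the value of the functional model is that it stays correct when speed varies with problem size (e.g. cache effects, GPU memory limits).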

  • Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
    International Conference on Cluster Computing, 2012
    Co-Authors: Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
    Abstract:

    The transition to hybrid CPU/GPU platforms in high performance computing is challenging with respect to the efficient utilisation of heterogeneous hardware and existing optimised software. In recent years, scientific software has been ported to multicore and GPU architectures, and it should now be reused on hybrid platforms. In this paper, we model the performance of such scientific applications in order to execute them efficiently on hybrid platforms. We consider a hybrid platform as a heterogeneous distributed-memory system and apply the approach of functional performance models, which was originally designed for uniprocessor machines. The functional performance model (FPM) represents the processor speed as a function of problem size and integrates many important features characterising the performance of the architecture and the application. We demonstrate that FPMs facilitate the performance evaluation of scientific applications on hybrid platforms. FPM-based Data Partitioning algorithms have proved accurate for load balancing on heterogeneous networks of uniprocessor computers. We apply FPM-based Data Partitioning to balance the load between cores and GPUs in the hybrid architecture. In our experiments with parallel matrix multiplication, we couple existing software optimised for multicores and GPUs and achieve high performance on the whole hybrid system.

  • Data Partitioning on Heterogeneous Multicore Platforms
    2011 IEEE International Conference on Cluster Computing, 2011
    Co-Authors: Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
    Abstract:

    In this paper, we present two techniques for inter- and intra-node Data Partitioning aimed at load balancing MPI applications on heterogeneous multicore platforms. For load balancing between the multicore nodes of a heterogeneous multicore cluster, we show how to define a functional performance model of an individual multicore node as a single computing unit, and we use these models for Data Partitioning between the nodes. For load balancing within a heterogeneous multicore node, we propose a Data Partitioning technique between cores. Since parallel processes interfere with each other through shared memory, the speed of individual cores cannot be measured independently, and independent performance models cannot be defined for the cores. Therefore, for a given problem size, we dynamically evaluate the performance of the cores while they are executing only the computational kernel of the parallel application, and we partition the Data proportionally to the observed speeds.
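
    The intra-node scheme described above can be sketched as two steps: benchmark the kernel on each core at the given problem size, then split the Data proportionally to the observed speeds. The probe kernel and the largest-remainder rounding are illustrative assumptions.

```python
import time

def measure_speed(kernel, probe_size):
    """Run the computational kernel on a probe of `probe_size` units
    and return the observed speed in units per second."""
    start = time.perf_counter()
    kernel(probe_size)
    return probe_size / (time.perf_counter() - start)

def proportional_partition(total, speeds):
    """Split `total` units proportionally to observed speeds, with
    largest-remainder rounding so the parts sum exactly to `total`."""
    raw = [total * s / sum(speeds) for s in speeds]
    parts = [int(r) for r in raw]
    leftovers = sorted(range(len(raw)),
                       key=lambda i: raw[i] - parts[i], reverse=True)
    for i in leftovers[: total - sum(parts)]:
        parts[i] += 1
    return parts

# e.g. one core observed running the kernel at 3x the speed of another:
parts = proportional_partition(100, [3.0, 1.0])
```

    Measuring all cores while they run the kernel concurrently, as the paper does, captures the shared-memory interference that independent per-core benchmarks would miss.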

Xiao Song - One of the best experts on this subject based on the ideXlab platform.

  • Data Partitioning method based on picture content
    Journal of Xidian University, 2006
    Co-Authors: Du Jian-chao, Xiao Song
    Abstract:

    By analyzing the Data Partitioning tool in H.264, a novel Data Partitioning method based on the content of a picture is presented. It regroups video bitstreams into three separate sub-bitstreams according to importance: header information; Data of the intra-macroblocks and part of the inter-macroblocks; and Data of the remainder of the inter-macroblocks. The two parts of the inter-macroblocks belonging to different sub-bitstreams are differentiated according to their impact on the quality of the picture, and their number is decided by an optimization algorithm. Simulation results show that, compared with the Data Partitioning tool in H.264, the proposed method is more adaptive to changes in picture content and network conditions. Together with unequal error protection, it can improve the quality of video streaming to a great extent.
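
    The regrouping step might be sketched as follows. The record format, field names, and threshold mechanics are invented for illustration; real H.264 Data Partitioning operates on slice syntax elements, not per-element dictionaries.

```python
def partition_bitstream(elements, inter_threshold):
    """Regroup a coded picture into three sub-bitstreams by importance:
    A: header information; B: intra-macroblock Data plus inter-macroblocks
    whose impact on picture quality reaches a content-adaptive threshold;
    C: the remaining inter-macroblocks."""
    part_a, part_b, part_c = [], [], []
    for e in elements:
        if e["kind"] == "header":
            part_a.append(e)
        elif e["kind"] == "intra" or e["impact"] >= inter_threshold:
            part_b.append(e)
        else:
            part_c.append(e)
    return part_a, part_b, part_c

# Toy coded elements; `impact` stands in for the per-macroblock measure
elements = [
    {"kind": "header", "impact": None},
    {"kind": "intra",  "impact": None},
    {"kind": "inter",  "impact": 0.9},
    {"kind": "inter",  "impact": 0.2},
]
a, b, c = partition_bitstream(elements, inter_threshold=0.5)
```

    Raising or lowering `inter_threshold` moves inter-macroblocks between B and C, which is how the split can adapt to picture content and network conditions.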

  • Robust Video Communications Based on Data Partitioning
    2006
    Co-Authors: Xiao Song
    Abstract:

    To prevent error propagation when transmitting video streaming over wireless networks, this paper presents a novel Data Partitioning method based on motion estimation. The macroblock is treated as the basic unit in this method, and all macroblocks within a coded frame are differentiated into different levels of importance according to their impact factor, which is defined as a combination of two types of impact: the impact of a macroblock on the quality of the next frame, and its impact on the quality of the current frame. The first impact is evaluated by counting the total number of times the pixels within the macroblock are referenced; the second is evaluated by calculating the difference between the original and predicted macroblock. Then, combined with header information, all macroblocks of a coded frame are partitioned into three types of partitions. Finally, different rates of FEC channel code are applied to different partitions so that the important Data can be protected better. While improving the quality of video like other Data Partitioning methods, an advantage of the proposed method is that it balances the quality of the current and the next frame. Compared with other Data Partitioning methods, such as those in H.263++ and H.264, the proposed method can efficiently limit errors and prevent error propagation. Experimental results show that the proposed method obtains stable quality over the entire video sequence and achieves better PSNR with the two-state Markov channel model, with gains of more than 0.3 dB.
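
    A sketch of the ranking step under stated assumptions: the equal weighting of the two impacts and the equal-size cut into three partitions are illustrative choices, not the paper's exact definitions.

```python
def impact_factor(ref_count, residual, w=0.5):
    """Combine the two impacts described above: how many times the
    macroblock's pixels are referenced by the next frame (ref_count)
    and the difference between the original and predicted macroblock
    (residual). The weight w is an illustrative assumption."""
    return w * ref_count + (1 - w) * residual

def partition_macroblocks(mbs, n_partitions=3):
    """Rank macroblocks by impact factor and cut the ranking into
    n_partitions importance classes; stronger FEC rates would then be
    assigned to the more important classes."""
    ranked = sorted(mbs,
                    key=lambda m: impact_factor(m["refs"], m["residual"]),
                    reverse=True)
    size = -(-len(ranked) // n_partitions)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

mbs = [{"refs": r, "residual": r} for r in (10, 8, 6, 4, 2, 0)]
classes = partition_macroblocks(mbs)
```

    Because `ref_count` reflects the next frame and `residual` the current one, a single ranking over the combined factor is what lets the method balance the quality of both frames.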

Fusheng Wang - One of the best experts on this subject based on the ideXlab platform.

  • Effective Spatial Data Partitioning for Scalable Query Processing.
    arXiv: Databases, 2015
    Co-Authors: Ablimit Aji, Fusheng Wang
    Abstract:

    Recently, MapReduce-based spatial query systems have emerged as a cost-effective and scalable solution for large-scale spatial Data processing and analytics. MapReduce-based systems achieve massive scalability by Partitioning the Data and running query tasks on those partitions in parallel. Effective Data Partitioning is therefore critical for task parallelization and load balancing, and it directly affects system performance. However, several pitfalls of spatial Data Partitioning make this task particularly challenging. First, Data skew is very common in spatial applications; to achieve the best query performance, Data skew needs to be reduced. Second, spatial Partitioning approaches generate boundary objects that cross multiple partitions and add extra query processing overhead; consequently, boundary objects need to be minimized. Third, the high computational complexity of spatial Partitioning algorithms, combined with massive amounts of Data, requires an efficient approach to Partitioning to achieve overall fast query response. In this paper, we provide a systematic evaluation of multiple spatial Partitioning methods with a set of different Partitioning strategies, and we study their implications for the performance of MapReduce-based spatial queries. We also study sampling-based Partitioning methods and their impact on queries, and we propose several MapReduce-based high-performance spatial Partitioning methods. The main objective of our work is to provide comprehensive guidance for optimal spatial Data Partitioning to support scalable and fast spatial Data processing in massively parallel Data processing frameworks such as MapReduce. The algorithms developed in this work are open source and can be easily integrated into different high-performance spatial Data processing systems.
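
    The skew and boundary-object trade-offs can be made concrete with a toy uniform-grid partitioner for axis-aligned rectangles. Replicating a boundary object into every cell it touches is one common convention; the grid shape and skew metric here are illustrative assumptions, not any of the paper's methods.

```python
def grid_partition(rects, nx, ny, extent):
    """Assign rectangles (x1, y1, x2, y2) to cells of an nx-by-ny grid
    over `extent`; an object intersecting several cells is replicated
    into each of them and counted as a boundary object."""
    ex1, ey1, ex2, ey2 = extent
    cw, ch = (ex2 - ex1) / nx, (ey2 - ey1) / ny
    cells, boundary = {}, 0
    for r in rects:
        x1, y1, x2, y2 = r
        ix1 = min(int((x1 - ex1) / cw), nx - 1)
        ix2 = min(int((x2 - ex1) / cw), nx - 1)
        iy1 = min(int((y1 - ey1) / ch), ny - 1)
        iy2 = min(int((y2 - ey1) / ch), ny - 1)
        if (ix1, iy1) != (ix2, iy2):
            boundary += 1
        for ix in range(ix1, ix2 + 1):
            for iy in range(iy1, iy2 + 1):
                cells.setdefault((ix, iy), []).append(r)
    return cells, boundary

def skew(cells):
    """Max/mean cell load: 1.0 means perfectly balanced."""
    sizes = [len(v) for v in cells.values()]
    return max(sizes) / (sum(sizes) / len(sizes))

rects = [(1, 1, 2, 2), (6, 6, 7, 7), (4, 4, 6, 6)]
cells, boundary = grid_partition(rects, 2, 2, (0, 0, 10, 10))
```

    A finer grid reduces skew but turns more objects into boundary objects, which is exactly the tension a systematic evaluation of Partitioning strategies has to navigate.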

  • SATO: A Spatial Data Partitioning Framework for Scalable Query Processing
    Advances in Geographic Information Systems, 2014
    Co-Authors: Ablimit Aji, Fusheng Wang
    Abstract:

    Scalable spatial query processing relies on effective spatial Data Partitioning for query parallelization, Data pruning, and load balancing. These are often challenged by the intrinsic characteristics of spatial Data, such as high skew in the Data distribution and the high complexity of irregular multi-dimensional objects. In this demo, we present SATO, a spatial Data Partitioning framework that can quickly analyze and partition spatial Data with an optimal spatial Partitioning strategy for scalable query processing. SATO works in the following steps: 1) Sample, which samples a small fraction of the input Data for analysis; 2) Analyze, which quickly analyzes the sampled Data to find an optimal partition strategy; 3) Tear, which provides Data-skew-aware Partitioning and supports MapReduce-based scalable Partitioning; and 4) Optimize, which collects succinct partition statistics for potential query optimization. SATO also provides multi-level Partitioning, which can be used to significantly improve window-based queries in cloud-based spatial query processing systems. SATO comes with a visualization component that provides heat maps and histograms for qualitative evaluation. SATO has been implemented within Hadoop-GIS, a high-performance spatial Data warehousing system over MapReduce. SATO is also released as an independent software package to support various scalable spatial query processing systems. Our experiments demonstrate that SATO generates much more balanced Partitioning, which can significantly improve spatial query performance with MapReduce compared to traditional spatial Partitioning approaches.
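
    The four steps can be sketched end to end. The candidate partitioners, the skew score used in the Analyze step, and the collected statistics are all simplifying assumptions, not SATO's actual implementation.

```python
import random

def sato_sketch(objects, candidates, sample_frac=0.3, seed=0):
    """Sample a fraction of the Data, Analyze the candidate
    partitioners on the sample (picking the least-skewed split), Tear
    the full Dataset with the winner, and Optimize by collecting
    succinct per-partition statistics."""
    rng = random.Random(seed)
    # 1) Sample
    sample = [o for o in objects if rng.random() < sample_frac]
    # 2) Analyze: skew = max partition size / mean partition size
    def skew(partitioner):
        sizes = [len(p) for p in partitioner(sample)]
        return max(sizes) / (sum(sizes) / len(sizes))
    best = min(candidates, key=skew)
    # 3) Tear
    partitions = best(objects)
    # 4) Optimize: statistics kept for later query optimization
    stats = [{"count": len(p)} for p in partitions]
    return partitions, stats

even = lambda data: [[x for x in data if x % 4 == r] for r in range(4)]
skewed = lambda data: [data, [], [], []]
partitions, stats = sato_sketch(list(range(1000)), [even, skewed])
```

    Analyzing only the sample is what keeps strategy selection cheap: the expensive full-Data pass happens once, with the already-chosen partitioner.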