Large Datasets

The Experts below are selected from a list of 71,124 Experts worldwide ranked by the ideXlab platform

Andrew Rice - One of the best experts on this subject based on the ideXlab platform.

  • MASCOTS - Picky: Efficient and Reproducible Sharing of Large Datasets Using Merkle-Trees
    2016 IEEE 24th International Symposium on Modeling Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016
    Co-Authors: Daniel Hintze, Andrew Rice
    Abstract:

    There is growing demand for researchers to share Datasets in order to allow others to reproduce results or investigate new questions. The most common option is simply to deposit the data online in its entirety. However, this mechanism of distribution becomes impractical as the size of the dataset increases or if the dataset changes frequently as new data is collected. In this paper we describe Picky, a new Merkle-tree-based system for sharing Large Datasets that allows users to download selected portions and to receive incremental updates. We demonstrate the viability of our approach by quantifying its benefit when applied to a number of Large Datasets used in the networking and measurement community.
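    The selective-download idea above can be pictured with a small Merkle-tree sketch: the dataset is hashed in fixed-size chunks, the chunk hashes form the tree's leaves, and a client who fetches only one chunk can still verify it against the published root hash using a short inclusion proof. This is a minimal illustration of the general technique, not Picky's actual implementation; the chunk size, file name, and helper names are assumptions.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB chunks; the chunk size is an assumption, not Picky's value


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def chunk_hashes(path: str):
    """Hash the dataset in fixed-size chunks; these are the tree's leaves."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            hashes.append(_h(block))
    return hashes


def build_tree(leaves):
    """Build the Merkle tree bottom-up; returns every level, root level last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else prev[i]))
                       for i in range(0, len(prev), 2)])
    return levels


def proof(levels, index):
    """Sibling hashes needed to verify one leaf against the root."""
    path = []
    for level in levels[:-1]:
        sib = index ^ 1
        path.append(level[sib] if sib < len(level) else level[index])
        index //= 2
    return path


def verify(leaf_hash, index, path, root):
    """Recompute the root from a single chunk hash plus its inclusion proof."""
    h = leaf_hash
    for sib in path:
        h = _h(h + sib) if index % 2 == 0 else _h(sib + h)
        index //= 2
    return h == root


# Example (hypothetical file): check chunk 3 without downloading the rest.
# leaves = chunk_hashes("measurements.bin")
# levels = build_tree(leaves)
# root = levels[-1][0]
# assert verify(leaves[3], 3, proof(levels, 3), root)
```

    Incremental updates follow the same logic: when a chunk changes, only the hashes on its path to the root change, so a client can detect exactly which chunks to re-download.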

Jim Dowling - One of the best experts on this subject based on the ideXlab platform.

  • Dela — Sharing Large Datasets between Hadoop Clusters
    2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017
    Co-Authors: Alexandru A. Ormenişan, Jim Dowling
    Abstract:

    Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state of the art for storing Large Datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing 'Big Data'. Existing Large-scale storage platforms, however, lack support for the efficient sharing of Large Datasets over the Internet. Systems widely used for the dissemination of Large files, like BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for Large-scale storage backends and for data transfers that are non-intrusive to existing TCP network traffic while providing higher network throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer, implementing two alternative ways for clients to access shared data: stream processing of data as it arrives with Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of Large Datasets.
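    The two access modes mentioned in the abstract (streaming with Kafka versus offline reads from HDFS) can be pictured as implementations of one pluggable access interface. The sketch below is illustrative only: the class and method names are assumptions, not Dela's actual API, and a local file stands in for HDFS so the example stays self-contained.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class DatasetAccess(ABC):
    """Hypothetical pluggable access layer: one shared dataset, two ways to consume it."""

    @abstractmethod
    def records(self) -> Iterator[bytes]:
        """Yield dataset records one at a time."""


class KafkaStreamAccess(DatasetAccess):
    """Stream records as they arrive, via a Kafka topic (kafka-python client)."""

    def __init__(self, topic: str, bootstrap_servers: str):
        from kafka import KafkaConsumer  # pip install kafka-python
        self._consumer = KafkaConsumer(topic,
                                       bootstrap_servers=bootstrap_servers,
                                       auto_offset_reset="earliest")

    def records(self) -> Iterator[bytes]:
        for message in self._consumer:
            yield message.value


class OfflineFileAccess(DatasetAccess):
    """Offline access to an already-downloaded dataset.

    A real deployment would read from HDFS; a local file is used here only so
    the sketch runs without a Hadoop cluster.
    """

    def __init__(self, path: str):
        self._path = path

    def records(self) -> Iterator[bytes]:
        with open(self._path, "rb") as f:
            for line in f:
                yield line
```

    A consumer then iterates over records() without knowing which backend serves them, which is the point of a pluggable storage layer.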

Kadim Taşdemir - One of the best experts on this subject based on the ideXlab platform.

  • The use of k-means++ for approximate spectral clustering of Large Datasets
    2014 22nd Signal Processing and Communications Applications Conference (SIU), 2014
    Co-Authors: Berna Yalçin, Kadim Taşdemir
    Abstract:

    Spectral clustering (SC) has been commonly used in recent years thanks to its nonparametric model, its ability to extract clusters of different manifolds, and its easy application. However, SC is infeasible for Large Datasets because of its high computational cost and memory requirement. To address this challenge, approximate spectral clustering (ASC) has been proposed for Large Datasets. ASC involves two steps: first, a limited number of data representatives (also known as prototypes) is selected by sampling or quantization methods; then SC is applied to these representatives using various similarity criteria. In this study, several quantization and sampling methods are compared for ASC. Among them, k-means++, a recently popular clustering algorithm, is used to select prototypes in ASC for the first time. Experiments on different Datasets indicate that k-means++ is a suitable alternative to neural gas and selective sampling in terms of accuracy and computational cost.
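    A minimal sketch of the two-step ASC pipeline described above, using scikit-learn: k-means++ seeding (inside KMeans) picks the prototypes, spectral clustering groups only the prototypes, and each original point inherits the cluster of its prototype. The RBF affinity and the parameter values are assumptions for illustration; the paper itself compares several similarity criteria.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=20000, noise=0.05, random_state=0)

# Step 1: select m prototypes with k-means++ seeding (KMeans then refines them).
m = 200
quantizer = KMeans(n_clusters=m, init="k-means++", n_init=1, random_state=0).fit(X)
prototypes = quantizer.cluster_centers_

# Step 2: spectral clustering on the m prototypes only (cheap, since m << n).
sc = SpectralClustering(n_clusters=2, affinity="rbf", gamma=10.0,
                        assign_labels="kmeans", random_state=0)
prototype_labels = sc.fit_predict(prototypes)

# Step 3: propagate prototype labels back to every original point.
labels = prototype_labels[quantizer.labels_]
print(np.bincount(labels))  # cluster sizes over the full dataset
```

    The cost of the spectral step now depends on the number of prototypes m rather than on the dataset size n, which is what makes the approximation tractable for Large Datasets.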

  • Vector quantization based approximate spectral clustering of Large Datasets
    Pattern Recognition, 2012
    Co-Authors: Kadim Taşdemir
    Abstract:

    Spectral partitioning, recently popular for unsupervised clustering, is infeasible for Large Datasets due to its computational complexity and memory requirement. Therefore, approximate spectral clustering of data representatives (selected by various sampling methods) has been used. Alternatively, we propose to use neural networks (self-organizing maps and neural gas), which are known to achieve quantization with small distortion, as the preliminary sampling step for approximate spectral clustering (ASC). We show that they usually outperform k-means sampling (previously shown superior to various sampling methods) in terms of the clustering accuracy obtained by ASC. More importantly, for quantization-based ASC, we introduce a local density-based similarity measure, constructed without any user-set parameter, which achieves accuracies superior to those of commonly used distance-based similarity.
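    A compact sketch of the quantize-then-cluster idea, with a small NumPy neural gas as the quantizer and scikit-learn's spectral clustering on the resulting prototypes. The learning-rate and neighbourhood schedules are generic textbook choices, and a plain RBF similarity replaces the paper's density-based measure, so this is only an approximation of the approach described above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def neural_gas(X, n_prototypes=100, n_iter=20000, eps=(0.5, 0.01), lam=(10.0, 0.5), seed=0):
    """Online neural-gas quantization: prototypes are pulled towards each sample
    with a strength that decays with their distance rank."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_prototypes, replace=False)].astype(float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        frac = t / n_iter
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac    # learning-rate decay
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac    # neighbourhood decay
        ranks = np.argsort(np.argsort(np.linalg.norm(W - x, axis=1)))
        W += (eps_t * np.exp(-ranks / lam_t))[:, None] * (x - W)
    return W


X = np.random.default_rng(1).normal(size=(10000, 2))   # stand-in dataset
prototypes = neural_gas(X, n_prototypes=100)

# Spectral clustering on the prototypes only; labels propagate via nearest prototype.
proto_labels = SpectralClustering(n_clusters=3, affinity="rbf", gamma=1.0,
                                  random_state=0).fit_predict(prototypes)
nearest = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
labels = proto_labels[nearest]
```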

Daniel Hintze - One of the best experts on this subject based on the ideXlab platform.

  • MASCOTS - Picky: Efficient and Reproducible Sharing of Large Datasets Using Merkle-Trees
    2016 IEEE 24th International Symposium on Modeling Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016
    Co-Authors: Daniel Hintze, Andrew Rice
    Abstract:

    There is growing demand for researchers to share Datasets in order to allow others to reproduce results or investigate new questions. The most common option is simply to deposit the data online in its entirety. However, this mechanism of distribution becomes impractical as the size of the dataset increases or if the dataset changes frequently as new data is collected. In this paper we describe Picky, a new Merkle-tree-based system for sharing Large Datasets that allows users to download selected portions and to receive incremental updates. We demonstrate the viability of our approach by quantifying its benefit when applied to a number of Large Datasets used in the networking and measurement community.

Nitin Chiluka - One of the best experts on this subject based on the ideXlab platform.

  • The out-of-core KNN awakens: the light side of computation force on Large Datasets
    Computing, 2018
    Co-Authors: Javier Olivares, Anne-marie Kermarrec, Nitin Chiluka
    Abstract:

    K-nearest neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource-greedy operation, particularly for Large Datasets. We focus on the challenge of KNN computation over Large Datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on Large Datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data, and make efficient use of only the available memory. We evaluate our approach on Large Datasets, in terms of performance and memory consumption. The evaluation shows that our approach requires only 7% of the time needed by an in-memory baseline to compute a KNN graph.
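    The disk-friendly access pattern described above (few random reads, long sequential scans, bounded memory) can be illustrated with a block-wise brute-force KNN over a memory-mapped file. This is a generic out-of-core sketch, not the authors' algorithm; the block size and file layout are assumptions.

```python
import numpy as np


def knn_out_of_core(path, n, d, k, block=4096):
    """Brute-force KNN graph over n points of dimension d stored row-major on disk.

    Only two blocks of points are in memory at once; the reference block is
    scanned sequentially for every query block."""
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    nn_idx = np.empty((n, k), dtype=np.int64)
    nn_dist = np.empty((n, k), dtype=np.float64)

    for qs in range(0, n, block):                      # query block, held in memory
        Q = np.asarray(data[qs:qs + block], dtype=np.float64)
        best_d = np.full((len(Q), k), np.inf)
        best_i = np.zeros((len(Q), k), dtype=np.int64)

        for rs in range(0, n, block):                  # reference block, sequential read
            R = np.asarray(data[rs:rs + block], dtype=np.float64)
            D = (Q ** 2).sum(1)[:, None] - 2.0 * Q @ R.T + (R ** 2).sum(1)[None, :]
            if rs == qs:                               # a point is not its own neighbour
                np.fill_diagonal(D, np.inf)

            # merge this block's candidates into the running top-k
            cand_d = np.hstack([best_d, D])
            cand_i = np.hstack([best_i,
                                np.broadcast_to(np.arange(rs, rs + len(R)), (len(Q), len(R)))])
            order = np.argsort(cand_d, axis=1)[:, :k]
            best_d = np.take_along_axis(cand_d, order, axis=1)
            best_i = np.take_along_axis(cand_i, order, axis=1)

        nn_idx[qs:qs + len(Q)] = best_i
        nn_dist[qs:qs + len(Q)] = np.sqrt(np.maximum(best_d, 0.0))
    return nn_idx, nn_dist
```

    The inner loop reads the whole file once per query block, so all I/O is sequential, while memory stays bounded by two blocks plus the running top-k lists.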

  • NETYS - The Out-of-core KNN Awakens: The light side of computation force on Large Datasets
    Networked Systems, 2016
    Co-Authors: Nitin Chiluka, Anne-marie Kermarrec, Javier Olivares
    Abstract:

    K-Nearest Neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource-greedy operation, particularly for Large Datasets. We focus on the challenge of KNN computation over Large Datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on Large Datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data, and make efficient use of only the available memory.