Large Datasets

The Experts below are selected from a list of 71,124 Experts worldwide ranked by the ideXlab platform

Andrew Rice - One of the best experts on this subject based on the ideXlab platform.

  • MASCOTS - Picky: Efficient and Reproducible Sharing of Large Datasets Using Merkle-Trees
    2016 IEEE 24th International Symposium on Modeling Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016
    Co-Authors: Daniel Hintze, Andrew Rice
    Abstract:

    There is growing demand for researchers to share Datasets in order to allow others to reproduce results or investigate new questions. The most common option is simply to deposit the data online in its entirety. However, this mechanism of distribution becomes impractical as the size of the dataset increases or if the dataset changes frequently as new data is collected. In this paper we describe Picky, a new Merkle-tree-based system for sharing Large Datasets that allows users to download selected portions and to receive incremental updates. We demonstrate the viability of our approach by quantifying its benefit when applied to a number of Large Datasets used in the networking and measurement community.
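    The selective-download idea above can be pictured with a small Merkle-tree sketch: the dataset is hashed in fixed-size chunks, the chunk hashes form the tree's leaves, and a client who fetches only one chunk can still verify it against the published root hash using a short inclusion proof. This is a minimal illustration of the general technique, not Picky's actual implementation; the chunk size, file name, and helper names are assumptions.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB chunks; the chunk size is an assumption, not Picky's value


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def chunk_hashes(path: str):
    """Hash the dataset in fixed-size chunks; these are the tree's leaves."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            hashes.append(_h(block))
    return hashes


def build_tree(leaves):
    """Build the Merkle tree bottom-up; returns every level, root level last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else prev[i]))
                       for i in range(0, len(prev), 2)])
    return levels


def proof(levels, index):
    """Sibling hashes needed to verify one leaf against the root."""
    path = []
    for level in levels[:-1]:
        sib = index ^ 1
        path.append(level[sib] if sib < len(level) else level[index])
        index //= 2
    return path


def verify(leaf_hash, index, path, root):
    """Recompute the root from a single chunk hash plus its inclusion proof."""
    h = leaf_hash
    for sib in path:
        h = _h(h + sib) if index % 2 == 0 else _h(sib + h)
        index //= 2
    return h == root


# Example (hypothetical file): check chunk 3 without downloading the rest.
# leaves = chunk_hashes("measurements.bin")
# levels = build_tree(leaves)
# root = levels[-1][0]
# assert verify(leaves[3], 3, proof(levels, 3), root)
```

    Incremental updates follow the same logic: when a chunk changes, only the hashes on its path to the root change, so a client can detect exactly which chunks to re-download.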

Jim Dowling - One of the best experts on this subject based on the ideXlab platform.

  • Dela — Sharing Large Datasets between Hadoop Clusters
    2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017
    Co-Authors: Alexandru A. Ormenişan, Jim Dowling
    Abstract:

    Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state of the art for storing Large Datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing 'Big Data'. Existing Large-scale storage platforms, however, lack support for the efficient sharing of Large Datasets over the Internet. Systems widely used for the dissemination of Large files, like BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for Large-scale storage backends and for data transfers that are non-intrusive to existing TCP network traffic while providing higher network throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer, implementing two alternative ways for clients to access shared data: stream processing of data as it arrives with Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of Large Datasets.
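    The two access modes mentioned in the abstract (streaming with Kafka versus offline reads from HDFS) can be pictured as implementations of one pluggable access interface. The sketch below is illustrative only: the class and method names are assumptions, not Dela's actual API, and a local file stands in for HDFS so the example stays self-contained.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class DatasetAccess(ABC):
    """Hypothetical pluggable access layer: one shared dataset, two ways to consume it."""

    @abstractmethod
    def records(self) -> Iterator[bytes]:
        """Yield dataset records one at a time."""


class KafkaStreamAccess(DatasetAccess):
    """Stream records as they arrive, via a Kafka topic (kafka-python client)."""

    def __init__(self, topic: str, bootstrap_servers: str):
        from kafka import KafkaConsumer  # pip install kafka-python
        self._consumer = KafkaConsumer(topic,
                                       bootstrap_servers=bootstrap_servers,
                                       auto_offset_reset="earliest")

    def records(self) -> Iterator[bytes]:
        for message in self._consumer:
            yield message.value


class OfflineFileAccess(DatasetAccess):
    """Offline access to an already-downloaded dataset.

    A real deployment would read from HDFS; a local file is used here only so
    the sketch runs without a Hadoop cluster.
    """

    def __init__(self, path: str):
        self._path = path

    def records(self) -> Iterator[bytes]:
        with open(self._path, "rb") as f:
            for line in f:
                yield line
```

    A consumer then iterates over records() without knowing which backend serves them, which is the point of a pluggable storage layer.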

Kadim Taşdemir - One of the best experts on this subject based on the ideXlab platform.

  • The use of k-means++ for approximate spectral clustering of Large Datasets
    2014 22nd Signal Processing and Communications Applications Conference (SIU), 2014
    Co-Authors: Berna Yalçin, Kadim Taşdemir
    Abstract:

    Spectral clustering (SC) has been commonly used in recent years thanks to its nonparametric model, its ability to extract clusters of different manifolds, and its easy application. However, SC is infeasible for Large Datasets because of its high computational cost and memory requirement. To address this challenge, approximate spectral clustering (ASC) has been proposed for Large Datasets. ASC involves two steps: first, a limited number of data representatives (also known as prototypes) is selected by sampling or quantization methods; then SC is applied to these representatives using various similarity criteria. In this study, several quantization and sampling methods are compared for ASC. Among them, k-means++, a recently popular clustering algorithm, is used to select prototypes in ASC for the first time. Experiments on different Datasets indicate that k-means++ is a suitable alternative to neural gas and selective sampling in terms of accuracy and computational cost.
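    A minimal sketch of the two-step ASC pipeline described above, using scikit-learn: k-means++ seeding (inside KMeans) picks the prototypes, spectral clustering groups only the prototypes, and each original point inherits the cluster of its prototype. The RBF affinity and the parameter values are assumptions for illustration; the paper itself compares several similarity criteria.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=20000, noise=0.05, random_state=0)

# Step 1: select m prototypes with k-means++ seeding (KMeans then refines them).
m = 200
quantizer = KMeans(n_clusters=m, init="k-means++", n_init=1, random_state=0).fit(X)
prototypes = quantizer.cluster_centers_

# Step 2: spectral clustering on the m prototypes only (cheap, since m << n).
sc = SpectralClustering(n_clusters=2, affinity="rbf", gamma=10.0,
                        assign_labels="kmeans", random_state=0)
prototype_labels = sc.fit_predict(prototypes)

# Step 3: propagate prototype labels back to every original point.
labels = prototype_labels[quantizer.labels_]
print(np.bincount(labels))  # cluster sizes over the full dataset
```

    The cost of the spectral step now depends on the number of prototypes m rather than on the dataset size n, which is what makes the approximation tractable for Large Datasets.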

  • Vector quantization based approximate spectral clustering of Large Datasets
    Pattern Recognition, 2012
    Co-Authors: Kadim Taşdemir
    Abstract:

    Spectral partitioning, recently popular for unsupervised clustering, is infeasible for Large Datasets due to its computational complexity and memory requirement. Therefore, approximate spectral clustering of data representatives (selected by various sampling methods) has been used. Alternatively, we propose to use neural networks (self-organizing maps and neural gas), which are known to achieve quantization with small distortion, as the preliminary sampling step for approximate spectral clustering (ASC). We show that they usually outperform k-means sampling (previously shown superior to various sampling methods) in terms of the clustering accuracy obtained by ASC. More importantly, for quantization-based ASC, we introduce a local density-based similarity measure, constructed without any user-set parameter, which achieves accuracies superior to those of commonly used distance-based similarity.
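    A compact sketch of the quantize-then-cluster idea, with a small NumPy neural gas as the quantizer and scikit-learn's spectral clustering on the resulting prototypes. The learning-rate and neighbourhood schedules are generic textbook choices, and a plain RBF similarity replaces the paper's density-based measure, so this is only an approximation of the approach described above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def neural_gas(X, n_prototypes=100, n_iter=20000, eps=(0.5, 0.01), lam=(10.0, 0.5), seed=0):
    """Online neural-gas quantization: prototypes are pulled towards each sample
    with a strength that decays with their distance rank."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_prototypes, replace=False)].astype(float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        frac = t / n_iter
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac    # learning-rate decay
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac    # neighbourhood decay
        ranks = np.argsort(np.argsort(np.linalg.norm(W - x, axis=1)))
        W += (eps_t * np.exp(-ranks / lam_t))[:, None] * (x - W)
    return W


X = np.random.default_rng(1).normal(size=(10000, 2))   # stand-in dataset
prototypes = neural_gas(X, n_prototypes=100)

# Spectral clustering on the prototypes only; labels propagate via nearest prototype.
proto_labels = SpectralClustering(n_clusters=3, affinity="rbf", gamma=1.0,
                                  random_state=0).fit_predict(prototypes)
nearest = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
labels = proto_labels[nearest]
```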

Daniel Hintze - One of the best experts on this subject based on the ideXlab platform.

  • MASCOTS - Picky: Efficient and Reproducible Sharing of Large Datasets Using Merkle-Trees
    2016 IEEE 24th International Symposium on Modeling Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016
    Co-Authors: Daniel Hintze, Andrew Rice
    Abstract:

    There is growing demand for researchers to share Datasets in order to allow others to reproduce results or investigate new questions. The most common option is simply to deposit the data online in its entirety. However, this mechanism of distribution becomes impractical as the size of the dataset increases or if the dataset changes frequently as new data is collected. In this paper we describe Picky, a new Merkle-tree-based system for sharing Large Datasets that allows users to download selected portions and to receive incremental updates. We demonstrate the viability of our approach by quantifying its benefit when applied to a number of Large Datasets used in the networking and measurement community.

Nitin Chiluka - One of the best experts on this subject based on the ideXlab platform.

  • The out-of-core KNN awakens: the light side of computation force on Large Datasets
    Computing, 2018
    Co-Authors: Javier Olivares, Anne-marie Kermarrec, Nitin Chiluka
    Abstract:

    K-nearest neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource-greedy operation, particularly for Large Datasets. We focus on the challenge of KNN computation over Large Datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on Large Datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data, and make efficient use of only the available memory. We evaluate our approach on Large Datasets, in terms of performance and memory consumption. The evaluation shows that our approach requires only 7% of the time needed by an in-memory baseline to compute a KNN graph.
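    The disk-friendly access pattern described above (few random reads, long sequential scans, bounded memory) can be illustrated with a block-wise brute-force KNN over a memory-mapped file. This is a generic out-of-core sketch, not the authors' algorithm; the block size and file layout are assumptions.

```python
import numpy as np


def knn_out_of_core(path, n, d, k, block=4096):
    """Brute-force KNN graph over n points of dimension d stored row-major on disk.

    Only two blocks of points are in memory at once; the reference block is
    scanned sequentially for every query block."""
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    nn_idx = np.empty((n, k), dtype=np.int64)
    nn_dist = np.empty((n, k), dtype=np.float64)

    for qs in range(0, n, block):                      # query block, held in memory
        Q = np.asarray(data[qs:qs + block], dtype=np.float64)
        best_d = np.full((len(Q), k), np.inf)
        best_i = np.zeros((len(Q), k), dtype=np.int64)

        for rs in range(0, n, block):                  # reference block, sequential read
            R = np.asarray(data[rs:rs + block], dtype=np.float64)
            D = (Q ** 2).sum(1)[:, None] - 2.0 * Q @ R.T + (R ** 2).sum(1)[None, :]
            if rs == qs:                               # a point is not its own neighbour
                np.fill_diagonal(D, np.inf)

            # merge this block's candidates into the running top-k
            cand_d = np.hstack([best_d, D])
            cand_i = np.hstack([best_i,
                                np.broadcast_to(np.arange(rs, rs + len(R)), (len(Q), len(R)))])
            order = np.argsort(cand_d, axis=1)[:, :k]
            best_d = np.take_along_axis(cand_d, order, axis=1)
            best_i = np.take_along_axis(cand_i, order, axis=1)

        nn_idx[qs:qs + len(Q)] = best_i
        nn_dist[qs:qs + len(Q)] = np.sqrt(np.maximum(best_d, 0.0))
    return nn_idx, nn_dist
```

    The inner loop reads the whole file once per query block, so all I/O is sequential, while memory stays bounded by two blocks plus the running top-k lists.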

  • NETYS - The Out-of-core KNN Awakens: The light side of computation force on Large Datasets
    Networked Systems, 2016
    Co-Authors: Nitin Chiluka, Anne-marie Kermarrec, Javier Olivares
    Abstract:

    K-Nearest Neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource-greedy operation, particularly for Large Datasets. We focus on the challenge of KNN computation over Large Datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on Large Datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data, and make efficient use of only the available memory.