Graph Analytics

The Experts below are selected from a list of 4,665 Experts worldwide, ranked by the ideXlab platform.

Keshav Pingali - One of the best experts on this subject based on the ideXlab platform.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    Operating Systems Review, 2021
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems use a built-in partitioner that incorporates a particular partitioning policy, but the best policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a few policies. CuSP is a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
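
    To make the policy abstraction concrete, the Python sketch below models a streaming edge partitioner whose "policy" is just a pair of user-supplied functions: one assigning each vertex a master host and one assigning each edge an owner host. The names (master_of, owner_of) and the mirror bookkeeping are illustrative assumptions, not CuSP's actual API.

        # Minimal sketch of streaming edge partitioning with a pluggable
        # policy; master_of/owner_of are hypothetical names, not CuSP's API.
        def make_partitions(edges, num_hosts, master_of, owner_of):
            parts = [{"edges": [], "mirrors": set()} for _ in range(num_hosts)]
            for u, v in edges:                        # one streaming pass
                h = owner_of(u, v, num_hosts)         # policy picks the owner host
                parts[h]["edges"].append((u, v))
                for x in (u, v):                      # record proxies for vertices
                    if master_of(x, num_hosts) != h:  # whose master lives elsewhere
                        parts[h]["mirrors"].add(x)
            return parts

        by_source = lambda u, v, n: u % n             # a 1D edge placement
        by_hash   = lambda u, v, n: hash((u, v)) % n  # a simple vertex-cut
        by_id     = lambda x, n: x % n                # master assignment

        edges = [(0, 1), (1, 2), (2, 0), (3, 1)]
        print(make_partitions(edges, 2, by_id, by_source))

    Swapping owner_of changes the policy without touching the streaming loop, which is the kind of flexibility the paper argues built-in partitioners lack.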

  • A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs
    International Parallel and Distributed Processing Symposium, 2020
    Co-Authors: Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, Krishna V Nandivada, Keshav Pingali
    Abstract:

    There are relatively few studies of distributed GPU Graph Analytics systems in the literature, and they are limited in scope: they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of Graph Analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU Graph analytical framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.
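
    The Cartesian vertex-cut in finding (1) can be stated in a few lines: arrange the hosts in an r × c grid, block vertex IDs into row and column ranges, and place edge (u, v) on the host at (row block of u, column block of v). A minimal sketch, with the contiguous ID blocking assumed for illustration:

        # Minimal sketch of a Cartesian vertex-cut on an r x c host grid;
        # the contiguous vertex-ID blocking is an assumption for this example.
        def cartesian_owner(u, v, num_vertices, rows, cols):
            row = u * rows // num_vertices    # row block of the source
            col = v * cols // num_vertices    # column block of the destination
            return row * cols + col          # flatten grid coordinate to host ID

        # 8 vertices on a 2x2 grid of hosts: a vertex's proxies are confined
        # to one grid row (as a source) and one grid column (as a destination),
        # so each host synchronizes with only a few partners.
        for u, v in [(0, 7), (5, 2), (3, 3)]:
            print((u, v), "->", cartesian_owner(u, v, 8, 2, 2))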

  • Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
    International Conference on Parallel Architectures and Compilation Techniques, 2019
    Co-Authors: Roshan Dathathri, Keshav Pingali, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, Krishna V Nandivada, Marc Snir
    Abstract:

    Distributed Graph Analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed Graph Analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous Graph Analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed Graph Analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU Graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5× faster than those in D-Galois and D-IrGL on real-world large-diameter Graphs at scale. They are also on average ~12× faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU Graph Analytics system.
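
    The cost of bulk synchronization under load imbalance, and what bulk-asynchrony buys back, shows up even in a toy timing model (illustrative numbers only; it ignores the message dependencies that can make real BASP executions perform extra rounds of stale work):

        # Per-round compute cost for two hosts with complementary imbalance.
        work = {"host0": [3, 1, 1], "host1": [1, 3, 1]}

        # BSP: every round ends at a global barrier, so each round costs as
        # much as its slowest host.
        bsp_time = sum(max(round_costs) for round_costs in zip(*work.values()))

        # BASP: hosts proceed at their own pace, so elapsed time is bounded
        # by the busiest single host.
        basp_time = max(sum(rounds) for rounds in work.values())

        print(bsp_time, basp_time)   # 7 vs. 5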

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    International Parallel and Distributed Processing Symposium, 2019
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

  • Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory.
    arXiv: Distributed Parallel and Cluster Computing, 2019
    Co-Authors: Gurbinder Gill, Roshan Dathathri, Loc Hoang, Ramesh Peri, Keshav Pingali
    Abstract:

    Intel Optane DC Persistent Memory (Optane PMM) is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this paper, we present key runtime and algorithmic principles to consider when performing Graph Analytics on extreme-scale Graphs on large-memory platforms of this sort. To demonstrate the importance of these principles, we evaluate four existing shared-memory Graph frameworks on large real-world web-crawls, using a machine with 6TB of Optane PMM. Our results show that frameworks based on the runtime and algorithmic principles advocated in this paper (i) perform significantly better than the others, and (ii) are competitive with Graph Analytics frameworks running on large production clusters.
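
    The abstract does not enumerate the principles, but one technique such large-memory platforms invite is memory-mapping the Graph so a traversal faults in only the pages it touches instead of loading terabytes up front. A hedged sketch follows; the two-file CSR layout and names are assumptions for this example, not the paper's code.

        # Hypothetical two-file CSR layout (row offsets + edge targets),
        # memory-mapped so traversals fault in only the pages they touch.
        import numpy as np

        def open_csr(prefix):
            indptr  = np.memmap(prefix + ".indptr",  dtype=np.int64, mode="r")
            indices = np.memmap(prefix + ".indices", dtype=np.int64, mode="r")
            return indptr, indices

        def neighbors(indptr, indices, v):
            return indices[indptr[v]:indptr[v + 1]]

        # Tiny demo: write a 3-vertex graph (0->1, 0->2, 1->2), read it back.
        np.array([0, 2, 3, 3], dtype=np.int64).tofile("g.indptr")
        np.array([1, 2, 2], dtype=np.int64).tofile("g.indices")
        indptr, indices = open_csr("g")
        print(list(neighbors(indptr, indices, 0)))   # [1, 2]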

Roshan Dathathri - One of the best experts on this subject based on the ideXlab platform.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    Operating Systems Review, 2021
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems use a built-in partitioner that incorporates a particular partitioning policy, but the best policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a few policies. CuSP is a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

  • Distributed Training of Embeddings using Graph Analytics
    International Parallel and Distributed Processing Symposium, 2021
    Co-Authors: Gurbinder Gill, Roshan Dathathri, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi
    Abstract:

    Many applications today, such as natural language processing and network and code analysis, rely on semantically embedding objects into low-dimensional fixed-length vectors. Such embeddings naturally provide a way to perform useful downstream tasks, such as identifying relations among objects and predicting objects for a given context. Unfortunately, training accurate embeddings is usually computationally intensive and requires processing large amounts of data. This paper presents a distributed training framework for a class of applications that use Skip-gram-like models to generate embeddings. We call this class Any2Vec; it includes Word2Vec (Gensim) and Vertex2Vec (DeepWalk and Node2Vec), among others. We first formulate the Any2Vec training algorithm as a Graph application. We then adapt the state-of-the-art distributed Graph Analytics framework, D-Galois, to support dynamic Graph generation and re-partitioning, and incorporate novel communication optimizations. We show that on a cluster of 32 48-core hosts, our framework GraphAny2Vec matches the accuracy of the state-of-the-art shared-memory implementations of Word2Vec and Vertex2Vec, and gives geo-mean speedups of 12× and 5×, respectively. Furthermore, GraphAny2Vec is on average 2× faster than DMTK, the state-of-the-art distributed Word2Vec implementation, on 32 hosts, while yielding much better accuracy.
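
    The "training as a Graph application" formulation can be sketched directly: every Skip-gram (center, context) pair is an edge, and one training step is an operator applied to the embedding vectors at the edge's two endpoints. The SGD details below are a textbook illustration, not GraphAny2Vec's implementation.

        import numpy as np

        dim, lr = 8, 0.025
        rng = np.random.default_rng(0)
        words = ["cat", "sat", "mat"]
        emb = {w: rng.normal(0, 0.1, dim) for w in words}   # center vectors
        ctx = {w: np.zeros(dim) for w in words}             # context vectors

        def edge_update(center, context, label):
            # Logistic loss on one (center, context) edge; label is 1 for an
            # observed pair and 0 for a negative sample.
            score = 1.0 / (1.0 + np.exp(-emb[center] @ ctx[context]))
            grad = score - label
            d_center = grad * ctx[context]
            ctx[context] -= lr * grad * emb[center]
            emb[center] -= lr * d_center

        # The window around "sat" in "cat sat mat" yields two positive edges,
        # plus one crude negative-sample edge.
        for u, v in [("sat", "cat"), ("sat", "mat")]:
            edge_update(u, v, 1)
        edge_update("cat", "mat", 0)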

  • Evaluation of Graph Analytics Frameworks Using the GAP Benchmark Suite
    IEEE International Symposium on Workload Characterization, 2020
    Co-Authors: Ariful Azad, Roshan Dathathri, Mohsen Mahmoudi Aznaveh, Scott Beamer, Mark Blanco, Jinhao Chen, Luke Dalessandro, Timothy A Davis, Kevin Deweese, Jesun Sahariar Firoz
    Abstract:

    Graphs play a key role in data Analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways, and which Graph solution works best on a given platform changes with the structure of the Graph. This makes it difficult to decide which Graph programming framework is best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five different frameworks for Graph Analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection, and GraphIt. We use the GAP Benchmark Suite to evaluate each framework. GAP consists of 30 tests: six Graph algorithms (breadth-first search, single-source shortest path, PageRank, betweenness centrality, connected components, and triangle counting) on five Graphs. The GAP Benchmark Suite includes high-performance reference implementations to provide a performance baseline for comparison. Our results show the relative strengths of each framework, but also serve as a case study of the challenges of establishing objective measures for comparing Graph frameworks.
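
    For concreteness, here is one of GAP's six kernels, PageRank, as a plain textbook Python sketch; the suite's reference implementations are optimized C++ versions of kernels like this, run over the five benchmark Graphs.

        # Textbook power-iteration PageRank (illustration only, not any
        # evaluated framework's code); adj maps each vertex to its out-neighbors.
        def pagerank(adj, n, d=0.85, iters=20):
            rank = [1.0 / n] * n
            for _ in range(iters):
                contrib = [0.0] * n
                for u, nbrs in adj.items():
                    share = d * rank[u] / len(nbrs)   # spread rank over out-edges
                    for v in nbrs:
                        contrib[v] += share
                rank = [(1.0 - d) / n + c for c in contrib]
            return rank

        adj = {0: [1, 2], 1: [2], 2: [0]}
        print(pagerank(adj, 3))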

  • A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs
    International Parallel and Distributed Processing Symposium, 2020
    Co-Authors: Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, Krishna V Nandivada, Keshav Pingali
    Abstract:

    There are relatively few studies of distributed GPU Graph Analytics systems in the literature, and they are limited in scope: they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of Graph Analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU Graph analytical framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.

  • Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
    International Conference on Parallel Architectures and Compilation Techniques, 2019
    Co-Authors: Roshan Dathathri, Keshav Pingali, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, Krishna V Nandivada, Marc Snir
    Abstract:

    Distributed Graph Analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed Graph Analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous Graph Analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed Graph Analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU Graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5× faster than those in D-Galois and D-IrGL on real-world large-diameter Graphs at scale. They are also on average ~12× faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU Graph Analytics system.

Gurbinder Gill - One of the best experts on this subject based on the ideXlab platform.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    Operating Systems Review, 2021
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems use a built-in partitioner that incorporates a particular partitioning policy, but the best policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a few policies. CuSP is a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

  • Distributed Training of Embeddings using Graph Analytics
    International Parallel and Distributed Processing Symposium, 2021
    Co-Authors: Gurbinder Gill, Roshan Dathathri, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi
    Abstract:

    Many applications today, such as natural language processing and network and code analysis, rely on semantically embedding objects into low-dimensional fixed-length vectors. Such embeddings naturally provide a way to perform useful downstream tasks, such as identifying relations among objects and predicting objects for a given context. Unfortunately, training accurate embeddings is usually computationally intensive and requires processing large amounts of data. This paper presents a distributed training framework for a class of applications that use Skip-gram-like models to generate embeddings. We call this class Any2Vec; it includes Word2Vec (Gensim) and Vertex2Vec (DeepWalk and Node2Vec), among others. We first formulate the Any2Vec training algorithm as a Graph application. We then adapt the state-of-the-art distributed Graph Analytics framework, D-Galois, to support dynamic Graph generation and re-partitioning, and incorporate novel communication optimizations. We show that on a cluster of 32 48-core hosts, our framework GraphAny2Vec matches the accuracy of the state-of-the-art shared-memory implementations of Word2Vec and Vertex2Vec, and gives geo-mean speedups of 12× and 5×, respectively. Furthermore, GraphAny2Vec is on average 2× faster than DMTK, the state-of-the-art distributed Word2Vec implementation, on 32 hosts, while yielding much better accuracy.

  • A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs
    International Parallel and Distributed Processing Symposium, 2020
    Co-Authors: Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, Krishna V Nandivada, Keshav Pingali
    Abstract:

    There are relatively few studies of distributed GPU Graph Analytics systems in the literature, and they are limited in scope: they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of Graph Analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU Graph analytical framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.

  • Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
    International Conference on Parallel Architectures and Compilation Techniques, 2019
    Co-Authors: Roshan Dathathri, Keshav Pingali, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, Krishna V Nandivada, Marc Snir
    Abstract:

    Distributed Graph Analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed Graph Analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous Graph Analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed Graph Analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU Graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5× faster than those in D-Galois and D-IrGL on real-world large-diameter Graphs at scale. They are also on average ~12× faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU Graph Analytics system.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    International Parallel and Distributed Processing Symposium, 2019
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

Loc Hoang - One of the best experts on this subject based on the ideXlab platform.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    Operating Systems Review, 2021
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems use a built-in partitioner that incorporates a particular partitioning policy, but the best policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a few policies. CuSP is a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

  • A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs
    International Parallel and Distributed Processing Symposium, 2020
    Co-Authors: Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, Krishna V Nandivada, Keshav Pingali
    Abstract:

    There are relatively few studies of distributed GPU Graph Analytics systems in the literature, and they are limited in scope: they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of Graph Analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU Graph analytical framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.

  • Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
    International Conference on Parallel Architectures and Compilation Techniques, 2019
    Co-Authors: Roshan Dathathri, Keshav Pingali, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, Krishna V Nandivada, Marc Snir
    Abstract:

    Distributed Graph Analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported efficiently by current message transport layers, but bulk-synchronization can exacerbate the performance impact of load imbalance because a round cannot be completed until every host has completed that round. Asynchronous distributed Graph Analytics systems circumvent this problem by permitting hosts to make progress at their own pace, but existing systems either use global locks and send small messages or send large messages but do not support general partitioning policies such as vertex-cuts. Consequently, they perform substantially worse than bulk-synchronous systems. Moreover, none of their programming or execution models can be easily adapted for heterogeneous devices like GPUs. In this paper, we design and implement a lock-free, non-blocking, bulk-asynchronous runtime called Gluon-Async for distributed and heterogeneous Graph Analytics. The runtime supports any partitioning policy and uses bulk-communication. We present the bulk-asynchronous parallel (BASP) model which allows the programmer to utilize the runtime by specifying only the abstract communication required. Applications written in this model are compared with the BSP programs written using (1) D-Galois and D-IrGL, the state-of-the-art distributed Graph Analytics systems (which are bulk-synchronous) for CPUs and GPUs, respectively, and (2) Lux, another (bulk-synchronous) distributed GPU Graph analytical system. Our evaluation shows that programs written using BASP-style execution are on average ~1.5× faster than those in D-Galois and D-IrGL on real-world large-diameter Graphs at scale. They are also on average ~12× faster than Lux. To the best of our knowledge, Gluon-Async is the first asynchronous distributed GPU Graph Analytics system.

  • CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics
    International Parallel and Distributed Processing Symposium, 2019
    Co-Authors: Loc Hoang, Gurbinder Gill, Roshan Dathathri, Keshav Pingali
    Abstract:

    Graph Analytics systems must analyze Graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large Graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the Graph among the machines in the cluster. Existing Graph Analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input Graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone Graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality Graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl Graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

  • Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory.
    arXiv: Distributed Parallel and Cluster Computing, 2019
    Co-Authors: Gurbinder Gill, Roshan Dathathri, Loc Hoang, Ramesh Peri, Keshav Pingali
    Abstract:

    Intel Optane DC Persistent Memory (Optane PMM) is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this paper, we present key runtime and algorithmic principles to consider when performing Graph Analytics on extreme-scale Graphs on large-memory platforms of this sort. To demonstrate the importance of these principles, we evaluate four existing shared-memory Graph frameworks on large real-world web-crawls, using a machine with 6TB of Optane PMM. Our results show that frameworks based on the runtime and algorithmic principles advocated in this paper (i) perform significantly better than the others, and (ii) are competitive with Graph Analytics frameworks running on large production clusters.

Tyson Condie - One of the best experts on this subject based on the ideXlab platform.

  • Pregelix: Big(ger) Graph Analytics on a Dataflow Engine
    arXiv: Databases, 2014
    Co-Authors: Vinayak Borkar, Jianfeng Jia, Michael J. Carey, Tyson Condie
    Abstract:

    There is a growing need for distributed Graph processing systems that are capable of gracefully scaling to very large Graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many Graph processing systems follow. Pregelix is a new open source distributed Graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15× speedup compared to Apache Giraph and up to 35× speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.
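
    The iterative-dataflow formulation can be made concrete: one Pregel-style superstep is a group-by on messages followed by a join with vertex state. The toy in-memory model below runs single-source shortest paths this way; it is a sketch of the formulation, not Pregelix's actual dataflow plan.

        # One superstep = group messages by destination, join with vertex
        # state, apply the vertex program, emit new messages.
        from collections import defaultdict

        def superstep(state, messages, compute):
            inbox = defaultdict(list)           # group-by destination vertex
            for dst, val in messages:
                inbox[dst].append(val)
            out = []
            for v, msgs in inbox.items():       # join with vertex state
                state[v], sent = compute(v, state[v], msgs)
                out.extend(sent)
            return out

        # Single-source shortest paths as the vertex program.
        graph = {0: [(1, 1), (2, 4)], 1: [(2, 1)], 2: []}
        state = {v: float("inf") for v in graph}

        def sssp(v, dist, msgs):
            best = min(msgs)
            if best < dist:                     # relax and propagate
                return best, [(w, best + ew) for w, ew in graph[v]]
            return dist, []

        msgs = [(0, 0.0)]                       # seed the source vertex
        while msgs:                             # iterate to a fixed point
            msgs = superstep(state, msgs, sssp)
        print(state)                            # {0: 0.0, 1: 1.0, 2: 2.0}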

  • Pregelix: Big(ger) Graph Analytics on a Dataflow Engine
    Proceedings of the VLDB Endowment, 2014
    Co-Authors: Vinayak Borkar, Jianfeng Jia, Michael J. Carey, Tyson Condie
    Abstract:

    There is a growing need for distributed Graph processing systems that are capable of gracefully scaling to very large Graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many Graph processing systems follow. Pregelix is a new open source distributed Graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15× speedup compared to Apache Giraph and up to 35× speedup compared to distributed GraphLab), and more effective use of available machine resources to support Big(ger) Graph Analytics.