DataFrame

The experts below are selected from a list of 450 experts worldwide, ranked by the ideXlab platform.

Tomer Kaftan - One of the best experts on this subject based on the ideXlab platform.

  • SIGMOD Conference - Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Tomer Kaftan, Joseph K. Bradley, Xiangrui Meng, Cheng Lian, Michael J. Franklin, Ali Ghodsi
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
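
    The "tighter integration" claim is concrete: one program can interleave an optimizable relational query with ordinary Scala code. Below is a minimal sketch of that mix, assuming a spark-shell session (so `spark` and its implicits are predefined); the column names and data are illustrative, not from the paper.

    ```scala
    import org.apache.spark.sql.functions._

    // Declarative side: a relational query that Catalyst can analyze and optimize.
    val employees = Seq(("eng", 100000L), ("eng", 90000L), ("sales", 70000L))
      .toDF("dept", "salary")
    val avgByDept = employees.groupBy($"dept").agg(avg($"salary").as("avg_salary"))

    // Procedural side: drop back into plain Scala over the query's results.
    avgByDept.collect().foreach(r => println(s"${r.getString(0)}: ${r.getDouble(1)}"))
    ```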

  • Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Joseph K. Bradley, Reynold S Xin, Ali Ghodsi, Davies Liu, Xiangrui Meng, Matei Zaharia, Cheng Lian, Tomer Kaftan
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
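
    Catalyst's extension points are exposed even to user code: `spark.experimental.extraOptimizations` accepts additional logical-plan rules. The rule below is a hedged sketch of a composable rule, the classic `x + 0` simplification, written against the Spark 2.x expression API contemporary with the paper (Spark 3 adds an extra field to `Add`, so the pattern would need adjusting there).

    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.IntegerType

    // A tiny composable rule: rewrite `x + 0` to `x` wherever it appears.
    object RemoveAddZero extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
        case Add(left, Literal(0, IntegerType)) => left
      }
    }

    // Register the rule with this session's optimizer.
    spark.experimental.extraOptimizations ++= Seq(RemoveAddZero)
    ```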

Michael Armbrust - One of the best experts on this subject based on the ideXlab platform.

  • SIGMOD Conference - Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
    Proceedings of the 2018 International Conference on Management of Data - SIGMOD '18, 2018
    Co-Authors: Michael Armbrust, Ali Ghodsi, Ion Stoica, Joseph Torres, Burak Yavuz, Matei Zaharia
    Abstract:

    With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
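
    A sketch of what "incrementalizing a static relational query" looks like in user code, assuming a spark-shell session; the socket source, host, and port are illustrative placeholders. The query is exactly what one would write over a static DataFrame; Spark maintains its result incrementally.

    ```scala
    import org.apache.spark.sql.functions._

    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    // A static-looking relational query; the engine incrementalizes it.
    val counts = lines.select(explode(split($"value", " ")).as("word"))
      .groupBy($"word").count()

    val query = counts.writeStream
      .outputMode("complete")   // emit the full updated table each trigger
      .format("console")
      .start()
    query.awaitTermination()
    ```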

  • SIGMOD Conference - Introduction to Spark 2.0 for Database Researchers
    Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16, 2016
    Co-Authors: Michael Armbrust, Doug Bateman, Matei Zaharia
    Abstract:

    Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. It covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming, and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.
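
    Of the APIs the tutorial lists, Datasets are the newest in Spark 2.0: typed objects whose transformations still run through Catalyst and whole-stage code generation. A minimal sketch, assuming spark-shell (the Person class and data are illustrative):

    ```scala
    case class Person(name: String, age: Int)

    val people = Seq(Person("ada", 36), Person("grace", 45)).toDS()

    // Typed, compiler-checked transformations...
    val adults = people.filter(_.age >= 18).map(_.name)

    // ...that still compile down to an optimized physical plan.
    adults.explain()  // look for WholeStageCodegen nodes in the output
    adults.show()
    ```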

  • SIGMOD Conference - Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Tomer Kaftan, Joseph K. Bradley, Xiangrui Meng, Cheng Lian, Michael J. Franklin, Ali Ghodsi
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
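
    One of the Catalyst-built features this abstract lists, schema inference for JSON, needs no configuration at all. A sketch assuming spark-shell, with a placeholder input path:

    ```scala
    // Spark samples the input and infers a full (possibly nested) schema.
    val events = spark.read.json("events.json")
    events.printSchema()

    // The inferred schema immediately supports SQL over the data.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()
    ```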

  • Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Joseph K. Bradley, Reynold S Xin, Ali Ghodsi, Davies Liu, Xiangrui Meng, Matei Zaharia, Cheng Lian, Tomer Kaftan
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
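
    "Query federation to external databases" surfaces in user code as the JDBC data source, where Catalyst can push filters down into the remote system. A hedged sketch assuming spark-shell; the URL, table name, and credentials are placeholders.

    ```scala
    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "reader")
    props.setProperty("password", "secret")

    val orders = spark.read.jdbc("jdbc:postgresql://db:5432/shop", "orders", props)

    // This filter is eligible for pushdown into the remote database,
    // so only matching rows cross the network.
    orders.filter("total > 100").show()
    ```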

Ali Ghodsi - One of the best experts on this subject based on the ideXlab platform.

  • SIGMOD Conference - Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
    Proceedings of the 2018 International Conference on Management of Data - SIGMOD '18, 2018
    Co-Authors: Michael Armbrust, Ali Ghodsi, Ion Stoica, Joseph Torres, Burak Yavuz, Matei Zaharia
    Abstract:

    With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
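
    The end-to-end batch/streaming integration shows up as plain code reuse: the same function can run over a batch source interactively and over a stream. A sketch assuming spark-shell; the paths and schema are placeholders (streaming file sources require a user-supplied schema).

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types._

    val schema = new StructType().add("user", StringType).add("amount", DoubleType)

    def totals(df: DataFrame): DataFrame = df.groupBy("user").sum("amount")

    // Batch / interactive run...
    val batch = spark.read.schema(schema).json("s3://logs/day1/")
    totals(batch).show()

    // ...and the identical logic as a continuously updated streaming query.
    val stream = spark.readStream.schema(schema).json("s3://logs/live/")
    totals(stream).writeStream.outputMode("complete").format("console").start()
    ```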

  • SIGMOD Conference - SparkR: Scaling R Programs with Spark
    Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16, 2016
    Co-Authors: Shivaram Venkataraman, Ali Ghodsi, Xiangrui Meng, Zongheng Yang, Eric Liang, Hossein Falaki, Michael J. Franklin, Ion Stoica
    Abstract:

    R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.
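
    SparkR itself is an R package, so an R example would be the natural fit; to keep this document to one language, the Scala sketch below shows the kind of DataFrame computation a typical SparkR `groupBy`/`agg` call is translated into on the engine side (the column names and input path are illustrative, assuming spark-shell).

    ```scala
    import org.apache.spark.sql.functions._

    // SparkR calls build the same logical plan this Scala code builds;
    // the distributed execution that follows is identical.
    val flights = spark.read.json("flights.json")
    flights.groupBy($"carrier").agg(avg($"delay").as("avg_delay")).show()
    ```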

  • SIGMOD Conference - Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Tomer Kaftan, Joseph K. Bradley, Xiangrui Meng, Cheng Lian, Michael J. Franklin, Ali Ghodsi
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
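
    "Lets SQL users call complex analytics libraries" works through registered functions: arbitrary Scala code becomes callable from SQL. A sketch assuming spark-shell; the toy `sentiment` function is a stand-in for a real library call.

    ```scala
    // Any Scala function can be exposed to SQL users as a UDF.
    spark.udf.register("sentiment", (s: String) => if (s.contains("good")) 1.0 else -1.0)

    Seq("good product", "bad service").toDF("review").createOrReplaceTempView("reviews")
    spark.sql("SELECT review, sentiment(review) AS score FROM reviews").show()
    ```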

  • Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Joseph K. Bradley, Reynold S Xin, Ali Ghodsi, Davies Liu, Xiangrui Meng, Matei Zaharia, Cheng Lian, Tomer Kaftan
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
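
    The "machine learning types" the abstract mentions are what let MLlib's Pipeline API consume and produce DataFrames directly. A minimal sketch assuming spark-shell, with illustrative data:

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val training = Seq((0L, "spark is great", 1.0), (1L, "rather slow today", 0.0))
      .toDF("id", "text", "label")

    // Each stage reads columns from a DataFrame and appends new ones.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.transform(training).select("text", "prediction").show()
    ```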

Matei Zaharia - One of the best experts on this subject based on the ideXlab platform.

  • SIGMOD Conference - Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
    Proceedings of the 2018 International Conference on Management of Data - SIGMOD '18, 2018
    Co-Authors: Michael Armbrust, Ali Ghodsi, Ion Stoica, Joseph Torres, Burak Yavuz, Matei Zaharia
    Abstract:

    With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
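
    The operational features listed (rollbacks, code updates) rest on durable, replayable progress tracking; in user code this is a checkpoint location on the sink. A sketch using the built-in rate source, assuming spark-shell, with a placeholder checkpoint path:

    ```scala
    import org.apache.spark.sql.functions._

    val rate = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    val counts = rate.groupBy(window($"timestamp", "10 seconds")).count()

    val query = counts.writeStream
      .option("checkpointLocation", "/tmp/ckpt/rate-counts") // durable progress log
      .outputMode("update")  // emit only rows changed since the last trigger
      .format("console")
      .start()
    ```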

  • Spark: The Definitive Guide: Big Data Processing Made Simple
    2018
    Co-Authors: Bill Chambers, Matei Zaharia
    Abstract:

    Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. You'll explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.

    • Get a gentle overview of big data and Spark
    • Learn about DataFrames, SQL, and Datasets (Spark's core APIs) through worked examples
    • Dive into Spark's low-level APIs, RDDs, and execution of SQL and DataFrames
    • Understand how Spark runs on a cluster
    • Debug, monitor, and tune Spark clusters and applications
    • Learn the power of Structured Streaming, Spark's stream-processing engine
    • Learn how you can apply MLlib to a variety of problems, including classification or recommendation
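
    A short sketch spanning the book's opening topics above, the structured APIs and the drop down to the low-level RDD API, assuming spark-shell and illustrative data:

    ```scala
    val sales = Seq(("2018-01-01", 10), ("2018-01-02", 20)).toDF("date", "amount")

    // Structured APIs: DataFrame operators and SQL over the same data.
    sales.where("amount > 15").show()
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT sum(amount) FROM sales").show()

    // Low-level escape hatch: the underlying RDD of rows.
    println(sales.rdd.map(_.getInt(1)).reduce(_ + _))
    ```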

  • SIGMOD Conference - Introduction to Spark 2.0 for Database Researchers
    Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16, 2016
    Co-Authors: Michael Armbrust, Doug Bateman, Matei Zaharia
    Abstract:

    Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. It covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming, and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.
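
    The tutorial's closing exercise, extending the optimizer to speed up distributed joins, has a built-in analogue worth knowing: the broadcast hint, which asks the planner to replace a shuffle join with a broadcast hash join. A sketch with illustrative tables, assuming spark-shell:

    ```scala
    import org.apache.spark.sql.functions.broadcast

    val facts = spark.range(1000000).toDF("id")
    val dims  = Seq((0L, "a"), (1L, "b")).toDF("id", "name")

    // Hint: ship the small table to every executor instead of shuffling both sides.
    val joined = facts.join(broadcast(dims), "id")
    joined.explain()  // the plan should show a BroadcastHashJoin
    ```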

  • Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Joseph K. Bradley, Reynold S Xin, Ali Ghodsi, Davies Liu, Xiangrui Meng, Matei Zaharia, Cheng Lian, Tomer Kaftan
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
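
    The "optimized storage" benefit includes Spark SQL's in-memory columnar cache, which users reach with a single call. A sketch assuming spark-shell:

    ```scala
    val df = spark.range(1000000L).toDF("id")

    df.cache()   // mark for storage in the compressed columnar format
    df.count()   // the first action materializes the cache
    df.count()   // subsequent scans read the in-memory columnar data
    ```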

Xiangrui Meng - One of the best experts on this subject based on the ideXlab platform.

  • SIGMOD Conference - SparkR: Scaling R Programs with Spark
    Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16, 2016
    Co-Authors: Shivaram Venkataraman, Ali Ghodsi, Xiangrui Meng, Zongheng Yang, Eric Liang, Hossein Falaki, Michael J. Franklin, Ion Stoica
    Abstract:

    R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.

  • SIGMOD Conference - Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Tomer Kaftan, Joseph K. Bradley, Xiangrui Meng, Cheng Lian, Michael J. Franklin, Ali Ghodsi
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

  • Spark SQL: Relational Data Processing in Spark
    Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, 2015
    Co-Authors: Michael Armbrust, Yin Huai, Joseph K. Bradley, Reynold S Xin, Ali Ghodsi, Davies Liu, Xiangrui Meng, Matei Zaharia, Cheng Lian, Tomer Kaftan
    Abstract:

    Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.