Data Virtualization


The Experts below are selected from a list of 285 Experts worldwide ranked by ideXlab platform

Rick F. Van Der Lans - One of the best experts on this subject based on the ideXlab platform.

  • Data Virtualization and Master Data Management
    Data Virtualization for Business Intelligence Systems, 2012
    Co-Authors: Rick F. Van Der Lans
    Abstract:

    The quality of decision making depends in part on the quality of the Data used to make the decision: incorrect Data can negatively influence decisions. One area within IT that deals with the quality of Data is master Data management. Master Data management therefore has a close relationship with Data integration, and thus with Data Virtualization.

  • Deploying Data Virtualization in Business Intelligence Systems
    Data Virtualization for Business Intelligence Systems, 2012
    Co-Authors: Rick F. Van Der Lans
    Abstract:

    In this chapter, the topics of business intelligence and Data Virtualization are brought together. It describes how a Data Virtualization server can be used in a business intelligence system. The advantages and disadvantages of Data Virtualization are described. In addition, the following questions are answered: Why does deploying Data Virtualization in a business intelligence system make the latter more agile? What are the different application areas of Data Virtualization? What are the strategies for adopting Data Virtualization?

  • Design Guidelines for Data Virtualization
    Data Virtualization for Business Intelligence Systems, 2012
    Co-Authors: Rick F. Van Der Lans
    Abstract:

    This chapter describes a number of guidelines for designing a business intelligence system based on Data Virtualization. These design guidelines are based on experiences from real-life projects in which Data Virtualization was deployed, and they address the design questions that arise in such projects.

  • Introduction to Data Virtualization
    Data Virtualization for Business Intelligence Systems, 2012
    Co-Authors: Rick F. Van Der Lans
    Abstract:

    This chapter explains how Data Virtualization can be used to develop more agile business intelligence systems. By applying Data Virtualization, it will become easier to change systems. New reports can be developed, and existing reports can be adapted more easily and quickly. This agility is an important aspect for users of business intelligence systems. Their world is changing faster and faster, and therefore their business intelligence systems have to change at the same pace.

  • Data Virtualization Server: Caching of Virtual Tables
    Data Virtualization for Business Intelligence Systems, 2012
    Co-Authors: Rick F. Van Der Lans
    Abstract:

    A Data Virtualization server consumes CPU cycles and therefore adds to the response time of queries executed by the Data consumers. For most queries, however, the added processing time is minimal. The performance of a query is determined by the time consumed by the Data Virtualization server plus the time used by the underlying Data store(s); the former typically accounts for only a small fraction, and the latter for most, of the total processing time. A Data Virtualization server can deploy several techniques to improve the performance of queries. These techniques fall into two groups: caching and query optimization. This chapter describes caching.
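
    The caching technique the chapter refers to can be illustrated as a thin memoizing layer in front of a virtual table. The sketch below is not from the book; all names (`CachedVirtualTable`, `fetch`, `ttl_seconds`) are hypothetical:

```python
import time

class CachedVirtualTable:
    """Minimal sketch of a virtual table whose contents are cached.

    `fetch` stands in for the (expensive) query against the underlying
    data store; `ttl_seconds` bounds how stale the cache may become.
    """

    def __init__(self, fetch, ttl_seconds=300):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self._cache = None
        self._loaded_at = 0.0

    def rows(self):
        # Serve from cache while it is fresh; otherwise re-query the store.
        if self._cache is None or time.time() - self._loaded_at > self.ttl:
            self._cache = self.fetch()
            self._loaded_at = time.time()
        return self._cache

# Usage: the first call hits the data store, later calls are served from cache.
calls = []
def query_store():
    calls.append(1)
    return [("Alice", 100), ("Bob", 200)]

table = CachedVirtualTable(query_store, ttl_seconds=60)
table.rows()
table.rows()
# len(calls) == 1: the second call was answered from the cache
```

    The trade-off the chapter discusses is visible even here: the cache removes data-store latency from repeated queries, at the price of potentially stale results within the TTL window.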

Gagan Agrawal - One of the best experts on this subject based on the ideXlab platform.

  • Automatic and efficient Data Virtualization system for scientific Datasets
    2006
    Co-Authors: Gagan Agrawal, L. Weng
    Abstract:

    There are a number of reasons why efficient access to and high-performance processing of scientific Datasets are challenging. First, scientific Datasets are typically stored as binary or character flat-files. Second, Data servers need to efficiently serve an increasing number of clients and types of queries as more Data comes online. To address these issues, we concentrated on the following areas: (1) realizing Data Virtualization through automatically generated Data services over scientific Datasets; (2) supporting Data analysis processing by means of SQL-3 queries and aggregations for the Data Virtualization system; (3) designing new techniques toward efficient execution of Data analysis queries using space-partitioned partial replicas; (4) generalizing the functionalities of the replica selection module according to two significant extensions; (5) exploring the performance optimization potential of multiple queries over massive Datasets. In view of the first challenge, we developed a meta-Data descriptor and a compiler-oriented Data Virtualization system. We designed a meta-Data description language that is used for specifying low-level characteristics of Datasets. A scientist can explore a subset of interest and apply complex processing over it using declarative SQL-3 queries and aggregations. Compiler algorithms that use the meta-Data descriptor and analyze the aggregations were developed to generate efficient Data subsetting and Data aggregation services automatically. In view of the second challenge, we investigated one type of optimization technique: Partial Replication. We proposed and implemented a greedy algorithm, based on a cost metric, to choose the best combination of partial replicas. Moreover, to generalize the work to a more realistic environment, we extended it to range and aggregate queries with both space-partitioned and attribute-partitioned partial replicas, which may be stored unevenly or uniformly across distributed storage units. Using a new cost metric, a composite replica selection algorithm comprising dynamic-programming and greedy strategies was devised to resolve this problem. Finally, we further explored the optimization potential of executing multiple queries over massive Datasets. These techniques are implemented in a Replica Selection Module that is coupled tightly with the overall architecture of our Automatic Data Virtualization System.
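
    The greedy, cost-metric-driven replica selection mentioned above can be sketched for a simplified one-dimensional case. This is an illustration only, not the authors' algorithm; the cost model (cost per newly covered unit of the query range) and all names are assumptions:

```python
def greedy_replica_selection(query_range, replicas):
    """Greedy sketch of cost-based partial-replica selection.

    query_range: (lo, hi) interval requested by the query.
    replicas: list of (lo, hi, cost) partial replicas.
    Repeatedly picks the replica with the lowest cost per newly covered
    unit, extending contiguous coverage from the left, until the query
    range is covered or no replica can make progress.
    """
    lo, hi = query_range
    chosen, covered_to = [], lo
    remaining = list(replicas)
    while covered_to < hi and remaining:
        def gain(r):
            r_lo, r_hi, _ = r
            if r_lo > covered_to:
                return 0  # would leave a gap in the coverage
            return max(0, min(r_hi, hi) - covered_to)
        candidates = [r for r in remaining if gain(r) > 0]
        if not candidates:
            break  # the uncovered remainder falls back to the base dataset
        best = min(candidates, key=lambda r: r[2] / gain(r))
        chosen.append(best)
        covered_to = min(best[1], hi)
        remaining.remove(best)
    return chosen, covered_to

# Two cheap overlapping replicas beat one expensive full replica here.
chosen, covered_to = greedy_replica_selection(
    (0, 100),
    [(0, 50, 5.0), (40, 100, 8.0), (0, 100, 30.0)])
# chosen == [(0, 50, 5.0), (40, 100, 8.0)], covered_to == 100
```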

  • LCPC - Supporting XML based high-level abstractions on HDF5 Datasets: a case study in automatic Data Virtualization
    Lecture Notes in Computer Science, 2005
    Co-Authors: Swarup Kumar Sahoo, Gagan Agrawal
    Abstract:

    Recently, we have been focusing on the notion of automatic Data Virtualization. The goal is to enable the automatic creation of efficient Data services to support a high-level, or virtual, view of the Data. The application developers express the processing assuming this virtual view, whereas the Data is stored in a low-level format. The compiler uses the information about the low-level layout, and the relationship between the virtual and low-level layouts, to generate efficient low-level Data processing code. In this paper, we describe a specific implementation of this approach. We provide XML-based abstractions on Datasets stored in the Hierarchical Data Format (HDF). A high-level XML Schema provides a logical view of the HDF5 Dataset, hiding the actual layout details. Based on this view, the processing is specified using XQuery, the XML query language developed by the World Wide Web Consortium (W3C). The HDF5 Data layout is exposed to the compiler using a low-level XML Schema, and the relationship between the high-level and low-level Schemas is expressed using a Mapping Schema. We describe how our compiler can generate efficient code to access and process HDF5 Datasets using this information. A number of issues are addressed to ensure high locality in processing of the Datasets; these arise mainly because of the high-level nature of XQuery and because the actual Data layout is abstracted.
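
    The high-level Schema / low-level Schema / Mapping Schema triple can be illustrated with a toy analogue. The sketch below deliberately uses neither HDF5 nor XML: a nested dict stands in for the HDF5 file, a plain dict for the Mapping Schema, and every name is hypothetical:

```python
# A "file" of low-level datasets, standing in for HDF5 groups/datasets.
low_level_file = {
    "/grid/vals/temp_k": [280.0, 281.5, 279.9],
    "/grid/vals/pres_pa": [101325.0, 101200.0, 101410.0],
}

# The mapping: logical field name -> physical dataset path.
mapping_schema = {
    "temperature": "/grid/vals/temp_k",
    "pressure": "/grid/vals/pres_pa",
}

def select(field, predicate):
    """Evaluate a high-level selection against the low-level layout.

    Resolving `field` through the mapping is the (trivial) analogue of
    what the paper's compiler does when it translates an XQuery over the
    high-level Schema into code over the physical HDF5 layout.
    """
    path = mapping_schema[field]
    return [v for v in low_level_file[path] if predicate(v)]

# The user queries the logical view, never the physical paths.
warm = select("temperature", lambda t: t > 280.0)
# warm == [281.5]
```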

  • Supporting XML based high-level abstractions on HDF5 Datasets: a case study in automatic Data Virtualization
    IEEE International Conference on High Performance Computing Data and Analytics, 2004
    Co-Authors: Swarup Kumar Sahoo, Gagan Agrawal

  • An approach for automatic Data Virtualization
    Proceedings. 13th IEEE International Symposium on High performance Distributed Computing 2004., 2004
    Co-Authors: L. Weng, Umit Catalyurek, Gagan Agrawal, S. Narayanan, Joel Saltz
    Abstract:

    Analysis of large and/or geographically distributed scientific Datasets is emerging as a key component of grid computing. One challenge in this area is that scientific Datasets are typically stored as binary or character flat-files, which makes specification of processing much harder. In view of this, there has been recent interest in Data Virtualization, and in Data services to support such Virtualization. This paper presents an approach for automatically creating Data services to support Data Virtualization. Specifically, we show how a relational-table-like Data abstraction can be supported for complex multidimensional scientific Datasets that are resident on a cluster. We have designed and implemented a tool that processes SQL queries (with select and where statements) on multidimensional Datasets. We have designed a meta-Data description language that is used for specifying the Data layout. From such a description, our tool automatically generates efficient Data subsetting and access functions. We have extensively evaluated our system. The key observations from our experiments are as follows. First, our tool can correctly and efficiently handle a variety of different Data layouts. Second, our system scales well as the number of nodes or the amount of Data is scaled. Third, the performance of the automatically generated code for the indexing and contracting functions is quite comparable to that of hand-written code.
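
    The flavor of a generated subsetting/access function can be imagined along these lines: a fixed binary record layout plus a scan that applies a where-predicate over raw bytes. The record layout and all names below are invented for illustration and are not the paper's meta-Data language:

```python
import struct
from io import BytesIO

# Hypothetical record layout for a flat-file: x int32, y int32, value float64.
RECORD = struct.Struct("<iid")

def write_records(buf, records):
    # Produce the binary flat-file the "generated" code will scan.
    for rec in records:
        buf.write(RECORD.pack(*rec))

def scan(buf, where):
    """SELECT * FROM dataset WHERE where(x, y, value) -- over raw bytes."""
    buf.seek(0)
    out = []
    while chunk := buf.read(RECORD.size):
        x, y, value = RECORD.unpack(chunk)
        if where(x, y, value):
            out.append((x, y, value))
    return out

# Usage: a where-clause evaluated directly against the flat-file.
buf = BytesIO()
write_records(buf, [(0, 0, 1.5), (0, 1, 2.5), (1, 0, 3.5)])
hits = scan(buf, lambda x, y, v: v > 2.0)
# hits == [(0, 1, 2.5), (1, 0, 3.5)]
```

    The point of the paper's approach is that `RECORD` and `scan` would not be hand-written as here, but generated by the tool from a declarative layout description.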

  • HPDC - An approach for automatic Data Virtualization
    2004
    Co-Authors: L. Weng, Gagan Agrawal, Umit Catalyurek, S. Narayanan, T. Kur, Joel H. Saltz

Anastasia Ailamaki - One of the best experts on this subject based on the ideXlab platform.

  • Just-In-Time Data Virtualization: Lightweight Data Management with ViDa
    Conference on Innovative Data Systems Research, 2015
    Co-Authors: Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, Anastasia Ailamaki
    Abstract:

    As the size and heterogeneity of Data increase, traditional Database system architecture becomes an obstacle to Data analysis. Integrating and ingesting (loading) Data into Databases is quickly becoming a bottleneck in the face of massive Data as well as increasingly heterogeneous Data formats. Still, state-of-the-art approaches typically rely on copying and transforming Data into one (or a few) repositories. Queries, on the other hand, are often ad hoc and supported by pre-cooked operators that are not adaptive enough to optimize access to the Data. As Data formats and queries increasingly vary, there is a need to depart from the current status quo of static query processing primitives and to build dynamic, fully adaptive architectures. We built ViDa, a system that reads Data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is the use of Virtualization, i.e., abstracting Data and manipulating it regardless of its original format, and the dynamic generation of operators. ViDa’s query engine is generated just in time; its caches and its query operators adapt to the current query and the workload, while also treating raw Datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous Data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis.
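
    The in-situ idea, querying raw files without first loading them into a database, can be sketched with plain generators. This is only a loose analogy for ViDa's just-in-time operators, not a description of its actual design; the data and names are invented:

```python
import csv
from io import StringIO

# A raw CSV "file" queried in place, with no load/transform step.
raw = StringIO("name,amount\nalice,100\nbob,250\ncarol,40\n")

def scan_csv(fileobj):
    # Parse lazily, row by row, only when a query actually pulls rows.
    for row in csv.DictReader(fileobj):
        yield {"name": row["name"], "amount": int(row["amount"])}

def where(rows, pred):
    # A pipelined selection operator over the raw scan.
    return (r for r in rows if pred(r))

# Usage: the query runs directly over the raw bytes of the CSV.
big = [r["name"] for r in where(scan_csv(raw), lambda r: r["amount"] > 90)]
# big == ["alice", "bob"]
```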

  • CIDR - Just-In-Time Data Virtualization: Lightweight Data Management with ViDa
    2015
    Co-Authors: Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, Anastasia Ailamaki

Manos Karpathiotakis - One of the best experts on this subject based on the ideXlab platform.

  • Just-In-Time Data Virtualization: Lightweight Data Management with ViDa
    Conference on Innovative Data Systems Research, 2015
    Co-Authors: Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, Anastasia Ailamaki
    Abstract:

    As the size and heterogeneity of Data increase, traditional Database system architecture becomes an obstacle to Data analysis. Integrating and ingesting (loading) Data into Databases is quickly becoming a bottleneck in the face of massive Data as well as increasingly heterogeneous Data formats. Still, state-of-the-art approaches typically rely on copying and transforming Data into one (or a few) repositories. Queries, on the other hand, are often ad hoc and supported by pre-cooked operators that are not adaptive enough to optimize access to the Data. As Data formats and queries increasingly vary, there is a need to depart from the current status quo of static query processing primitives and to build dynamic, fully adaptive architectures. We built ViDa, a system that reads Data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is the use of Virtualization, i.e., abstracting Data and manipulating it regardless of its original format, and the dynamic generation of operators. ViDa’s query engine is generated just in time; its caches and its query operators adapt to the current query and the workload, while also treating raw Datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous Data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis.

  • CIDR - Just-In-Time Data Virtualization: Lightweight Data Management with ViDa
    2015
    Co-Authors: Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, Anastasia Ailamaki

Swarup Kumar Sahoo - One of the best experts on this subject based on the ideXlab platform.

  • LCPC - Supporting XML based high-level abstractions on HDF5 Datasets: a case study in automatic Data Virtualization
    Lecture Notes in Computer Science, 2005
    Co-Authors: Swarup Kumar Sahoo, Gagan Agrawal
    Abstract:

    Recently, we have been focusing on the notion of automatic Data Virtualization. The goal is to enable the automatic creation of efficient Data services to support a high-level, or virtual, view of the Data. The application developers express the processing assuming this virtual view, whereas the Data is stored in a low-level format. The compiler uses the information about the low-level layout, and the relationship between the virtual and low-level layouts, to generate efficient low-level Data processing code. In this paper, we describe a specific implementation of this approach. We provide XML-based abstractions on Datasets stored in the Hierarchical Data Format (HDF). A high-level XML Schema provides a logical view of the HDF5 Dataset, hiding the actual layout details. Based on this view, the processing is specified using XQuery, the XML query language developed by the World Wide Web Consortium (W3C). The HDF5 Data layout is exposed to the compiler using a low-level XML Schema, and the relationship between the high-level and low-level Schemas is expressed using a Mapping Schema. We describe how our compiler can generate efficient code to access and process HDF5 Datasets using this information. A number of issues are addressed to ensure high locality in processing of the Datasets; these arise mainly because of the high-level nature of XQuery and because the actual Data layout is abstracted.

  • Supporting XML based high-level abstractions on HDF5 Datasets: a case study in automatic Data Virtualization
    IEEE International Conference on High Performance Computing Data and Analytics, 2004
    Co-Authors: Swarup Kumar Sahoo, Gagan Agrawal