Data Integration Process

The experts below are selected from a list of 150,195 experts worldwide, ranked by the ideXlab platform.

Markus Helfert - One of the best experts on this subject based on the ideXlab platform.

  • Data quality problems in TPC-DI based Data Integration Processes
    2018
    Co-Authors: Qishan Yang, Markus Helfert
    Abstract:

    Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous sources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. There are various alternative data integration systems, built in-house or provided by vendors, so an organisation needs to compare and benchmark them when choosing one that meets its requirements. Recently, TPC-DI was proposed as the first industry benchmark for evaluating data integration systems. When using this benchmark, we found several typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which can delay the data integration process or even cause it to fail. This paper explains the processes of this benchmark and summarises the typical data quality problems identified in the TPC-DI data source. Furthermore, to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
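
    To make the two problem classes concrete, here is a minimal Python sketch of a pre-load screening step that flags them in a multi-record-type source file. The record layouts, field names and the sample file name (finwire_sample.csv) are illustrative assumptions, not the actual TPC-DI FINWIRE specification.

        import csv

        # Hypothetical layouts: the same positional field means different
        # things depending on the record type (a "multi-meaning attribute").
        EXPECTED_SCHEMAS = {
            "CMP": ["rec_type", "company_name", "cik"],  # field 2 = company name
            "SEC": ["rec_type", "symbol", "cik"],        # field 2 = ticker symbol
        }

        def screen_source(path):
            """Return (line number, message) pairs for records that violate
            the declared layouts; run before the load phase of an ETL job."""
            problems = []
            with open(path, newline="") as f:
                for line_no, row in enumerate(csv.reader(f), start=1):
                    rec_type = row[0] if row else ""
                    schema = EXPECTED_SCHEMAS.get(rec_type)
                    if schema is None:
                        problems.append((line_no, f"unknown record type {rec_type!r}"))
                    elif len(row) != len(schema):
                        # Inconsistent schema: arity differs from the layout.
                        problems.append((line_no, f"{rec_type}: expected {len(schema)} fields, got {len(row)}"))
            return problems

        if __name__ == "__main__":
            for line_no, msg in screen_source("finwire_sample.csv"):
                print(f"line {line_no}: {msg}")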

  • ICEIS (Revised Selected Papers) - Data Quality Problems in TPC-DI Based Data Integration Processes
    Enterprise Information Systems, 2018
    Co-Authors: Qishan Yang, Markus Helfert
    Abstract: identical to the 2018 entry above.

  • ICEIS (1) - Guidelines of Data Quality Issues for Data Integration in the Context of the TPC-DI Benchmark.
    Proceedings of the 19th International Conference on Enterprise Information Systems, 2017
    Co-Authors: Qishan Yang, Markus Helfert
    Abstract:

    Nowadays, many business intelligence and master data management initiatives are based on regular data integration. Since data integration extracts and combines a variety of data sources, it is considered a prerequisite for data analytics and management. More recently, TPC-DI has been proposed as an industry benchmark for data integration; it is designed to benchmark data integration and to serve as a standard for evaluating ETL performance. There are a variety of data quality problems in source data, such as multi-meaning attributes and inconsistent data schemas, which not only cause problems for the data integration process but also affect further data mining and data analytics. This paper summarises typical data quality problems in data integration and adapts the traditional data quality dimensions to classify them. We found that data completeness, timeliness and consistency are critical for data quality management in data integration, and that data consistency should be further defined at the pragmatic level. To prevent typical data quality problems and proactively manage data quality in ETL, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management in data integration.
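
    As a rough illustration of the three dimensions the paper singles out, the sketch below scores a batch of source records for completeness, timeliness and consistency before loading. The field names, freshness window and reference-table check are assumptions made for the example, not details taken from the paper or from TPC-DI.

        from datetime import datetime, timedelta

        REQUIRED = ("customer_id", "account_id", "trade_date")
        MAX_AGE = timedelta(days=1)  # assumed freshness window for timeliness

        def completeness(records):
            """Fraction of records with every required field populated."""
            ok = sum(all(r.get(f) not in (None, "") for f in REQUIRED)
                     for r in records)
            return ok / len(records) if records else 1.0

        def timeliness(records, now=None):
            """Fraction of records whose trade_date is inside the window."""
            now = now or datetime.utcnow()
            ok = sum(1 for r in records
                     if r.get("trade_date") and now - r["trade_date"] <= MAX_AGE)
            return ok / len(records) if records else 1.0

        def consistency(records, reference_ids):
            """Fraction of records whose customer_id exists in a reference
            table; a syntactic check, whereas the pragmatic-level consistency
            the paper calls for needs domain rules on top of this."""
            ok = sum(1 for r in records if r.get("customer_id") in reference_ids)
            return ok / len(records) if records else 1.0

        if __name__ == "__main__":
            batch = [{"customer_id": "C1", "account_id": "A1",
                      "trade_date": datetime.utcnow()}]
            print(completeness(batch), timeliness(batch),
                  consistency(batch, {"C1"}))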

  • Guidelines of Data quality issues for Data Integration in the context of the TPC-DI benchmark
    2017
    Co-Authors: Qishan Yang, Markus Helfert
    Abstract: identical to the preceding ICEIS 2017 entry.

Qishan Yang - One of the best experts on this subject based on the ideXlab platform.

  • Co-author, with Markus Helfert, of the four TPC-DI publications listed above; the entries and abstracts are identical to those under Markus Helfert.

Janusz R Getta - One of the best experts on this subject based on the ideXlab platform.

  • Query Decomposition Strategy for Integration of Semistructured Data
    Information Integration and Web-based Applications & Services, 2014
    Co-Authors: Handoko, Janusz R Getta
    Abstract:

    Data integration systems provide a unified view of various sources of data distributed over wide-area networks. User requests issued at a central site must be decomposed into a number of sub-requests that are then processed at the remote sites; the results are integrated at the central site and returned to the user. The decomposition strategy for global user requests and the scheduling of sub-requests at the central site have a significant impact on the performance of the data integration process. This paper proposes an efficient decomposition strategy for systems that integrate semistructured data. We define a new system of operations on XML documents to represent XQuery user requests and the results of decomposing such requests. A cost-based optimisation is used to find the optimal size of sub-requests and their optimal scheduling at the central site.
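
    The decompose-then-schedule idea can be sketched with a toy cost model: split a global request into per-source sub-requests, estimate each one's cost, and run the cheapest first. The request representation, catalog and cost formula below are invented for illustration; the paper's actual operations work on XML documents and XQuery requests.

        from dataclasses import dataclass

        @dataclass
        class SubRequest:
            source: str         # remote site that evaluates this fragment
            fragment: str       # query fragment shipped to the site
            est_rows: int       # estimated result size
            est_latency: float  # estimated round-trip cost in seconds

        def decompose(global_request, catalog):
            """One sub-request per source that holds a referenced collection
            (a stand-in for the paper's operations on XML documents)."""
            return [SubRequest(source=site, fragment=f"{global_request} @ {site}",
                               est_rows=meta["rows"], est_latency=meta["latency"])
                    for site, meta in catalog.items()]

        def schedule(subs):
            """Cheapest first, so small, fast sub-requests are available for
            integration at the central site while expensive ones still run."""
            return sorted(subs, key=lambda r: r.est_latency * r.est_rows)

        catalog = {"siteA": {"rows": 10_000, "latency": 0.4},
                   "siteB": {"rows": 500, "latency": 0.1}}
        for sr in schedule(decompose("//order[total>100]", catalog)):
            print(sr.source, sr.fragment)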

  • iiWAS - Query Decomposition Strategy for Integration of Semistructured Data
    Proceedings of the 16th International Conference on Information Integration and Web-based Applications & Services, 2014
    Co-Authors: Handoko, Janusz R Getta
    Abstract: identical to the 2014 entry above.

Xiaogang Ma - One of the best experts on this subject based on the ideXlab platform.

  • ICDE - VisFlow: A Visual Database Integration and Workflow Querying System
    2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017
    Co-Authors: Hasan M. Jamil, Xiaogang Ma
    Abstract:

    The adoption and availability of diverse application design and support platforms are making generic scientific application orchestration increasingly difficult. In such an evolving environment, higher-level abstractions of design primitives are critically important, so that end users have a chance to craft their own applications without a complete technical grasp of the lower-level details. In this research, we introduce VisFlow, a novel scientific workflow design platform that supports, in a single system, high-level tools for data integration, process description and analytics, based on a visual language for naive users together with advanced options for computing-savvy programmers. We describe its salient features and advantages using a complex scientific application in natural resources and ecology. Video: https://youtu.be/ 2YSYVyOuuk.
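
    The abstract's central idea, a workflow captured as a data structure that a visual editor manipulates and an engine then executes, can be suggested with a toy linear node graph. Everything below (node names, functions, execution rule) is invented for illustration and is not VisFlow's actual design.

        def load(_):
            return [3, 1, 2]        # stand-in for a data source

        def sort_step(values):
            return sorted(values)   # stand-in for a transformation node

        def head(values):
            return values[:2]       # stand-in for an analytics/output node

        # The "visual" program: nodes plus the edges a canvas editor draws.
        NODES = {"load": load, "sort": sort_step, "head": head}
        EDGES = [("load", "sort"), ("sort", "head")]

        def run(nodes, edges):
            """Execute a linear chain of nodes in edge order, piping each
            node's output into the next node."""
            value = None
            order = [edges[0][0]] + [dst for _, dst in edges]
            for name in order:
                value = nodes[name](value)
            return value

        print(run(NODES, EDGES))  # [1, 2]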

Faouzia Wadjinny - One of the best experts on this subject based on the ideXlab platform.

  • Managing Network Dynamicity in a Vector Space Model for Semantic P2P Data Integration
    Communications in Computer and Information Science, 2011
    Co-Authors: Ahmed Moujane, Dalila Chiadmi, Laila Benhlima, Faouzia Wadjinny
    Abstract:

    P2P data integration has been one of the prominent research topics in recent years. It rests on two principal axes, data integration and P2P computing, and aims to combine the advantages of both to overcome the shortcomings of centralized solutions. However, dynamicity and large scale are the most difficult challenges for efficient solutions. In this paper, we review the fundamentals of P2P computing and data integration and detail the challenges facing the P2P data integration process. In addition, we present a vector-space-model-based approach within our P2P semantic data integration framework. In a first stage, we detail the various modules of our framework and specify the functions of each one. Then, we present our vector space model for representing semantic knowledge, as well as the components of the knowledge base that holds the semantics. Finally, we explain how we deal with network dynamicity and how the semantics should be adjusted accordingly.
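
    A minimal sketch of the vector-space idea: represent each peer's schema concepts as weighted term vectors and propose semantic mappings between peers by cosine similarity. The vocabulary, weights and threshold are illustrative assumptions; the paper's model and knowledge base are richer.

        import math

        def cosine(u, v):
            """Cosine similarity of two sparse term vectors (term -> weight)."""
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            norm_u = math.sqrt(sum(w * w for w in u.values()))
            norm_v = math.sqrt(sum(w * w for w in v.values()))
            return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

        # Hypothetical concept vectors for two peers' schema elements.
        peer_a = {"customer": {"person": 0.8, "buyer": 0.6}}
        peer_b = {"client": {"person": 0.7, "buyer": 0.5, "account": 0.2}}

        THRESHOLD = 0.6  # assumed cut-off for proposing a mapping
        for concept_a, vec_a in peer_a.items():
            for concept_b, vec_b in peer_b.items():
                sim = cosine(vec_a, vec_b)
                if sim >= THRESHOLD:
                    print(f"map {concept_a!r} <-> {concept_b!r} ({sim:.2f})")

    Under network dynamicity, a peer joining or leaving would then trigger recomputation of exactly the mappings that involve its concepts.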

  • AICCSA - A study in the P2P Data Integration Process
    2009 IEEE ACS International Conference on Computer Systems and Applications, 2009
    Co-Authors: Ahmed Moujane, Dalila Chiadmi, Laila Benhlima, Faouzia Wadjinny
    Abstract:

    In recent years, the issue of heterogeneity and data sharing has been discussed in different contexts and from diverse points of view. Two significant axes stand out: data integration and P2P computing. Data integration aims to hide the heterogeneity of distributed sources, but most data integration solutions have a centralized architecture. The emergence of P2P technologies has changed the way distributed data is managed, offering more scalability and flexibility. Combining the advantages of data integration and P2P technologies would help overcome centralized solutions, but doing so is not challenge-free. PDMS (Peer Data Management System) networks, in particular, have a number of important advantages over earlier, flatter P2P networks. In this paper, we investigate some basic notions of P2P computing and data integration and detail the challenges facing the P2P data integration process.