Data Preprocessing

The Experts below are selected from a list of 59988 Experts worldwide ranked by ideXlab platform

Francisco Herrera - One of the best experts on this subject based on the ideXlab platform.

  • DPASF: a Flink library for streaming data preprocessing
    Big Data Analytics, 2019
    Co-Authors: Alejandro Alcalde-Barros, Salvador García, Diego García-Gil, Francisco Herrera
    Abstract:

    Background: Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widespread data preprocessing techniques. Although we can find many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent and novel Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper, we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. The library is composed of six of the most popular and widely used data preprocessing algorithms. It contains three algorithms for discretization and three algorithms for feature selection. Results: The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short period of time. Conclusion: DPASF contains algorithms that are useful when dealing with Big Data streams. The preprocessing algorithms included in the library are able to tackle big datasets efficiently and to correct imperfections in the data.

  • A survey on data preprocessing for data stream mining
    Neurocomputing, 2017
    Co-Authors: Sergio Ramírez-Gallego, Salvador García, Bartosz Krawczyk, Michał Woźniak, Francisco Herrera
    Abstract:

    Data preprocessing and reduction have become essential techniques in current knowledge discovery scenarios, dominated by increasingly large datasets. These methods aim at reducing the complexity inherent to real-world datasets, so that they can be easily processed by current data mining solutions. Advantages of such approaches include, among others, a faster and more precise learning process and a more understandable structure of raw data. However, data preprocessing techniques for data streams still have a long road ahead of them, even though online learning is growing in importance thanks to the development of the Internet and of technologies for massive data collection. Throughout this survey, we summarize, categorize and analyze those contributions on data preprocessing that cope with streaming data. This work also takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization). To enrich our study, we conduct thorough experiments using the most relevant contributions and present an analysis of their predictive performance, reduction rates, computational time, and memory usage. Finally, we offer general advice about existing data stream preprocessing algorithms and discuss emerging challenges to be faced in the domain of data stream preprocessing.

  • A snapshot of image pre-processing for convolutional neural networks: case study of MNIST
    International Journal of Computational Intelligence Systems, 2017
    Co-Authors: Siham Tabik, Daniel Peralta, Andrés Herrera-Poyatos, Francisco Herrera
    Abstract:

    In the last five years, deep learning methods, and particularly Convolutional Neural Networks (CNNs), have exhibited excellent accuracies in many pattern classification problems. Most of the state-of-the-art models apply data-augmentation techniques at the training stage. This paper provides a brief tutorial on data preprocessing and shows its benefits by using the competitive MNIST handwritten digits classification problem. We show and analyze the impact of different preprocessing techniques on the performance of three CNNs (LeNet, Network3 and DropConnect), together with their ensembles. The analyzed transformations are centering, elastic deformation, translation, rotation and different combinations of them. Our analysis demonstrates that data-preprocessing techniques, such as the combination of elastic deformation and rotation, together with ensembles have a high potential to further improve the state-of-the-art accuracy in MNIST classification.

  • Tutorial on practical tips of the most influential data preprocessing algorithms in data mining
    Knowledge Based Systems, 2016
    Co-Authors: Julián Luengo, Salvador García, Francisco Herrera
    Abstract:

    Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion of its impact, and a review of current and further research on it. These most influential algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization and preprocessing for imbalanced data. Together they constitute some of the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study, selected after a recent, generic book on data preprocessing that does not cover them in depth. This manuscript also presents an illustrative study in two sections with different data sets that provide useful tips for the use of preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets. The reader may find useful insights on the different characteristics and outcomes generated by them. Second, we use a real-world problem presented in the ECDBL’2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers.

  • Big data preprocessing: methods and prospects
    Big Data Analytics, 2016
    Co-Authors: Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez, Francisco Herrera
    Abstract:

    The massive growth in the scale of data observed in recent years is a key factor of the Big Data scenario. Big Data can be defined as high volume, velocity and variety of data that require new high-performance processing. Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. The presence of data preprocessing methods for data mining in big data is reviewed in this paper. The definition, characteristics, and categorization of data preprocessing approaches in big data are introduced. The connection between big data and data preprocessing throughout all families of methods and big data technologies is also examined, including a review of the state of the art. In addition, research challenges are discussed, with a focus on developments in different big data frameworks, such as Hadoop, Spark and Flink, and on encouraging substantial research efforts in some families of data preprocessing methods and in applications to new big data learning paradigms.
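
Several of the abstracts above mention discretization of data streams. As a rough illustration of what discretizing a stream involves, here is a minimal Python sketch of an incremental equal-width discretizer. It is not one of the algorithms shipped in DPASF or evaluated in the survey; the class name, its methods and the choice of equal-width bins are assumptions made only for this example.

```python
class StreamingEqualWidthDiscretizer:
    """Hypothetical helper: equal-width bins over the range seen so far."""

    def __init__(self, n_bins=4):
        self.n_bins = n_bins
        self.lo = None  # smallest value observed so far
        self.hi = None  # largest value observed so far

    def update(self, x):
        # Widen the observed range with one new value from the stream.
        self.lo = x if self.lo is None else min(self.lo, x)
        self.hi = x if self.hi is None else max(self.hi, x)

    def transform(self, x):
        # Map a value to a bin index in [0, n_bins - 1].
        if self.lo is None or self.hi == self.lo:
            return 0  # no usable range yet
        frac = (x - self.lo) / (self.hi - self.lo)
        idx = int(frac * self.n_bins)
        # Values outside the observed range are clamped to the edge bins.
        return max(0, min(self.n_bins - 1, idx))


disc = StreamingEqualWidthDiscretizer(n_bins=4)
for value in [0.0, 10.0, 5.0, 2.5, 7.5]:
    disc.update(value)
    print(value, "->", disc.transform(value))
```

Because the bin edges move as the observed range grows, the same value can land in different bins at different times. Handling that kind of drift is one reason stream-specific discretizers exist.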

Salvador García - One of the best experts on this subject based on the ideXlab platform.

  • DPASF: a Flink library for streaming data preprocessing
    Big Data Analytics, 2019
    Co-Authors: Alejandro Alcalde-Barros, Salvador García, Diego García-Gil, Francisco Herrera

  • Tutorial on practical tips of the most influential data preprocessing algorithms in data mining
    Knowledge Based Systems, 2016
    Co-Authors: Julián Luengo, Salvador García, Francisco Herrera

  • Big data preprocessing: methods and prospects
    Big Data Analytics, 2016
    Co-Authors: Salvador García, Sergio Ramírez-Gallego, Julián Luengo, José Manuel Benítez, Francisco Herrera

C L Wu - One of the best experts on this subject based on the ideXlab platform.

  • Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques
    Journal of Hydrology, 2010
    Co-Authors: C L Wu, Kwokwing Chau
    Abstract:

    This study is an attempt to seek a relatively optimal data-driven model for rainfall forecasting from three aspects: model inputs, modeling methods, and data-preprocessing techniques. Four rain data records from different regions, namely two monthly and two daily series, are examined. A comparison of seven input techniques, either linear or nonlinear, indicates that linear correlation analysis (LCA) is capable of identifying model inputs reasonably. A proposed model, the modular artificial neural network (MANN), is compared with three benchmark models, viz. artificial neural network (ANN), K-nearest-neighbors (K-NN), and linear regression (LR). Prediction is performed in two modes: normal mode (viz., without data preprocessing) and data preprocessing mode. Results from the normal mode indicate that MANN performs the best among all four models, but the advantage of MANN over ANN is not significant in monthly rainfall series forecasting. Under the data preprocessing mode, each of LR, K-NN and ANN is coupled with three data-preprocessing techniques: moving average (MA), principal component analysis (PCA), and singular spectrum analysis (SSA). Results indicate that the improvement in model performance generated by SSA is considerable, whereas the improvements from MA and PCA are slight. Moreover, when MANN is coupled with SSA, results show that the advantages of MANN over the other models are quite noticeable, particularly for daily rainfall forecasting. Therefore, the proposed optimal rainfall forecasting model can be derived from MANN coupled with SSA.
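
Of the three preprocessing techniques named in the abstract above, the moving average (MA) is the simplest to show in code. The Python sketch below smooths a series with a trailing window and then builds lagged input/target pairs for a forecasting model. The function names, the window length and the lag count are assumptions for illustration, and the abstract does not say how the authors configured MA.

```python
def moving_average(series, window):
    """Trailing moving average; output has len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    smoothed = [running / window]
    for i in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest one.
        running += series[i] - series[i - window]
        smoothed.append(running / window)
    return smoothed


def lagged_pairs(series, n_lags):
    """Build (previous n_lags values, next value) pairs for a forecaster."""
    return [(series[t - n_lags:t], series[t]) for t in range(n_lags, len(series))]


rain = [1, 2, 3, 4, 5]
smooth = moving_average(rain, window=3)
print(smooth)                    # smoothed series
print(lagged_pairs(smooth, 2))   # inputs and targets built from smoothed data
```

A trailing window only uses past values, so the smoothed input at time t does not leak future observations into the forecast.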

Kwokwing Chau - One of the best experts on this subject based on the ideXlab platform.

  • Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques
    Journal of Hydrology, 2010
    Co-Authors: C L Wu, Kwokwing Chau

  • Predicting monthly streamflow using data-driven models coupled with data-preprocessing techniques
    Water Resources Research, 2009
    Co-Authors: Kwokwing Chau
    Abstract:

    In this paper, the accuracy of monthly streamflow forecasts is discussed when using data-driven modeling techniques on the streamflow series. A crisp distributed support vectors regression (CDSVR) model was proposed for monthly streamflow prediction in comparison with four other models: autoregressive moving average (ARMA), K-nearest neighbors (KNN), artificial neural networks (ANNs), and crisp distributed artificial neural networks (CDANN). With respect to the distributed models CDSVR and CDANN, the fuzzy C-means (FCM) clustering technique first split the flow data into three subsets (low, medium, and high levels) according to the magnitudes of the data, and then three single SVRs (or ANNs) were fitted to the three subsets. This paper gives a detailed analysis of the reconstruction of dynamics that was used to identify the configuration of all models except ARMA. To improve the model performance, the data-preprocessing techniques of singular spectrum analysis (SSA) and/or moving average (MA) were coupled with all five models. Some discussions were presented on (1) the number of neighbors in KNN; (2) the configuration of ANN; and (3) the effects of MA and SSA. Two streamflow series from different locations in China (Xiangjiaba and Danjiangkou) were used for the forecasting analysis. Forecasts were conducted at four different horizons (1-, 3-, 6-, and 12-month-ahead forecasts). The results showed that models fed by preprocessed data performed better than models fed by original data, and CDSVR outperformed the other models except at the 6-month-ahead horizon for Danjiangkou. From the perspective of the streamflow series, SSA exhibited better effects on the Danjiangkou data because its raw discharge series was more complex than the discharge of Xiangjiaba. The MA considerably improved the performance of ANN, CDANN, and CDSVR by adjusting the correlation relationship between input components and output of models. It was also found that the performance of CDSVR deteriorated with the increase of the forecast horizon.
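
The abstract above describes splitting flow data into low, medium and high subsets with fuzzy C-means before fitting one model per subset. The following is a minimal one-dimensional fuzzy C-means sketch in plain Python, followed by a crisp assignment of each value to its nearest centre. The function name, the evenly spaced initial centres, the fuzzifier m = 2 and the toy data are all assumptions for this example, not the configuration used in the paper.

```python
def fuzzy_c_means_1d(values, n_clusters=3, m=2.0, n_iter=100, tol=1e-6):
    """Hypothetical 1-D fuzzy C-means; returns (centres, crisp labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centres spread evenly across the data range.
    centres = [lo + (hi - lo) * (j + 0.5) / n_clusters for j in range(n_clusters)]
    for _ in range(n_iter):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centres]
            if min(dists) == 0.0:
                # The point sits exactly on a centre: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                u = [h / sum(hits) for h in hits]
            else:
                # Standard FCM membership: proportional to d^(-2/(m-1)).
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                u = [v / sum(inv) for v in inv]
            memberships.append(u)
        new_centres = []
        for j in range(n_clusters):
            # Each centre is the mean of the data weighted by membership^m.
            weights = [u[j] ** m for u in memberships]
            new_centres.append(sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    # Crisp split: each value goes to the subset of its nearest centre.
    labels = [min(range(n_clusters), key=lambda j: abs(x - centres[j])) for x in values]
    return centres, labels


flows = [1, 2, 3, 50, 52, 54, 100, 101, 103]
centres, labels = fuzzy_c_means_1d(flows)
print(centres)
print(labels)  # 0 = low, 1 = medium, 2 = high
```

With the labels in hand, one regressor per subset could be trained on the values sharing a label, which is the general idea behind the "crisp distributed" models mentioned in the abstract.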

Michal Munk - One of the best experts on this subject based on the ideXlab platform.

  • User Identification in the Process of Web Usage Data Preprocessing
    International Journal of Emerging Technologies in Learning (iJET), 2019
    Co-Authors: Jozef Kapusta, Michal Munk, Dominik Halvoník, Martin Drlik
    Abstract:

    If we are talking about user behavior analytics, we have to understand what the main source of valuable information is. One of these sources is definitely a web server. There are multiple places from which we can extract the necessary data. The most common are the access log, error log and custom log files of the web server, the proxy server log file, the web browser log, browser cookies, etc. A web server log in its default form is known as a Common Log File (W3C, 1995) and keeps information about the IP address, the date and time of the visit, and the accessed and referenced resource. There are standardized methodologies which contain several steps leading to the extraction of new knowledge from the provided data. Usually, the first step in each of them is to identify users, users’ sessions, page views, and clickstreams. This process is called pre-processing. The main goal of this stage is to take the unprocessed web server log file as input and, after processing, output meaningful representations which can be used in the next phase. In this paper, we describe in detail user session identification, which can be considered the most important part of data pre-processing. Our paper aims to compare user/session identification using the STT with user/session identification using cookies. This comparison was performed with respect to the quality of the sequential rules generated, i.e., with respect to the generation of useful, trivial and inexplicable rules.

  • Using Entropy in Web Usage Data Preprocessing
    Entropy, 2018
    Co-Authors: Michal Munk, Lubomir Benko
    Abstract:

    The paper examines the use of entropy in the field of web usage mining. Entropy offers an alternative way of determining the ratio of auxiliary pages in session identification using the Reference Length method. The experiment was conducted on two different web portals. The first log file was obtained from a course in a virtual learning environment web portal. The second log file was obtained from a web portal with anonymous access. A comparison of the entropy-based estimate of the ratio of auxiliary pages with the sitemap-based estimate showed that in the case of sitemap abundance, entropy could be a full-valued substitute for the estimate of the ratio of auxiliary pages.

  • Data preprocessing evaluation for web log mining: reconstruction of activities of a web visitor
    International Conference on Conceptual Structures, 2010
    Co-Authors: Michal Munk, Jozef Kapusta, Peter Svec
    Abstract:

    The prerequisite of any data analysis is the data themselves, regardless of the focus of the analysis (visit rate analysis, optimization of a portal, personalization of a portal, etc.). The results of the selected analysis depend highly on the quality of the analyzed data. In the case of portal usage analysis, these data can be obtained by monitoring the web server log file. Based on these data, we are able to create data matrices and a web map which serve for searching for users' behaviour patterns. Data preparation from the log file represents the most time-consuming phase of the whole analysis. We carried out an experiment to find out to what extent this time-consuming data preparation is necessary. We aimed at specifying the steps that are indispensable for obtaining valid data from the log file. In particular, we focused on the reconstruction of the activities of the web visitor. This advanced data preprocessing technique is itself time-consuming. In the article we tried to assess the impact of the reconstruction of the activities of a web visitor on the quantity and quality of the extracted rules which represent the web users’ behaviour patterns.
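
The web usage abstracts above revolve around turning raw web server log lines into user sessions. The Python sketch below parses lines in the Common Log Format and starts a new session whenever the gap between two consecutive requests from the same IP address exceeds a time threshold. Identifying users by IP address alone, the 30-minute threshold and all function names are assumptions for illustration. The abstracts do not define their "STT" method, so this should not be read as the authors' procedure.

```python
import re
from datetime import datetime, timedelta, timezone

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_timestamp(text):
    """Parse a CLF timestamp such as '10/Oct/2000:13:55:36 -0700'."""
    stamp, offset = text.split()
    day, month, rest = stamp.split("/")
    year, hour, minute, second = rest.split(":")
    sign = -1 if offset[0] == "-" else 1
    tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5])))
    return datetime(int(year), MONTHS[month], int(day),
                    int(hour), int(minute), int(second), tzinfo=tz)


def sessionize(lines, threshold=timedelta(minutes=30)):
    """Group requested paths into sessions per IP, split on long time gaps."""
    hits_by_ip = {}
    for line in lines:
        match = CLF.match(line)
        if match is None:
            continue  # skip lines that are not in Common Log Format
        parts = match["req"].split()
        path = parts[1] if len(parts) >= 2 else match["req"]
        hits_by_ip.setdefault(match["ip"], []).append((parse_timestamp(match["ts"]), path))
    sessions = []
    for ip, hits in hits_by_ip.items():
        hits.sort()  # chronological order within one IP
        current = [hits[0][1]]
        for (prev_time, _), (time, path) in zip(hits, hits[1:]):
            if time - prev_time > threshold:
                sessions.append((ip, current))  # gap too long: close the session
                current = []
            current.append(path)
        sessions.append((ip, current))
    return sessions


log = [
    '10.0.0.1 - - [10/Oct/2000:13:00:00 +0000] "GET /a HTTP/1.0" 200 100',
    '10.0.0.1 - - [10/Oct/2000:13:10:00 +0000] "GET /b HTTP/1.0" 200 100',
    '10.0.0.1 - - [10/Oct/2000:14:00:00 +0000] "GET /c HTTP/1.0" 200 100',
    '10.0.0.2 - - [10/Oct/2000:13:05:00 +0000] "GET /a HTTP/1.0" 200 100',
]
for ip, pages in sessionize(log):
    print(ip, pages)
```

Cookie-based identification, which the first abstract compares against, would replace the IP address key with a cookie value taken from a custom log field.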