The Experts below are selected from a list of 6249 Experts worldwide, ranked by the ideXlab platform.
Pramit Basu - One of the best experts on this subject based on the ideXlab platform.
-
Data Cleansing as a transient service
Proceedings - International Conference on Data Engineering, 2010
Co-Authors: Tanveer Afzal Faruquie, Girish Venkatachaliah, K. Hima Prasad, L. Venkata Subramaniam, Shrinivas Kulkarni, Mukesh Mohania, Pramit Basu
Abstract: There is often a transient need within enterprises for Data Cleansing, which can be satisfied by offering Data Cleansing as a transient service. Every time a Data Cleansing need arises, it should be possible to provision hardware, software and staff for accomplishing the task and then dismantle the setup. In this paper we present such a system, which uses virtualized hardware and software for Data Cleansing, and we share actual experiences gained from building it. We use a cloud infrastructure to offer virtualized Data Cleansing instances that can be accessed as a service. The system is scalable, elastic and configurable. Each enterprise has unique needs, which makes it necessary to customize both the infrastructure and the Cleansing algorithms; the system we present is easily configurable to suit the Data Cleansing needs of an enterprise.
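The abstract describes a provision-use-dismantle lifecycle for virtualized cleansing instances. The paper publishes no code or API, so the Python sketch below only illustrates that lifecycle; the CloudBackend class and the toy cleanse step are invented stand-ins for a real IaaS layer and real cleansing logic.

    from contextlib import contextmanager

    class CloudBackend:
        """Hypothetical cloud layer; a real system would call an IaaS API here."""
        def provision(self, image, cpus):
            print(f"provisioning a {cpus}-CPU instance from image '{image}'")
            return {"image": image, "cpus": cpus}
        def destroy(self, instance):
            print("dismantling the instance")

    @contextmanager
    def transient_cleansing_instance(cloud, image="cleansing-vm", cpus=4):
        """Provision a virtualized cleansing instance for one job, then tear it down."""
        instance = cloud.provision(image, cpus)
        try:
            yield instance
        finally:
            cloud.destroy(instance)  # the instance lives only as long as the job

    def cleanse(records):
        # Stand-in cleansing step: collapse whitespace and drop exact duplicates.
        return sorted({" ".join(r.split()) for r in records})

    if __name__ == "__main__":
        dirty = ["Acme  Corp,  New Delhi", "Acme Corp, New Delhi", " Foo  Industries "]
        with transient_cleansing_instance(CloudBackend()) as inst:
            print(cleanse(dirty))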
Min-te Sun - One of the best experts on this subject based on the ideXlab platform.
-
A Bayesian Inference-Based Framework for RFID Data Cleansing
IEEE Transactions on Knowledge and Data Engineering, 2013
Co-Authors: Haiquan Chen, Haixun Wang, Min-te Sun
Abstract: The past few years have witnessed the emergence of an increasing number of applications for tracking and tracing based on radio frequency identification (RFID) technologies. However, raw RFID readings are usually of low quality and may contain numerous anomalies. An ideal solution for RFID Data Cleansing should address the following issues. First, in many applications, duplicate readings of the same object are very common; the solution should take advantage of the resulting Data redundancy for Data cleaning. Second, prior knowledge about the environment may help improve Data quality, and a desired solution must be able to take such knowledge into account. Third, the solution should take advantage of physical constraints in target applications to elevate the accuracy of Data Cleansing. Several RFID Data Cleansing techniques exist, but none of them supports all the aforementioned features. In this paper, we propose a Bayesian inference-based framework for cleaning raw RFID Data. We first design an n-state detection model and formally prove that the three-state model can maximize system performance. We then extend the n-state model to support two-dimensional RFID reader arrays and compute the likelihood efficiently. In addition, we devise a Metropolis-Hastings sampler with constraints, which incorporates constraint management to clean RFID Data with high efficiency and accuracy. Moreover, to support real-time object monitoring, we present a streaming Bayesian inference method to cope with real-time RFID Data streams. Finally, we evaluate the performance of our solutions through extensive experiments.
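As a rough illustration of the redundancy argument (and not of the paper's n-state detection model), the sketch below shows how several noisy readings of the same tag combine with assumed reader detection and false-positive rates in a single Bayesian update; all rates, priors and readings are invented.

    def posterior_presence(readings, p_detect=0.8, p_false=0.05, prior=0.5):
        """readings: 0/1 observations of the same tag from independent readers."""
        like_present = 1.0
        like_absent = 1.0
        for r in readings:
            like_present *= p_detect if r else (1.0 - p_detect)  # P(reading | present)
            like_absent  *= p_false  if r else (1.0 - p_false)   # P(reading | absent)
        num = like_present * prior
        return num / (num + like_absent * (1.0 - prior))

    # Three redundant readers: two see the tag, one misses it.
    print(round(posterior_presence([1, 1, 0]), 3))  # presence stays highly probable despite the miss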
-
Leveraging spatio-temporal redundancy for RFID Data Cleansing
Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), 2010
Co-Authors: Haiquan Chen, Wei-shinn Ku, Haixun Wang, Min-te Sun
Abstract: Radio Frequency Identification (RFID) technologies are used in many applications for Data collection. However, raw RFID readings are usually of low quality and may contain many anomalies. An ideal solution for RFID Data Cleansing should address the following issues. First, in many applications, duplicate readings of the same object (by multiple readers simultaneously or by a single reader over a period of time) are very common; the solution should take advantage of the resulting Data redundancy for Data cleaning. Second, prior knowledge about the readers and the environment (e.g., the prior Data distribution and the false-negative rates of readers) may help improve Data quality and remove Data anomalies, and a desired solution must be able to quantify the degree of uncertainty based on such knowledge. Third, the solution should take advantage of given constraints in target applications (e.g., the number of objects in the same location cannot exceed a given value) to elevate the accuracy of Data Cleansing. A number of RFID Data Cleansing techniques exist, but none of them supports all the aforementioned features. In this paper we propose a Bayesian inference-based approach for cleaning raw RFID Data. Our approach takes full advantage of Data redundancy. To capture the likelihood, we design an n-state detection model and formally prove that the three-state model can maximize system performance. Moreover, in order to sample from the posterior, we devise a Metropolis-Hastings sampler with Constraints (MH-C), which incorporates constraint management to clean raw RFID Data with high efficiency and accuracy. We validate our solution with a common RFID application and demonstrate the advantages of our approach through extensive simulations.
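The MH-C idea sketched below rejects proposed tag placements that violate a known constraint and otherwise accepts or rejects by likelihood ratio. Everything in it, including the two locations, the per-epoch read counts, the detection and false-positive rates and the capacity rule, is invented for illustration and is not taken from the paper.

    import math
    import random
    from collections import Counter

    P_DETECT, P_FALSE, EPOCHS, CAP = 0.8, 0.1, 5, 2
    LOCATIONS = ["dock", "shelf"]
    # READS[tag][loc] = number of epochs in which the reader at loc reported the tag.
    READS = {"t1": {"dock": 4, "shelf": 0},
             "t2": {"dock": 5, "shelf": 1},
             "t3": {"dock": 1, "shelf": 4}}

    def log_lik(assign):
        """Log-likelihood of the read counts given a tag-to-location assignment."""
        ll = 0.0
        for tag, loc in assign.items():
            for l, hits in READS[tag].items():
                p = P_DETECT if l == loc else P_FALSE
                ll += hits * math.log(p) + (EPOCHS - hits) * math.log(1 - p)
        return ll

    def feasible(assign):
        # Constraint: no location may hold more than CAP tags.
        return max(Counter(assign.values()).values()) <= CAP

    def mh_c(iters=20000, seed=1):
        rng = random.Random(seed)
        assign = {"t1": "dock", "t2": "shelf", "t3": "shelf"}  # any feasible start
        cur, counts = log_lik(assign), Counter()
        for _ in range(iters):
            tag, loc = rng.choice(sorted(READS)), rng.choice(LOCATIONS)
            proposal = dict(assign, **{tag: loc})
            if feasible(proposal):                      # constraint management: infeasible moves are rejected
                new = log_lik(proposal)
                if math.log(rng.random()) < new - cur:  # Metropolis acceptance (uniform prior)
                    assign, cur = proposal, new
            counts[tuple(sorted(assign.items()))] += 1
        return counts.most_common(1)[0]

    print(mh_c())  # the most frequently visited assignment is the cleaned placement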
Tanveer Afzal Faruquie - One of the best experts on this subject based on the ideXlab platform.
-
Data Cleansing Techniques for Large Enterprise Datasets
2011 Annual SRII Global Conference, 2011
Co-Authors: K. Hima Prasad, Tanveer Afzal Faruquie, Sachindra Joshi, Snigdha Chaturvedi, L. Venkata Subramaniam, Mukesh Mohania
Abstract: Data quality improvement is an important aspect of enterprise Data management. Data characteristics change with customer, domain and geography, making Data quality improvement a challenging task. It is often an iterative process that mainly involves writing a set of Data quality rules for standardization and for eliminating the duplicates present within the Data. Existing Data Cleansing tools require a fair amount of customization whenever moving from one customer to another and from one domain to another. In this paper, we present a Data quality improvement tool that helps the Data quality practitioner by showing the characteristics of the entities present in the Data. The tool identifies the variants and synonyms of a given entity present in the Data, an important task for writing Data quality rules that standardize the Data. We present a ripple-down rule framework for maintaining Data quality rules, which helps reduce the services effort needed to add new rules. We also present a typical workflow of the Data quality improvement process and show the usefulness of the tool at each step, along with experimental results and a discussion of how the tool reduces services effort in Data quality improvement.
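Ripple-down rules are maintained by attaching exception rules to the rule whose conclusion they correct, so existing, validated rules never have to be edited. The sketch below shows that mechanism on a made-up street/saint abbreviation case; it is a generic RDR illustration, not the framework described in the paper.

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class Rule:
        condition: Callable[[str], bool]
        conclusion: str
        exceptions: List["Rule"] = field(default_factory=list)  # refinements added over time

        def fire(self, token: str) -> Optional[str]:
            if not self.condition(token):
                return None
            for exc in self.exceptions:      # a matching exception overrides this rule
                result = exc.fire(token)
                if result is not None:
                    return result
            return self.conclusion

    # Base rule: a bare "st" token is a street suffix ...
    rule = Rule(lambda t: t.lower().rstrip(".") == "st", "STREET")
    # ... except when written as "St." (a saint abbreviation); the exception is
    # attached without editing or re-validating the base rule.
    rule.exceptions.append(Rule(lambda t: t == "St.", "SAINT"))

    print(rule.fire("st"))    # STREET
    print(rule.fire("St."))   # SAINT
    print(rule.fire("road"))  # None - no rule covers this token yet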
-
Optimal Training Data Selection for Rule-Based Data Cleansing Models
2011 Annual SRII Global Conference, 2011
Co-Authors: Snigdha Chaturvedi, Girish Venkatachaliah, Tanveer Afzal Faruquie, K. Hima Prasad, L. Venkata Subramaniam, Sriram Padmanabhan
Abstract: Enterprises today accumulate huge quantities of Data that are often noisy and unstructured, making Data Cleansing an important task. Data Cleansing refers to standardizing Data from different sources to a common format so that the Data can be better utilized. Most enterprise Data Cleansing models are rule-based and involve a lot of manual effort. Writing Data quality rules is a tedious task and often results in erroneous rules because of the ambiguities the Data presents. A robust Data Cleansing model should be capable of handling a wide variety of records, which often depends on the choice of the sample records the knowledge engineer uses to write the rules. In this paper we present a method to select a diverse set of Data records which, when used to create the rule-based Data Cleansing model, covers the maximum number of records, together with a similarity metric between two records that helps in choosing the diverse set of Data samples. We also present a crowdsourcing-based labeling mechanism for the diverse records selected by the system, so that the collective intelligence of the crowd can be used to eliminate errors in labeling the sample Data. In addition, we present a method to select a difficult set of diverse examples so that the crowd and the rule-writer services can be utilized effectively to create a better Cleansing model, as well as a method to select such records for updating an existing rule set. Experimental results show the effectiveness of the proposed methods, demonstrating a 12% increase in the number of rules written using this procedure. We also show that the method identifies records on which the existing model yields lower accuracy than on records identified by other techniques, i.e., records that are more difficult for the existing model to cleanse.
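The abstract does not spell out the similarity metric or the selection procedure, so the sketch below only illustrates the general idea with an assumed token-overlap (Jaccard) similarity and a greedy farthest-first pick of the records that look least like those already chosen.

    def jaccard(a: str, b: str) -> float:
        """Token-overlap similarity between two records (an assumed, simple metric)."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def select_diverse(records, k):
        """Greedy farthest-first selection: repeatedly add the record least similar
        to everything already chosen, so the sample covers many record styles."""
        chosen = [records[0]]
        while len(chosen) < min(k, len(records)):
            best = min((r for r in records if r not in chosen),
                       key=lambda r: max(jaccard(r, c) for c in chosen))
            chosen.append(best)
        return chosen

    records = [
        "120 main st apt 4 new york ny",
        "120 main street apartment 4 new york ny",
        "plot 7 sector 62 noida up",
        "flat 3b brigade towers bangalore",
    ]
    print(select_diverse(records, 2))  # picks two records that look least alike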
-
Data Cleansing as a transient service
Proceedings - International Conference on Data Engineering, 2010
Co-Authors: Tanveer Afzal Faruquie, Girish Venkatachaliah, K. Hima Prasad, L. Venkata Subramaniam, Shrinivas Kulkarni, Mukesh Mohania, Pramit Basu
Abstract: There is often a transient need within enterprises for Data Cleansing, which can be satisfied by offering Data Cleansing as a transient service. Every time a Data Cleansing need arises, it should be possible to provision hardware, software and staff for accomplishing the task and then dismantle the setup. In this paper we present such a system, which uses virtualized hardware and software for Data Cleansing, and we share actual experiences gained from building it. We use a cloud infrastructure to offer virtualized Data Cleansing instances that can be accessed as a service. The system is scalable, elastic and configurable. Each enterprise has unique needs, which makes it necessary to customize both the infrastructure and the Cleansing algorithms; the system we present is easily configurable to suit the Data Cleansing needs of an enterprise.
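One way to read the "elastic" claim is that the number of virtualized cleansing workers should track the backlog and drop to zero when there is nothing to clean. The sizing rule below is a sketch of that idea under an assumed per-worker throughput; the paper gives no such figures.

    import math

    RECORDS_PER_WORKER_PER_HOUR = 50_000   # assumed throughput of one cleansing instance

    def workers_needed(backlog_records: int, deadline_hours: float, max_workers: int = 20) -> int:
        """Smallest worker count that clears the backlog before the deadline, capped."""
        if backlog_records == 0:
            return 0                       # fully dismantled when there is nothing to clean
        needed = math.ceil(backlog_records / (RECORDS_PER_WORKER_PER_HOUR * deadline_hours))
        return min(max(needed, 1), max_workers)

    print(workers_needed(1_200_000, 4))    # 6 workers clear 1.2M records in 4 hours
    print(workers_needed(0, 4))            # 0 workers: the service scales to nothing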
Haiquan Chen - One of the best experts on this subject based on the ideXlab platform.
-
A Bayesian Inference-Based Framework for RFID Data Cleansing
IEEE Transactions on Knowledge and Data Engineering, 2013
Co-Authors: Haiquan Chen, Haixun Wang, Min-te Sun
Abstract: The past few years have witnessed the emergence of an increasing number of applications for tracking and tracing based on radio frequency identification (RFID) technologies. However, raw RFID readings are usually of low quality and may contain numerous anomalies. An ideal solution for RFID Data Cleansing should address the following issues. First, in many applications, duplicate readings of the same object are very common; the solution should take advantage of the resulting Data redundancy for Data cleaning. Second, prior knowledge about the environment may help improve Data quality, and a desired solution must be able to take such knowledge into account. Third, the solution should take advantage of physical constraints in target applications to elevate the accuracy of Data Cleansing. Several RFID Data Cleansing techniques exist, but none of them supports all the aforementioned features. In this paper, we propose a Bayesian inference-based framework for cleaning raw RFID Data. We first design an n-state detection model and formally prove that the three-state model can maximize system performance. We then extend the n-state model to support two-dimensional RFID reader arrays and compute the likelihood efficiently. In addition, we devise a Metropolis-Hastings sampler with constraints, which incorporates constraint management to clean RFID Data with high efficiency and accuracy. Moreover, to support real-time object monitoring, we present a streaming Bayesian inference method to cope with real-time RFID Data streams. Finally, we evaluate the performance of our solutions through extensive experiments.
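The streaming part of the framework implies updating beliefs one read epoch at a time instead of re-running inference over the whole history. The sketch below is a generic two-state Bayes filter with invented motion and sensing probabilities, meant only to show what such an incremental update looks like.

    P_STAY, P_ARRIVE = 0.90, 0.05   # assumed P(present -> present) and P(absent -> present)
    P_DETECT, P_FALSE = 0.80, 0.05  # assumed reader detection and false-positive rates

    def stream_update(belief: float, detected: bool) -> float:
        """One filtering step: predict with the motion model, then correct with the reading."""
        predicted = belief * P_STAY + (1 - belief) * P_ARRIVE
        like_present = P_DETECT if detected else 1 - P_DETECT
        like_absent = P_FALSE if detected else 1 - P_FALSE
        numerator = like_present * predicted
        return numerator / (numerator + like_absent * (1 - predicted))

    belief = 0.5
    for detected in [True, True, False, False, False]:  # read epochs arriving for one tag
        belief = stream_update(belief, detected)
        print(round(belief, 3))  # belief rises with hits and decays after repeated misses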
-
Leveraging spatio-temporal redundancy for RFID Data Cleansing
Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), 2010
Co-Authors: Haiquan Chen, Wei-shinn Ku, Haixun Wang, Min-te Sun
Abstract: Radio Frequency Identification (RFID) technologies are used in many applications for Data collection. However, raw RFID readings are usually of low quality and may contain many anomalies. An ideal solution for RFID Data Cleansing should address the following issues. First, in many applications, duplicate readings of the same object (by multiple readers simultaneously or by a single reader over a period of time) are very common; the solution should take advantage of the resulting Data redundancy for Data cleaning. Second, prior knowledge about the readers and the environment (e.g., the prior Data distribution and the false-negative rates of readers) may help improve Data quality and remove Data anomalies, and a desired solution must be able to quantify the degree of uncertainty based on such knowledge. Third, the solution should take advantage of given constraints in target applications (e.g., the number of objects in the same location cannot exceed a given value) to elevate the accuracy of Data Cleansing. A number of RFID Data Cleansing techniques exist, but none of them supports all the aforementioned features. In this paper we propose a Bayesian inference-based approach for cleaning raw RFID Data. Our approach takes full advantage of Data redundancy. To capture the likelihood, we design an n-state detection model and formally prove that the three-state model can maximize system performance. Moreover, in order to sample from the posterior, we devise a Metropolis-Hastings sampler with Constraints (MH-C), which incorporates constraint management to clean raw RFID Data with high efficiency and accuracy. We validate our solution with a common RFID application and demonstrate the advantages of our approach through extensive simulations.
Girish Venkatachaliah - One of the best experts on this subject based on the ideXlab platform.
-
Optimal Training Data Selection for Rule-Based Data Cleansing Models
2011 Annual SRII Global Conference, 2011
Co-Authors: Snigdha Chaturvedi, Girish Venkatachaliah, Tanveer Afzal Faruquie, K. Hima Prasad, L. Venkata Subramaniam, Sriram Padmanabhan
Abstract: Enterprises today accumulate huge quantities of Data that are often noisy and unstructured, making Data Cleansing an important task. Data Cleansing refers to standardizing Data from different sources to a common format so that the Data can be better utilized. Most enterprise Data Cleansing models are rule-based and involve a lot of manual effort. Writing Data quality rules is a tedious task and often results in erroneous rules because of the ambiguities the Data presents. A robust Data Cleansing model should be capable of handling a wide variety of records, which often depends on the choice of the sample records the knowledge engineer uses to write the rules. In this paper we present a method to select a diverse set of Data records which, when used to create the rule-based Data Cleansing model, covers the maximum number of records, together with a similarity metric between two records that helps in choosing the diverse set of Data samples. We also present a crowdsourcing-based labeling mechanism for the diverse records selected by the system, so that the collective intelligence of the crowd can be used to eliminate errors in labeling the sample Data. In addition, we present a method to select a difficult set of diverse examples so that the crowd and the rule-writer services can be utilized effectively to create a better Cleansing model, as well as a method to select such records for updating an existing rule set. Experimental results show the effectiveness of the proposed methods, demonstrating a 12% increase in the number of rules written using this procedure. We also show that the method identifies records on which the existing model yields lower accuracy than on records identified by other techniques, i.e., records that are more difficult for the existing model to cleanse.
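For the crowd-labeling step, a common way to use the collective intelligence of several workers is a per-record majority vote, with low-agreement records routed back for review. The sketch below shows that aggregation with made-up labels; it is not the paper's actual mechanism.

    from collections import Counter

    def consensus(labels):
        """Majority label and agreement ratio for one record labelled by several workers."""
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        return label, votes / len(labels)

    votes = ["ORGANIZATION", "ORGANIZATION", "PERSON"]   # three crowd workers, one record
    label, agreement = consensus(votes)
    print(label, round(agreement, 2))   # low-agreement records can be routed back for review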
-
Data Cleansing as a transient service
Proceedings - International Conference on Data Engineering, 2010
Co-Authors: Tanveer Afzal Faruquie, Girish Venkatachaliah, K. Hima Prasad, L. Venkata Subramaniam, Shrinivas Kulkarni, Mukesh Mohania, Pramit Basu
Abstract: There is often a transient need within enterprises for Data Cleansing, which can be satisfied by offering Data Cleansing as a transient service. Every time a Data Cleansing need arises, it should be possible to provision hardware, software and staff for accomplishing the task and then dismantle the setup. In this paper we present such a system, which uses virtualized hardware and software for Data Cleansing, and we share actual experiences gained from building it. We use a cloud infrastructure to offer virtualized Data Cleansing instances that can be accessed as a service. The system is scalable, elastic and configurable. Each enterprise has unique needs, which makes it necessary to customize both the infrastructure and the Cleansing algorithms; the system we present is easily configurable to suit the Data Cleansing needs of an enterprise.
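The abstract also stresses that both the infrastructure and the Cleansing algorithms must be customized per enterprise. One lightweight way to picture that is a declarative, per-customer pipeline configuration over a shared step library; the step names and rules below are illustrative assumptions, not the system's configuration format.

    import re
    from typing import Callable, Dict, List

    # Shared library of cleansing steps; each enterprise picks and orders its own subset.
    STEP_LIBRARY: Dict[str, Callable[[str], str]] = {
        "trim":      lambda s: " ".join(s.split()),
        "uppercase": lambda s: s.upper(),
        "expand_st": lambda s: re.sub(r"\bST\b", "STREET", s),
    }

    def build_pipeline(step_names: List[str]) -> Callable[[str], str]:
        steps = [STEP_LIBRARY[name] for name in step_names]
        def run(record: str) -> str:
            for step in steps:
                record = step(record)
            return record
        return run

    # Two enterprises, two configurations, one service.
    retail = build_pipeline(["trim", "uppercase", "expand_st"])
    bank   = build_pipeline(["trim", "uppercase"])
    print(retail("12  main st  "))  # 12 MAIN STREET
    print(bank("12  main st  "))    # 12 MAIN ST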