Data Profiling

The experts below are selected from a list of 123,201 experts worldwide, ranked by the ideXlab platform.

Felix Naumann - One of the best experts on this subject based on the ideXlab platform.

  • Data Profiling
    2018
    Co-Authors: Ziawasch Abedjan, Lukasz Golab, Felix Naumann
    Abstract:

    One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eyeballing random subsets of the data or formulating aggregation queries, to the systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this tutorial, we highlight the importance of data profiling as part of any data-related use case, and discuss the area of data profiling by classifying data profiling tasks and reviewing state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams. We conclude with directions for future research in the area of data profiling. This tutorial is based on our survey on profiling relational data [1].
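    As a toy illustration of the simpler end of this spectrum, the sketch below computes per-column null counts, distinct counts, and a dominant value pattern. It is a minimal example assuming a pandas environment; the table and column names are invented.

        # Minimal single-column profiling sketch (hypothetical data).
        import pandas as pd

        df = pd.DataFrame({
            "email": ["a@x.org", "b@y.com", None, "c@x.org"],
            "age": [34, 28, 41, 28],
        })

        for col in df.columns:
            s = df[col]
            # Reduce each value to a symbolic pattern, e.g. "a@x.org" -> "a@a.aaa".
            patterns = (s.dropna().astype(str)
                         .str.replace(r"[A-Za-z]", "a", regex=True)
                         .str.replace(r"[0-9]", "9", regex=True))
            print(f"{col}: dtype={s.dtype}, nulls={s.isna().sum()}, "
                  f"distinct={s.nunique()}, top pattern={patterns.mode().iat[0]}")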

  • Profiling relational data: a survey
    2015
    Co-Authors: Ziawasch Abedjan, Lukasz Golab, Felix Naumann
    Abstract:

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
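    For instance, a functional dependency A -> B holds exactly when no two rows agree on A but disagree on B. The following sketch tests a single candidate dependency in one pass over the rows; it illustrates the property itself, not the efficient discovery algorithms the survey reviews, and the relation is invented.

        # Naive check of a functional dependency A -> B on a list of dict rows.
        def fd_holds(rows, a, b):
            seen = {}  # each A-value must map to exactly one B-value
            for row in rows:
                if row[a] in seen and seen[row[a]] != row[b]:
                    return False  # same A-value, two different B-values
                seen[row[a]] = row[b]
            return True

        rows = [{"zip": "10115", "city": "Berlin"},
                {"zip": "10115", "city": "Berlin"},
                {"zip": "20095", "city": "Hamburg"}]
        print(fd_holds(rows, "zip", "city"))  # True: zip -> city holds here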

  • Data profiling with Metanome
    2015
    Co-Authors: Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, Felix Naumann
    Abstract:

    Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serves a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they usually lack automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies. We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome's goal is to provide novel profiling algorithms from research, to perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the, at times, large metadata sets.
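    To make the discovered metadata concrete: a unary inclusion dependency R.a ⊆ S.b holds when every value of column a in R also occurs in column b of S, which is why such dependencies are foreign-key candidates. The sketch below is a naive set-based test on invented data, not one of Metanome's scalable discovery algorithms.

        # Naive unary inclusion dependency test: is R.a included in S.b?
        def ind_holds(r_values, s_values):
            return set(r_values) - {None} <= set(s_values)  # ignore nulls in R.a

        orders_customer_id = [1, 2, 2, 3]
        customers_id = [1, 2, 3, 4]
        print(ind_holds(orders_customer_id, customers_id))  # True: foreign-key candidate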

  • Data profiling revisited
    2014
    Co-Authors: Felix Naumann
    Abstract:

    Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies. Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling, and profiling heterogeneous and non-relational data.
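    As a hint of what incremental profiling could mean in practice, the sketch below keeps completeness and distinct counts up to date under row insertions instead of re-scanning the table; the class and names are illustrative and not taken from any tool.

        # Maintain per-column profiling statistics incrementally as values arrive.
        from collections import Counter

        class ColumnProfile:
            def __init__(self):
                self.rows = 0
                self.nulls = 0
                self.values = Counter()

            def add(self, value):  # called once per inserted row
                self.rows += 1
                if value is None:
                    self.nulls += 1
                else:
                    self.values[value] += 1

            @property
            def completeness(self):  # fraction of non-null entries
                return 1 - self.nulls / self.rows if self.rows else 1.0

            @property
            def distinct(self):
                return len(self.values)

        p = ColumnProfile()
        for v in ["a", "b", None, "a"]:
            p.add(v)
        print(p.completeness, p.distinct)  # 0.75 2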

  • Profiling linked open data with ProLOD
    2010
    Co-Authors: Christoph Böhm, Dandy Fenz, Toni Grütze, Daniel Hefenbrock, Ziawasch Abedjan, Matthias Pohl, Felix Naumann, David Sonnabend
    Abstract:

    Linked open data (LOD), as provided by a quickly growing number of sources, constitutes a wealth of easily accessible information. However, this data is not easy to understand. It is usually provided as a set of (RDF) triples, often enough in the form of enormous files covering many domains. What is more, the data usually has a loose structure when it is derived from end-user-generated sources, such as Wikipedia. Finally, the quality of the actual data is also worrisome, because it may be incomplete, poorly formatted, inconsistent, etc. To understand and profile such linked open data, traditional data profiling methods do not suffice. With ProLOD, we propose a suite of methods ranging from the domain level (clustering, labeling), via the schema level (matching, disambiguation), to the data level (data type detection, pattern detection, value distribution). Packaged into an interactive, web-based tool, they allow iterative exploration and discovery of new LOD sources. Thus, users can quickly gauge the relevance of a source for the problem at hand (e.g., some integration task), and focus on and explore the relevant subset.
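    A minimal sketch of the data-level step on RDF triples, in the spirit of (but not taken from) ProLOD: it tallies a crude data type distribution per predicate. The triples are invented; a real source would be parsed with an RDF library such as rdflib.

        # Per-predicate data type detection and value distribution over RDF triples.
        from collections import Counter, defaultdict

        triples = [
            ("dbpedia:Berlin", "dbo:population", "3645000"),
            ("dbpedia:Berlin", "rdfs:label", "Berlin"),
            ("dbpedia:Hamburg", "dbo:population", "1841000"),
        ]

        by_predicate = defaultdict(Counter)
        for s, p, o in triples:
            kind = "numeric" if o.replace(".", "", 1).isdigit() else "string"
            by_predicate[p][kind] += 1

        for predicate, dist in by_predicate.items():
            print(predicate, dict(dist))  # e.g. dbo:population {'numeric': 2}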

J De Las Rivas - One of the best experts on this subject based on the ideXlab platform.

  • DECO: decompose heterogeneous population cohorts for patient stratification and discovery of sample biomarkers using omic data profiling
    2019
    Co-Authors: Francisco J Campos-Laborie, Alberto Risueño, Maria Ortiz-Estévez, B Rosón-Burgo, Conrad Droste, Celia Fontanillo, Remco Loos, Jose Manuel Sánchez-Santos, Matthew Trotter, J De Las Rivas
    Abstract:

    MOTIVATION: Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypic factors are present. Here, we propose a method to analyze and understand heterogeneous data, avoiding the classical normalization approaches of reducing or removing variation. RESULTS: DEcomposing heterogeneous Cohorts using Omic Data Profiling (DECO) is a method to find significant associations among biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and the predictor-response relationship from non-symmetrical correspondence analysis into a single statistic (the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, comparing against seven other methods. We show that DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification. AVAILABILITY AND IMPLEMENTATION: DECO is freely available as an R package (including a practical vignette) at the Bioconductor repository (http://bioconductor.org/packages/deco/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
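    The following is a schematic illustration of the recurrent differential analysis idea alone: repeatedly subsample the cohort and count how often each feature comes out significant. It is not the DECO algorithm, which further integrates non-symmetrical correspondence analysis into the h-statistic; the data, subsample sizes, and threshold are invented.

        # Recurrent differential analysis by repeated subsampling (schematic only).
        import numpy as np
        from scipy.stats import ttest_ind

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 20))   # 100 features x 20 samples
        X[0, 10:] += 2.0                 # feature 0 is shifted in group B
        groups = np.array([0] * 10 + [1] * 10)

        hits = np.zeros(X.shape[0])
        for _ in range(200):             # recurrent subsampling rounds
            a = rng.choice(np.where(groups == 0)[0], 5, replace=False)
            b = rng.choice(np.where(groups == 1)[0], 5, replace=False)
            p = ttest_ind(X[:, a], X[:, b], axis=1).pvalue
            hits += p < 0.05             # count per-feature significance
        print("most recurrent feature:", hits.argmax())  # expected: 0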

Francisco J Campos-Laborie - One of the best experts on this subject based on the ideXlab platform.

  • DECO: decompose heterogeneous population cohorts for patient stratification and discovery of sample biomarkers using omic data profiling
    2019
    Co-Authors: Francisco J Campos-Laborie, Alberto Risueño, Maria Ortiz-Estévez, B Rosón-Burgo, Conrad Droste, Celia Fontanillo, Remco Loos, Jose Manuel Sánchez-Santos, Matthew Trotter, J De Las Rivas
    Abstract:

    MOTIVATION: Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypic factors are present. Here, we propose a method to analyze and understand heterogeneous data, avoiding the classical normalization approaches of reducing or removing variation. RESULTS: DEcomposing heterogeneous Cohorts using Omic Data Profiling (DECO) is a method to find significant associations among biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and the predictor-response relationship from non-symmetrical correspondence analysis into a single statistic (the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, comparing against seven other methods. We show that DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification. AVAILABILITY AND IMPLEMENTATION: DECO is freely available as an R package (including a practical vignette) at the Bioconductor repository (http://bioconductor.org/packages/deco/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Cenk Sahinalp - One of the best experts on this subject based on the ideXlab platform.

  • Abstract LB020: Epigenomic tumor evolution modeling with single-cell methylation data profiling
    2021
    Co-Authors: Xuan C Li, Yuelin Liu, Farid Rashidi, Salem Malikic, Stephen M Mount, Eytan Ruppin, Kenneth Aldape, Cenk Sahinalp
    Abstract:

    The heritability of methylation patterns in tumor cells, as shown in recent studies, suggests that tumor heterogeneity and progression can be interpreted and predicted in the context of methylation changes. To elucidate methylation-based evolution trajectories in tumors, we introduce a novel computational method for methylation phylogeny reconstruction leveraging single-cell bisulfite-treated whole-genome sequencing data (scBS-seq), incorporating additional copy number information inferred independently from matched single-cell RNA sequencing (scRNA-seq) data, when available. We validate our method with the scBS-seq data of multi-regionally sampled colorectal cancer cells, and demonstrate that the cell lineages constructed by our method strongly correlate with the original sampling regions. Our method consists of three components: (i) noise-minimizing site selection, (ii) likelihood-based sequencing error correction, and (iii) pairwise expected distance calculation for cells, all designed to mitigate the effects of noise and uncertainty due to the data sparsity commonly observed in scBS-seq data. In (i), we present an integer linear programming-based biclustering formulation to select a set of CpG sites and cells so that the number of CpG sites with non-zero coverage in the selected cells is maximized. This procedure filters out cells with read information in too few sites and CpG sites with read information in too few cells. In (ii), we address the sequencing errors commonly encountered on currently available platforms with a maximum log-likelihood approach to correct likely sequencing errors in scBS-seq reads, incorporating CpG-site copy number information in case it can be orthogonally obtained. Given the copy number and read information for a site in a cell, together with the overall sequencing error probability, we compute the log likelihood for all possible underlying allele statuses. If the mixed read statuses at the CpG site for the cell are more likely due to sequencing error on homozygous alleles than to the presence of alleles with mixed methylation statuses, we correct the reads of the minority methylation status to the majority one. In (iii), we introduce a formulation to estimate distances between any pair of cells. As scBS-seq data is typically characterized by shallow read coverage, there is rarely read count evidence for two (or more, depending on CNV status) alleles at a CpG site. Since allele-specific methylation has been shown to have increased frequency in cancer tissues, given the reads at a CpG site, it is especially important to consider the possibility of unobserved alleles and their methylation status when determining the CpG site's possible methylation zygosities. Our method incorporates copy number information when available, and for each CpG site in a cell, we compute a probability distribution across all possible methylation zygosities. Then, given specific distance values between pairs of distinct zygosities and the likelihood of each possible zygosity for each shared CpG site in both cells, we compute the expected total distance between any pair of cells as the mean of the expected distances across all shared CpG sites. We leverage such pairwise distances in methylation phylogeny construction.
    Citation Format: Xuan C. Li, Yuelin Liu, Farid Rashidi, Salem Malikic, Stephen M. Mount, Eytan Ruppin, Kenneth Aldape, Cenk Sahinalp. Epigenomic tumor evolution modeling with single-cell methylation data profiling [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr LB020.
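    A minimal sketch of component (iii) as described above: the expected distance between two cells is the mean, over their shared CpG sites, of the expected zygosity distance under each cell's per-site zygosity distribution. The zygosity states, distance table, and probabilities are illustrative stand-ins, not values from the paper.

        # Expected pairwise cell distance over shared CpG sites.
        import numpy as np

        ZYGOSITIES = ["unmethylated", "heterozygous", "methylated"]
        DIST = np.array([[0.0, 0.5, 1.0],   # assumed distances between zygosities
                         [0.5, 0.0, 0.5],
                         [1.0, 0.5, 0.0]])

        def expected_distance(cell_a, cell_b):
            """Each cell maps site -> probability vector over ZYGOSITIES."""
            shared = cell_a.keys() & cell_b.keys()
            per_site = [np.array(cell_a[s]) @ DIST @ np.array(cell_b[s])
                        for s in shared]
            return float(np.mean(per_site))

        a = {"chr1:100": [0.9, 0.1, 0.0], "chr1:200": [0.1, 0.2, 0.7]}
        b = {"chr1:100": [0.8, 0.2, 0.0], "chr1:200": [0.7, 0.2, 0.1]}
        print(expected_distance(a, b))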

Steven H. Kleinstein - One of the best experts on this subject based on the ideXlab platform.

  • Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations
    2013
    Co-Authors: Gur Yaari, Christopher R. Bolen, Juilee Thakar, Steven H. Kleinstein
    Abstract:

    Enrichment analysis of gene sets is a popular approach that provides a functional interpretation of genome-wide expression data. Existing tests are affected by inter-gene correlations, resulting in a high Type I error rate. The most widely used test, Gene Set Enrichment Analysis, relies on computationally intensive permutations of sample labels to generate a null distribution that preserves gene-gene correlations. A more recent approach, CAMERA, attempts to correct for these correlations by estimating a variance inflation factor directly from the data. Although these methods generate P-values for detecting gene set activity, they are unable to produce confidence intervals or allow for post hoc comparisons. We have developed a new computational framework for Quantitative Set Analysis of Gene Expression (QuSAGE). QuSAGE accounts for inter-gene correlations, improves the estimation of the variance inflation factor and, rather than evaluating the deviation from a null hypothesis with a P-value, quantifies gene-set activity with a complete probability density function. From this probability density function, P-values and confidence intervals can be extracted and post hoc analysis can be carried out while maintaining statistical traceability. Compared with Gene Set Enrichment Analysis and CAMERA, QuSAGE exhibits better sensitivity and specificity on real data profiling the response to interferon therapy (in chronic hepatitis C virus patients) and influenza A virus infection. QuSAGE is available as an R package, which includes the core functions for the method as well as functions to plot and visualize the results.
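    To see why inter-gene correlation matters, note that the variance of the mean of m equicorrelated genes is (1 + (m - 1) * rho) / m rather than 1 / m; the factor 1 + (m - 1) * rho is the variance inflation factor that CAMERA estimates and QuSAGE refines. The simulation below checks the formula; the numbers are invented, and QuSAGE itself is an R package.

        # Variance inflation of a gene-set mean under equicorrelated genes.
        import numpy as np

        rng = np.random.default_rng(1)
        m, n, rho = 50, 1000, 0.2
        cov = np.full((m, m), rho) + (1 - rho) * np.eye(m)  # unit-variance genes
        genes = rng.multivariate_normal(np.zeros(m), cov, size=n)  # n draws x m genes

        set_means = genes.mean(axis=1)   # the naive gene-set statistic
        vif = 1 + (m - 1) * rho          # variance inflation factor
        print(set_means.var(), vif / m)  # empirical vs predicted: both ~0.216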