Statistical Significance

14,000,000 Leading Edge Experts on the ideXlab platform

Scan Science and Technology

Contact Leading Edge Experts & Companies

The Experts below are selected from a list of 424044 Experts worldwide ranked by ideXlab platform

Ankit Agrawal - One of the best experts on this subject based on the ideXlab platform.

  • accelerating pairwise Statistical Significance estimation for local alignment by harvesting GPU's power
    BMC Bioinformatics, 2012
    Co-Authors: Ankit Agrawal, Sanchit Misra, Yuhong Zhang, Md Mostofa Ali Patwary, Weikeng Liao, Alok Choudhary
    Abstract:

    Background: Pairwise Statistical Significance has been recognized as able to accurately identify related sequences, which is a cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a major challenge in terms of performance and scalability.

  • Par-PSSE: software for pairwise Statistical Significance estimation in parallel for local sequence alignment
    International Journal of Digital Content Technology and Its Applications, 2012
    Co-Authors: Ankit Agrawal, Sanchit Misra, Yuhong Zhang, Md Mostofa Ali Patwary, Weikeng Liao, Alok Choudhary
    Abstract:

    Pairwise Statistical Significance (PSS) has been recognized as a very useful method for homology detection. It can help in estimating whether the output of sequence alignment is evolutionarily linked or has merely arisen by chance. However, pairwise Statistical Significance estimation (PSSE) poses a big challenge in terms of performance and scalability, since constructing the empirical score distribution during the estimation is both computationally and data intensive. This paper presents a software library for estimating pairwise Statistical Significance in parallel, named Par-PSSE, implemented in C++ using the OpenMP and MPI paradigms and their hybrids. Further, we apply the parallelization technique to estimate non-conservative PSS using standard, sequence-specific, and position-specific substitution matrices. These extensions have been found superior to the standard pairwise Statistical Significance in terms of retrieval accuracy. By distributing the compute-intensive kernels of the pairwise Statistical Significance estimation across multiple computational units, we achieve a speedup of up to 621.73× over the corresponding sequential implementation when using 1024 cores.

  • parallel pairwise Statistical Significance estimation of local sequence alignment using message passing interface library
    Concurrency and Computation: Practice and Experience, 2011
    Co-Authors: Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary
    Abstract:

    SUMMARY Homology detection is a fundamental step in sequence analysis. In recent years, pairwise Statistical Significance has emerged as a promising alternative to database Statistical Significance for homology detection. Although more accurate, it is currently very time consuming because it involves generating tens of hundreds of alignment scores to construct the empirical score distribution. This paper presents a parallel algorithm for pairwise Statistical Significance estimation, called MPIPairwiseStatSig, implemented in C using the MPI library. We further apply the parallelization technique to estimate non-conservative pairwise Statistical Significance using standard, sequence-specific, and position-specific substitution matrices, which has earlier demonstrated sequence comparison accuracy superior to that of the original pairwise Statistical Significance. Distributing the most compute-intensive portions of the pairwise Statistical Significance estimation procedure across multiple processors has been shown to result in near-linear speed-ups for the application. The MPIPairwiseStatSig program for pairwise Statistical Significance estimation is available for free academic use at www.cs.iastate.edu/~ankitag/MPIPairwiseStatSig.html. Copyright © 2011 John Wiley & Sons, Ltd.

  • efficient pairwise Statistical Significance estimation for local sequence alignment using gpu
    International Conference on Computational Advances in Bio and Medical Sciences, 2011
    Co-Authors: Yuhong Zhang, Ankit Agrawal, Sanchit Misra, Daniel Honbo, Weikeng Liao, Alok Choudhary
    Abstract:

    Pairwise Statistical Significance has been found to be quite accurate in identifying related sequences (homologs), which is a key step in numerous bioinformatics applications. However, it is computationally and data intensive, particularly for a large amount of sequence data. To prevent it from becoming a performance bottleneck, we resort to Graphics Processing Units (GPUs) for accelerating the computation. In this paper, we present a GPU memory-access optimized implementation of a pairwise Statistical Significance estimation algorithm. By exploring the algorithm's data access characteristics, we developed a tile-based scheme that produces a contiguous memory access pattern to GPU global memory and sustains a large number of threads to achieve high GPU occupancy. Our experimental results present both single- and multi-pair Statistical Significance estimations. The performance evaluation was carried out on an NVIDIA Tesla C2050 GPU. We observe more than 180× end-to-end speedup over the CPU implementation on an Intel® Core™ i7 processor. The proposed memory access optimizations and efficient framework are also applicable to many other sequence comparison based applications, such as DNA sequence mapping and database search.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices
    IEEE ACM Transactions on Computational Biology and Bioinformatics, 2011
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by Statistical Significance rather than by alignment score. Recently, it was shown that pairwise Statistical Significance gives promising results as an alternative to database Statistical Significance for getting individual Significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the Statistical Significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise Statistical Significance, which is expected to use more sequence-specific information in estimating pairwise Statistical Significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise Statistical Significance is significantly better than using a standard matrix like BLOSUM62, and than database Statistical Significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise Statistical Significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
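
The abstracts above describe estimating pairwise Statistical Significance from an empirical distribution of alignment scores. A minimal sketch of that idea follows, under loud assumptions: the scoring scheme (match/mismatch/linear gap) is a toy stand-in for a substitution matrix such as BLOSUM62, the null scores come from shuffling one sequence, and the p-value is a plain empirical tail fraction rather than the fitted score distribution the papers use. The function names are hypothetical and do not come from Par-PSSE or MPIPairwiseStatSig.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a toy scheme with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def pairwise_p_value(seq1, seq2, n_shuffles=200, seed=0):
    """Empirical p-value: fraction of shuffled-seq2 alignments scoring >= observed.

    Assumption: shuffling seq2 is used as the null model; the add-one
    correction keeps the estimate strictly positive.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(seq1, seq2)
    letters = list(seq2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if smith_waterman_score(seq1, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)
```

Each shuffle is independent of the others, which is why this kind of loop is a natural candidate for the OpenMP, MPI, and GPU parallelization the abstracts report.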

Jed A Fuhrman - One of the best experts on this subject based on the ideXlab platform.

  • Statistical Significance approximation in local trend analysis of high-throughput time-series data using the theory of Markov chains
    BMC Bioinformatics, 2015
    Co-Authors: Dongmei Ai, Jacob A. Cram, Xiaoyi Liang, Jed A Fuhrman
    Abstract:

    Local trend (i.e. shape) analysis of time series data reveals co-changing patterns in dynamics of biological systems. However, slow permutation procedures to evaluate the Statistical Significance of local trend scores have limited its applications to high-throughput time series data analysis, e.g., data from the next generation sequencing technology based studies. By extending the theories for the tail probability of the range of sum of Markovian random variables, we propose formulae for approximating the Statistical Significance of local trend scores. Using simulations and real data, we show that the approximate p-value is close to that obtained using a large number of permutations (starting at time points >20 with no delay and >30 with delay of at most three time steps) in that the non-zero decimals of the p-values obtained by the approximation and the permutations are mostly the same when the approximate p-value is less than 0.05. In addition, the approximate p-value is slightly larger than that based on permutations, making hypothesis testing based on the approximate p-value conservative. The approximation enables efficient calculation of p-values for pairwise local trend analysis, making large scale all-versus-all comparisons possible. We also propose a hybrid approach by integrating the approximation and permutations to obtain accurate p-values for significantly associated pairs. We further demonstrate its use with the analysis of the Plymouth Marine Laboratory (PML) microbial community time series from high-throughput sequencing data and find interesting organism co-occurrence dynamic patterns. The software tool is integrated into the eLSA software package that now provides accelerated local trend and similarity analysis pipelines for time series data. The package is freely available from the eLSA website: http://bitbucket.org/charade/elsa .

  • efficient Statistical Significance approximation for local similarity analysis of high throughput time series data
    Bioinformatics, 2013
    Co-Authors: Dongmei Ai, Jacob A. Cram, Jed A Fuhrman
    Abstract:

    Motivation: Local similarity analysis of biological time series data helps elucidate the varying dynamics of biological systems. However, its applications to large scale high-throughput data are limited by slow permutation procedures for Statistical Significance evaluation. Results: We developed a theoretical approach to approximate the Statistical Significance of local similarity analysis based on the approximate tail distribution of the maximum partial sum of independent identically distributed (i.i.d.) random variables. Simulations show that the derived formula approximates the tail distribution reasonably well (starting at time points with no delay and with delay) and provides P-values comparable with those from permutations. The new approach enables efficient calculation of Statistical Significance for pairwise local similarity analysis, making possible all-to-all local association studies otherwise prohibitive. As a demonstration, local similarity analysis of human microbiome time series shows that core operational taxonomic units (OTUs) are highly synergetic and some of the associations are body-site specific across samples. Availability: The new approach is implemented in our eLSA package, which now provides pipelines for faster local similarity analysis of time series data. The tool is freely available from eLSA’s website: http://meta.usc.edu/softs/lsa. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: fsun@usc.edu
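
Both abstracts above contrast fast approximations with the slow permutation procedure they replace. The following is a minimal sketch of that permutation baseline, assuming a simplified local similarity score with no time delay: the largest absolute partial sum of the products x_i * y_i over any contiguous window, divided by the series length. It also assumes the inputs are already normalized. The function names are hypothetical and are not the eLSA API.

```python
import random


def local_similarity(x, y):
    """Simplified no-delay local similarity score (maximum partial sum / n)."""
    n = len(x)
    pos = neg = best = 0.0
    for xi, yi in zip(x, y):
        pos = max(0.0, pos + xi * yi)  # best positively associated run ending here
        neg = max(0.0, neg - xi * yi)  # best negatively associated run ending here
        best = max(best, pos, neg)
    return best / n


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Permutation p-value: shuffle y and count scores at least as large as observed."""
    rng = random.Random(seed)
    observed = local_similarity(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if local_similarity(x, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

The cost is n_perm score evaluations per pair, so an all-versus-all comparison of many series quickly becomes expensive, which is the motivation the abstracts give for an analytical approximation.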

Eiichiro Fukusaki - One of the best experts on this subject based on the ideXlab platform.

  • method for assessing the Statistical Significance of mass spectral similarities using basic local alignment search tool statistics
    Analytical Chemistry, 2013
    Co-Authors: Fumio Matsuda, Hiroshi Tsugawa, Eiichiro Fukusaki
    Abstract:

    A novel method for assessing the Statistical Significance of mass spectral similarities was developed using modified basic local alignment search tool (BLAST; Karlin-Altschul) statistics. In gas chromatography/mass spectrometry-based metabolomics, many signals in raw metabolome data are identified on the basis of unexpected similarities among mass spectra and the spectra of standards. Since there is inevitably noise in the observed spectra, a list of identified metabolites includes some false positives. In the developed method, electron ionization (EI) mass spectrometry-BLAST, a similarity score of two mass spectra is calculated using a general scoring scheme, from which the probability of obtaining the score by chance (P value) is calculated. For this purpose, a simple rule for converting a unit EI mass spectrum to a mass spectral sequence as well as a score matrix for aligned mass spectral sequences was developed. A Monte Carlo simulation using randomly generated mass spectral sequences demonstrated that the null distribution or the expected number of hits (E value) follows modified Karlin-Altschul statistics. A metabolite data set obtained from green tea extract was analyzed using the developed method. Among 171 metabolite signals in the metabolome data, 93 signals were identified on the basis of significant similarities (P < 0.015) with reference data. Since the expected number of false positives is 2.6, the false discovery rate was estimated to be 2.8%, indicating that the search threshold (P < 0.015) is reasonable for metabolite identification.
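
The false discovery rate quoted in the abstract can be checked with a short calculation. The sketch below assumes that the expected number of false positives is the number of metabolite signals multiplied by the P-value threshold; the abstract does not state this formula, but the assumption reproduces its reported figures. The function names are illustrative only.

```python
def expected_false_positives(n_signals, p_threshold):
    """Assumed model: each signal has probability p_threshold of a chance hit."""
    return n_signals * p_threshold


def false_discovery_rate(n_signals, n_identified, p_threshold):
    """Expected false positives divided by the number of identified signals."""
    return expected_false_positives(n_signals, p_threshold) / n_identified


efp = expected_false_positives(171, 0.015)
fdr = false_discovery_rate(171, 93, 0.015)
print(f"expected false positives: {efp:.1f}")  # prints 2.6
print(f"false discovery rate: {fdr:.1%}")      # prints 2.8%
```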

Xiaoqiu Huang - One of the best experts on this subject based on the ideXlab platform.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices
    IEEE ACM Transactions on Computational Biology and Bioinformatics, 2011
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by Statistical Significance rather than by alignment score. Recently, it was shown that pairwise Statistical Significance gives promising results as an alternative to database Statistical Significance for getting individual Significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the Statistical Significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise Statistical Significance, which is expected to use more sequence-specific information in estimating pairwise Statistical Significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise Statistical Significance is significantly better than using a standard matrix like BLOSUM62, and than database Statistical Significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise Statistical Significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.

  • pairwise Statistical Significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty
    BMC Bioinformatics, 2009
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Background: Accurate estimation of Statistical Significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise Statistical Significance with database Statistical Significance was conducted. In this paper, we extend the earlier work on pairwise Statistical Significance by incorporating the use of multiple parameter sets.

  • pairwise Statistical Significance and empirical determination of effective gap opening penalties for protein local sequence alignment
    International Journal of Computational Biology and Drug Design, 2008
    Co-Authors: Ankit Agrawal, Volker Brendel, Xiaoqiu Huang
    Abstract:

    We evaluate various methods to estimate pairwise Statistical Significance of a pairwise local sequence alignment in terms of Statistical Significance accuracy and compare them with popular database search programs in terms of retrieval accuracy on a benchmark database. Results indicate that pairwise Statistical Significance using standard substitution matrices is significantly better than the database Statistical Significance reported by BLAST and PSI-BLAST, and that it is comparable to, and at times significantly better than, that reported by SSEARCH. An application of pairwise Statistical Significance to empirically determine effective gap opening penalties for protein local sequence alignment using the widely used BLOSUM matrices is also presented.

  • pairwise Statistical Significance versus database Statistical Significance for local alignment of protein sequences
    International Symposium on Bioinformatics Research and Applications, 2008
    Co-Authors: Ankit Agrawal, Volker Brendel, Xiaoqiu Huang
    Abstract:

    An important aspect of pairwise sequence comparison is assessing the Statistical Significance of the alignment. Most of the currently popular alignment programs report the Statistical Significance of an alignment in the context of a database search. This database Statistical Significance is dependent on the database, and hence, the same alignment of a pair of sequences may be assigned different Statistical Significance values in different databases. In this paper, we explore the use of pairwise Statistical Significance, which is independent of any database, and can be useful in cases where we only have a pair of sequences and we want to comment on the relatedness of the sequences, independent of any database. We compared different methods and determined that censored maximum likelihood fitting of the score distribution right of the peak is the most accurate method for estimating pairwise Statistical Significance. We evaluated this method in an experiment with a subset of CATH2.3, which had been previously used by other authors as a benchmark data set for protein comparison. Comparison of results with database Statistical Significance reported by popular programs like SSEARCH and PSI-BLAST indicates that the results of pairwise Statistical Significance are comparable, indeed sometimes significantly better than those of database Statistical Significance (with SSEARCH). However, PSI-BLAST performs best, presumably due to its use of query-specific substitution matrices.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Substitution Matrices with Sequence-Pair-Specific Distance
    2008 International Conference on Information Technology, 2008
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence alignment forms the basis of numerous other applications in bioinformatics. The quality of an alignment is gauged by Statistical Significance rather than by alignment score alone. Therefore, accurate estimation of Statistical Significance of a pairwise alignment is an important problem in sequence comparison. Recently, it was shown that pairwise Statistical Significance does better in practice than database Statistical Significance, and also provides quicker individual pairwise estimates of Statistical Significance without having to perform time-consuming database search. Under an evolutionary model, a substitution matrix can be derived using a rate matrix and a fixed distance. Although the commonly used substitution matrices like BLOSUM62, etc. were not originally derived from a rate matrix under an evolutionary model, the corresponding rate matrices can be back calculated. Many researchers have derived different rate matrices using different methods and data. In this paper, we show that pairwise Statistical Significance using rate matrices with sequence-pair-specific distance performs significantly better compared to using a fixed distance. Pairwise Statistical Significance using sequence-pair-specific distanced substitution matrices also outperforms database Statistical Significance reported by BLAST.
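
Once a score distribution has been fitted for a sequence pair, an alignment score is commonly converted to a P value with the extreme value (Karlin-Altschul) form P(S >= x) = 1 - exp(-K m n e^(-lambda x)), where m and n are the sequence lengths. The sketch below evaluates that formula. The values of K and lambda in the usage lines are hypothetical placeholders, not fitted parameters from any of the papers listed here, and the function name is illustrative.

```python
import math


def evd_p_value(score, m, n, K, lam):
    """P(S >= score) under an extreme value distribution with parameters K and lam."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) and stays accurate when E is very small
    return -math.expm1(-expected_hits)


# Hypothetical parameters for two sequences of length 200
p_high = evd_p_value(50, 200, 200, K=0.1, lam=0.3)
p_low = evd_p_value(30, 200, 200, K=0.1, lam=0.3)
print(p_high < p_low)  # prints True: a higher score is less likely by chance
```

In a pairwise setting, K and lambda would be estimated per sequence pair (for example from scores of shuffled sequences) instead of being taken from a database-wide fit.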

Francisco J Rodrigueztovar - One of the best experts on this subject based on the ideXlab platform.

  • spectral and cross-spectral analysis of uneven time series with the smoothed Lomb-Scargle periodogram and Monte Carlo evaluation of Statistical Significance
    Computers & Geosciences, 2012
    Co-Authors: Eulogio Pardoiguzquiza, Francisco J Rodrigueztovar
    Abstract:

    Many spectral analysis techniques have been designed assuming sequences taken with a constant sampling interval. However, there are empirical time series in the geosciences (sediment cores, fossil abundance data, isotope analysis, ...) that do not follow regular sampling because of missing data, gapped data, random sampling or incomplete sequences, among other reasons. In general, interpolating an uneven series in order to obtain a succession with a constant sampling interval alters the spectral content of the series. In such cases it is preferable to follow an approach that works with the uneven data directly, avoiding the need for an explicit interpolation step. The Lomb-Scargle periodogram is a popular choice in such circumstances, as there are programs available in the public domain for its computation. One new computer program for spectral analysis improves the standard Lomb-Scargle periodogram approach in two ways: (1) It explicitly adjusts the Statistical Significance to any bias introduced by variance reduction smoothing, and (2) it uses a permutation test to evaluate confidence levels, which is better suited than parametric methods when neighbouring frequencies are highly correlated. Another novel program for cross-spectral analysis offers the advantage of estimating the Lomb-Scargle cross-periodogram of two uneven time series defined on the same interval, and it evaluates the confidence levels of the estimated cross-spectra by a non-parametric computer intensive permutation test. Thus, the cross-spectrum, the squared coherence spectrum, the phase spectrum, and the Monte Carlo Statistical Significance of the cross-spectrum and the squared-coherence spectrum can be obtained. Both of the programs are written in ANSI Fortran 77, in view of its simplicity and compatibility. The program code is of public domain, provided on the website of the journal (http://www.iamg.org/index.php/publisher/articleview/frmArticleID/112/). Different examples (with simulated and real data) are described in this paper to corroborate the methodology and the implementation of these two new programs.

  • MAXENPER: a program for maximum entropy spectral estimation with assessment of Statistical Significance by the permutation test
    Computers & Geosciences, 2005
    Co-Authors: Eulogio Pardoiguzquiza, Francisco J Rodrigueztovar
    Abstract:

    The maximum entropy spectral estimator is widely used because of its high spectral resolution, but it lacks an easy procedure for evaluating the Statistical Significance of the spectral estimates. We implemented the non-parametric computer intensive permutation test in order to evaluate the Statistical Significance of the maximum entropy spectral estimates. There is the possibility of choosing between an underlying red or white noise in the permutation procedure. Two case studies, with a long and a short time series, illustrate the performance of the method.

  • the permutation test as a non-parametric method for testing the Statistical Significance of power spectrum estimation in cyclostratigraphic research
    Earth and Planetary Science Letters, 2000
    Co-Authors: Eulogio Pardoiguzquiza, Francisco J Rodrigueztovar
    Abstract:

    A computer-intensive Significance test for estimated power spectra of cyclic sedimentary successions is presented. This simple method requires no more than a few minutes in computer time for a PC-486, and does not require distributional assumptions. It is suitable for all the spectral analysis approaches used in practice. Moreover, good performance is achieved with relatively short stratigraphical series. The method is similar to a permutation test that has been successfully applied to other Statistical problems. In the proposed application of the permutation test to the spectral analysis of time series, the data of a stratigraphic sequence are ordered at random (random permutation) and the power spectrum is estimated by the given approach. The process is repeated many times (e.g. 1000 times) and thus it is possible to assess the Statistical Significance of the power spectrum of the original sequence for each frequency. Simulation results and the application to real data are shown in order to discuss the performance of the method.
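
The permutation procedure described in the last abstract (shuffle the series, re-estimate the spectrum, repeat, then compare per frequency) can be sketched as follows. This is an illustration under assumptions: the spectral estimator is a plain discrete Fourier periodogram of an evenly sampled series, standing in for whichever estimator is actually used (maximum entropy, Lomb-Scargle, and so on), and the null model is a white-noise shuffle. The function names are hypothetical and are not taken from the Fortran programs mentioned above.

```python
import cmath
import math
import random


def periodogram(series):
    """Power at Fourier frequencies k = 1 .. n//2 of the mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power.append(abs(coeff) ** 2 / n)
    return power


def permutation_significance(series, n_perm=1000, seed=0):
    """Per-frequency p-value: share of shuffled spectra with power >= observed."""
    rng = random.Random(seed)
    observed = periodogram(series)
    exceed = [0] * len(observed)
    shuffled = list(series)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reordering destroys any cyclicity
        for k, p in enumerate(periodogram(shuffled)):
            if p >= observed[k]:
                exceed[k] += 1
    return [(c + 1) / (n_perm + 1) for c in exceed]
```

For a series with a strong cycle, the p-value at the cycle's frequency should be small while most other frequencies stay near 1, since shuffling spreads the power across the whole spectrum.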