Alignment Score

The Experts below are selected from a list of 6285 Experts worldwide ranked by ideXlab platform

Mohammed Javeed Zaki - One of the best experts on this subject based on the ideXlab platform.

  • Indexing protein structures using suffix trees.
    Methods in molecular biology (Clifton N.J.), 2008
    Co-Authors: Feng Gao, Mohammed Javeed Zaki
    Abstract:

    Approaches for indexing proteins and for fast, scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain Alignments with database proteins. Similar proteins are selected by their Alignment Score against the query. Our results show classification accuracy up to 97.8% and 99.4% at the superfamily and class level, respectively, according to the SCOP classification and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results outperform the best previous methods.

  • PSIST: indexing protein structures using suffix trees
    2005 IEEE Computational Systems Bioinformatics Conference (CSB'05), 2005
    Co-Authors: Mohammed Javeed Zaki
    Abstract:

    Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we developed a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain Alignments with database proteins. Similar proteins are selected by their Alignment Score against the query. Our results show classification accuracy up to 97.8% and 99.4% at the superfamily and class level, respectively, according to the SCOP classification, and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results are competitive with the best previous methods.
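
The local features described in the two abstracts above (a Cα-Cα distance plus the angle between the normals of two residue triangles) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each triangle is given as three 3D points with the Cα atom as the middle vertex, and the function names and coordinates are hypothetical rather than taken from the PSIST work. The normalization and suffix-tree indexing steps are not shown.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)

def triangle_normal(p, q, r):
    """Unit normal of the plane through the three points p, q, r."""
    n = _cross(_sub(q, p), _sub(r, p))
    length = _norm(n)
    if length == 0:
        raise ValueError("degenerate triangle: points are collinear")
    return (n[0] / length, n[1] / length, n[2] / length)

def pair_feature(tri_a, tri_b, ca_index=1):
    """Return (C-alpha distance, angle between triangle normals in radians).

    ca_index marks which vertex is the C-alpha atom (an assumption here).
    """
    dist = _norm(_sub(tri_a[ca_index], tri_b[ca_index]))
    na = triangle_normal(*tri_a)
    nb = triangle_normal(*tri_b)
    cos_angle = sum(x * y for x, y in zip(na, nb))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return dist, math.acos(cos_angle)

# Hypothetical coordinates: one triangle in the xy-plane, one in the xz-plane.
tri_a = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
tri_b = ((0.0, 0.0, 3.0), (1.0, 0.0, 3.0), (1.0, 0.0, 4.0))
distance, angle = pair_feature(tri_a, tri_b)
```

In an indexing pipeline, such (distance, angle) pairs would typically be discretized into symbols before being stored in a suffix tree; the binning scheme is a design choice that the abstracts do not specify.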

Xiaoqiu Huang - One of the best experts on this subject based on the ideXlab platform.

  • Sequence-specific sequence comparison using pairwise statistical significance
    Advances in Experimental Medicine and Biology, 2011
    Co-Authors: Xiaoqiu Huang, Ankit Agrawal
    Abstract:

    Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence Alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence Alignment methods align two sequences using a substitution matrix (such as BLOSUM62) consisting of pairwise Scores for aligning different residues with each other, and give an Alignment Score for the given sequence pair. Biologists routinely use such pairwise Alignment programs to identify similar, or more specifically, related sequences (those having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the Alignment Score than by the Alignment Score alone. This research addresses the problem of accurately estimating the statistical significance of pairwise Alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research work are as follows. Firstly, sequence-specific strategies for pairwise sequence Alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, wherein accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Secondly, pairwise statistical significance is used to improve the performance of the most popular database search program, PSI-BLAST. Thirdly, heuristics are designed and implemented to speed up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online. With the all-pervasive application of sequence Alignment methods in bioinformatics using ever-increasing sequence data, this work is expected to offer useful contributions to the research community.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices
    IEEE ACM Transactions on Computational Biology and Bioinformatics, 2011
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence Alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high Alignment Score, but relatedness is usually judged by statistical significance rather than by Alignment Score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise Alignment Scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, an approach expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted. The results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and also better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH; with pretrained PSSMs, however, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.

  • Pairwise statistical significance of local sequence Alignment using multiple parameter sets and empirical justification of parameter set change penalty
    BMC Bioinformatics, 2009
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Background: Accurate estimation of the statistical significance of a pairwise Alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than zero penalty. Conclusion: The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the Alignment Score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for Alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one pair (or a few pairs) of sequences without performing a time-consuming database search.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Substitution Matrices with Sequence-Pair-Specific Distance
    2008 International Conference on Information Technology, 2008
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence Alignment forms the basis of numerous other applications in bioinformatics. The quality of an Alignment is gauged by statistical significance rather than by Alignment Score alone. Therefore, accurate estimation of statistical significance of a pairwise Alignment is an important problem in sequence comparison. Recently, it was shown that pairwise statistical significance does better in practice than database statistical significance, and also provides quicker individual pairwise estimates of statistical significance without having to perform time-consuming database search. Under an evolutionary model, a substitution matrix can be derived using a rate matrix and a fixed distance. Although the commonly used substitution matrices like BLOSUM62, etc. were not originally derived from a rate matrix under an evolutionary model, the corresponding rate matrices can be back calculated. Many researchers have derived different rate matrices using different methods and data. In this paper, we show that pairwise statistical significance using rate matrices with sequence-pair-specific distance performs significantly better compared to using a fixed distance. Pairwise statistical significance using sequence-pair-specific distanced substitution matrices also outperforms database statistical significance reported by BLAST.
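
The abstracts above judge relatedness by the statistical significance of an Alignment Score, under the assumption that Scores follow an extreme value distribution. The Python sketch below illustrates that general idea only. It is not the authors' method: the match/mismatch scoring stands in for a real substitution matrix such as BLOSUM62, the null distribution is built by shuffling one sequence (an assumption), and the Gumbel parameters are fitted by the method of moments rather than by any estimator named in the source. All function names are hypothetical.

```python
import math
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local Alignment Score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def gumbel_fit(scores):
    """Method-of-moments fit of Gumbel location (mu) and scale (beta)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    if var == 0:
        raise ValueError("null scores have zero variance; cannot fit")
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - 0.5772156649015329 * beta  # Euler-Mascheroni constant
    return mu, beta

def pairwise_p_value(a, b, shuffles=200, seed=0):
    """Score a against b, then estimate P(score >= observed) from shuffles of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(local_score(a, "".join(letters)))
    mu, beta = gumbel_fit(null_scores)
    z = (observed - mu) / beta
    p = -math.expm1(-math.exp(-z))  # 1 - exp(-exp(-z)), computed stably
    return observed, p

# Hypothetical example sequence aligned against itself.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
score, p_value = pairwise_p_value(seq, seq)
```

Because the estimate depends only on the two sequences, no database search is needed, which is the practical advantage the abstracts attribute to pairwise statistical significance.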

Ankit Agrawal - One of the best experts on this subject based on the ideXlab platform.

  • Sequence-specific sequence comparison using pairwise statistical significance
    Advances in Experimental Medicine and Biology, 2011
    Co-Authors: Xiaoqiu Huang, Ankit Agrawal
    Abstract:

    Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence Alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence Alignment methods align two sequences using a substitution matrix (such as BLOSUM62) consisting of pairwise Scores for aligning different residues with each other, and give an Alignment Score for the given sequence pair. Biologists routinely use such pairwise Alignment programs to identify similar, or more specifically, related sequences (those having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the Alignment Score than by the Alignment Score alone. This research addresses the problem of accurately estimating the statistical significance of pairwise Alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research work are as follows. Firstly, sequence-specific strategies for pairwise sequence Alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, wherein accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Secondly, pairwise statistical significance is used to improve the performance of the most popular database search program, PSI-BLAST. Thirdly, heuristics are designed and implemented to speed up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online. With the all-pervasive application of sequence Alignment methods in bioinformatics using ever-increasing sequence data, this work is expected to offer useful contributions to the research community.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices
    IEEE ACM Transactions on Computational Biology and Bioinformatics, 2011
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence Alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high Alignment Score, but relatedness is usually judged by statistical significance rather than by Alignment Score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise Alignment Scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, an approach expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted. The results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and also better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH; with pretrained PSSMs, however, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.

  • Pairwise statistical significance of local sequence Alignment using multiple parameter sets and empirical justification of parameter set change penalty
    BMC Bioinformatics, 2009
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Background: Accurate estimation of the statistical significance of a pairwise Alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than zero penalty. Conclusion: The fact that the homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the Alignment Score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for Alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be effectively used to determine the relatedness of one pair (or a few pairs) of sequences without performing a time-consuming database search.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Substitution Matrices with Sequence-Pair-Specific Distance
    2008 International Conference on Information Technology, 2008
    Co-Authors: Ankit Agrawal, Xiaoqiu Huang
    Abstract:

    Pairwise sequence Alignment forms the basis of numerous other applications in bioinformatics. The quality of an Alignment is gauged by statistical significance rather than by Alignment Score alone. Therefore, accurate estimation of statistical significance of a pairwise Alignment is an important problem in sequence comparison. Recently, it was shown that pairwise statistical significance does better in practice than database statistical significance, and also provides quicker individual pairwise estimates of statistical significance without having to perform time-consuming database search. Under an evolutionary model, a substitution matrix can be derived using a rate matrix and a fixed distance. Although the commonly used substitution matrices like BLOSUM62, etc. were not originally derived from a rate matrix under an evolutionary model, the corresponding rate matrices can be back calculated. Many researchers have derived different rate matrices using different methods and data. In this paper, we show that pairwise statistical significance using rate matrices with sequence-pair-specific distance performs significantly better compared to using a fixed distance. Pairwise statistical significance using sequence-pair-specific distanced substitution matrices also outperforms database statistical significance reported by BLAST.
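
The abstract above notes that, under an evolutionary model, a substitution matrix can be derived from a rate matrix and a distance, and that a sequence-pair-specific distance works better than a fixed one. The Python sketch below illustrates the principle with the Jukes-Cantor nucleotide model, chosen only because its transition probabilities have a closed form. The paper itself concerns amino-acid rate matrices, so the model, the distance estimator, and the function names here are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def jc_distance(a, b):
    """Jukes-Cantor distance (substitutions per site) of two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor formula")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_score_matrix(d):
    """Log-odds scores log2(P_ij(d) / pi_j) at distance d, with pi_j = 1/4."""
    if d <= 0:
        raise ValueError("distance must be positive (P_ij would be 0 at d = 0)")
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a base is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific change
    return {
        (x, y): math.log2((p_same if x == y else p_diff) / 0.25)
        for x in BASES
        for y in BASES
    }

# Hypothetical pre-aligned pair differing at one of eight positions.
d = jc_distance("ACGTACGT", "ACGTACGA")
matrix = jc_score_matrix(d)
```

A closely related pair yields a small distance and therefore a matrix that rewards matches less and penalizes mismatches more heavily than a matrix built for a large distance, which is the intuition behind tailoring the distance to each sequence pair.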

Feng Gao - One of the best experts on this subject based on the ideXlab platform.

  • Indexing protein structures using suffix trees.
    Methods in molecular biology (Clifton N.J.), 2008
    Co-Authors: Feng Gao, Mohammed Javeed Zaki
    Abstract:

    Approaches for indexing proteins and for fast, scalable searching for structures similar to a query structure have important applications such as protein structure and function prediction, protein classification and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain Alignments with database proteins. Similar proteins are selected by their Alignment Score against the query. Our results show classification accuracy up to 97.8% and 99.4% at the superfamily and class level, respectively, according to the SCOP classification and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results outperform the best previous methods.

Olivier Bastien - One of the best experts on this subject based on the ideXlab platform.

  • Where does the Alignment Score distribution shape come from?
    Evolutionary Bioinformatics, 2010
    Co-Authors: Philippe Ortet, Olivier Bastien
    Abstract:

    Alignment algorithms are powerful tools for searching for homologous proteins in databases, providing a Score for each sequence present in the database. It has been well known for 20 years that the shape of the Score distribution looks like an extreme value distribution. The extremely large number of times biologists face this class of distributions raises the question of the evolutionary origin of this probability law. We investigated the possibility of deriving the main properties of sequence Alignment Score distributions from a basic evolutionary process: a duplication-divergence protein evolution process in a sequence space. Firstly, the distribution of sequences in this space was defined with respect to the genetic distance between sequences. Secondly, we derived a basic relation between the genetic distance and the Alignment Score. We obtained a novel Score probability distribution which is qualitatively very similar to that of Karlin-Altschul but performs better than all previous models.

  • A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-Score probabilities
    BMC Bioinformatics, 2005
    Co-Authors: Olivier Bastien, Philippe Ortet, Eric Maréchal
    Abstract:

    Background: Popular methods to reconstruct molecular phylogenies are based on multiple sequence Alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence Alignments, respect probabilistic properties of Z-Scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction. Results: We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise Alignment Score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-Score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-Scores. Deduced trees, called TULIP trees, are consistent with multiple-Alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny. Conclusion: The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionarily consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-Score computations.

  • Fundamentals of massive automatic pairwise Alignments of protein sequences: theoretical significance of Z-value statistics
    Bioinformatics, 2004
    Co-Authors: Olivier Bastien, Jean-christophe Aude, Eric Maréchal
    Abstract:

    Motivation: Different automatic methods of sequence Alignment are routinely used as a starting point for homology searches and function inference. Confidence in an Alignment probability is one of the major fundamentals of massive automatic genome-scale pairwise comparisons, for clustering of putative orthologs and paralogs, sequenced genome annotation or multiple-genomic tree constructions. Extreme value distributions based on the Karlin-Altschul model, usually advised for large-scale comparisons, are not always valid, particularly in the case of comparisons of non-biased with nucleotide-biased genomes (such as that of Plasmodium falciparum). Z-value estimates, based on Monte Carlo techniques, can be calculated experimentally for any Alignment output, whatever the method used. Empirically, a Z-value higher than ∼8 is considered reasonable for concluding that an Alignment Score is significant, but this arbitrary figure was never theoretically justified. Results: In this paper, we used the Bienaymé-Chebyshev inequality to demonstrate a theorem on the upper limit of an Alignment Score probability (or P-value). This theorem implies that a computed Z-value is a statistical test and a single-linkage clustering criterion, and that 1/Z-value² is an upper limit to the probability of an Alignment Score, whatever the actual probability law is. Therefore, this study provides the missing theoretical link between a Z-value cut-off used for an automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes).
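
The Z-value and its Chebyshev-type bound described in the abstract above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the null Scores come from shuffling one sequence, and `identity_score` is a deliberately simple, hypothetical stand-in for a real Alignment Score, since the abstract states that a Z-value can be computed for any Alignment method. The bound 1/Z² is the one quoted in the abstract.

```python
import math
import random

def z_value(observed, null_scores):
    """Z = (observed - mean) / standard deviation of the null Scores."""
    n = len(null_scores)
    mean = sum(null_scores) / n
    var = sum((s - mean) ** 2 for s in null_scores) / (n - 1)
    if var == 0:
        raise ValueError("null scores have zero variance")
    return (observed - mean) / math.sqrt(var)

def chebyshev_p_bound(z):
    """Distribution-free upper bound 1/Z^2 on the Score probability.

    For Z <= 1 the inequality gives no information, so the bound is 1.
    """
    return 1.0 if z <= 1 else 1.0 / (z * z)

def identity_score(a, b):
    """Toy score: identical positions in an ungapped overlay (assumption)."""
    return sum(x == y for x, y in zip(a, b))

def monte_carlo_z(a, b, score_fn, shuffles=200, seed=0):
    """Z-value of score_fn(a, b) against Scores of a versus shuffled copies of b."""
    rng = random.Random(seed)
    observed = score_fn(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(score_fn(a, "".join(letters)))
    return z_value(observed, null_scores)

# The empirical cut-off Z = 8 corresponds to a bound of 1/64.
bound_at_8 = chebyshev_p_bound(8)

# Hypothetical example sequence compared with itself.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
z = monte_carlo_z(seq, seq, identity_score)
```

The bound is loose compared with a fitted extreme value distribution, but it holds whatever the true Score distribution is, which is why it remains usable for compositionally biased genomes.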