Multiple Sequence Alignment

The Experts below are selected from a list of 28845 Experts worldwide ranked by ideXlab platform

Cedric Notredame - One of the best experts on this subject based on the ideXlab platform.

  • TCS: a new Multiple Sequence Alignment reliability measure to estimate Alignment accuracy and improve phylogenetic tree reconstruction
    Molecular Biology and Evolution, 2014
    Co-Authors: Jiaming Chang, Paolo Di Tommaso, Cedric Notredame
    Abstract:

    Multiple Sequence Alignment (MSA) is a key modeling procedure when analyzing biological Sequences. Homology and evolutionary modeling are the most common applications of MSAs. Both are known to be sensitive to the underlying MSA accuracy. In this work, we show how this problem can be partly overcome using the transitive consistency score (TCS), an extended version of the T-Coffee scoring scheme. Using this local evaluation function, we show that one can identify the most reliable portions of an MSA, as judged from BAliBASE and PREFAB structure-based reference Alignments. We also show how this measure can be used to improve phylogenetic tree reconstruction using both an established simulated data set and a novel empirical yeast data set. For this purpose, we describe a novel lossless alternative to site filtering that involves overweighting the trustworthy columns. Our approach relies on the T-Coffee framework; it uses libraries of pairwise Alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy. We compared TCS with Heads-or-Tails, GUIDANCE, Gblocks, and trimAl and found it to lead to significantly better estimates of structural accuracy and more accurate phylogenetic trees. The software is available from www.tcoffee.org/Projects/tcs.
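
    The "lossless alternative to site filtering" mentioned in this abstract can be pictured as repeating each column in proportion to its reliability score instead of deleting the weak ones. The Python sketch below is only an illustration under stated assumptions: the function name `overweight_columns`, the integer score scale, and the rule "repeat a column max(1, score) times" are hypothetical choices, not the behaviour of the TCS software itself.

```python
from typing import Dict, List


def overweight_columns(alignment: Dict[str, str], column_scores: List[int]) -> Dict[str, str]:
    """Repeat every column max(1, score) times so that no site is discarded.

    Assumption: higher scores mean more trustworthy columns.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("every row needs exactly one score per column")
    weighted = {}
    for name, seq in alignment.items():
        weighted[name] = "".join(
            residue * max(1, score) for residue, score in zip(seq, column_scores)
        )
    return weighted


# Toy alignment and hypothetical per-column reliability scores.
toy = {"seqA": "MK-L", "seqB": "MKAL", "seqC": "MR-L"}
scores = [3, 1, 0, 2]
result = overweight_columns(toy, scores)
print(result["seqA"])  # MMMK-LL
```

    Because every original column survives at least once, a tree-building program fed the reweighted alignment still sees all sites, with the trusted ones counted more often.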

  • Accurate Multiple Sequence Alignment of transmembrane proteins with PSI-Coffee
    BMC Bioinformatics, 2012
    Co-Authors: Jiaming Chang, Paolo Di Tommaso, Jean-François Taly, Cedric Notredame
    Abstract:

    Background: Transmembrane proteins (TMPs) constitute about 20-30% of all protein coding genes. The relative lack of experimental structure has so far made it hard to develop specific Alignment methods and the current state of the art (PRALINE™) only manages to recapitulate 50% of the positions in the reference Alignments available from the BAliBASE2-ref7. Methods: We show how homology extension can be adapted and combined with a consistency based approach in order to significantly improve the Multiple Sequence Alignment of alpha-helical TMPs. TM-Coffee is a special mode of PSI-Coffee able to efficiently align TMPs, while using a reduced reference database for homology extension. Results: Our benchmarking on BAliBASE2-ref7 alpha-helical TMPs shows a significant improvement over the most accurate methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. We also estimated the influence of the database used for homology extension and show that highly non-redundant UniRef databases can be used to obtain similar results at a significantly reduced computational cost over full protein databases. TM-Coffee is part of the T-Coffee package; a web server is also available from http://tcoffee.crg.cat/tmcoffee and freeware open source code can be downloaded from http://www.tcoffee.org/Packages/Stable/Latest .

  • Upcoming challenges for Multiple Sequence Alignment methods in the high-throughput era
    Bioinformatics, 2009
    Co-Authors: Carsten Kemena, Cedric Notredame
    Abstract:

    This review focuses on recent trends in Multiple Sequence Alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based Multiple Sequence Alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for Multiple Sequence Alignment methods in the genomic era, most notably the need to cope with very large Sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed Sequences and finally, the need to integrate many alternative methods and approaches. Contact: cedric.notredame@crg.es

  • Recent evolutions of Multiple Sequence Alignment algorithms
    PLOS Computational Biology, 2007
    Co-Authors: Cedric Notredame
    Abstract:

    An ever-increasing number of biological modeling methods depend on the assembly of an accurate Multiple Sequence Alignment (MSA). These include phylogenetic trees, profiles, and structure prediction. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods have yet managed to deliver biologically perfect MSAs. Many of the algorithms published these last years have been extensively described [1–3], and this review focuses only on the latest developments, including meta-methods and template-based Alignment techniques.

  • Recent progress in Multiple Sequence Alignment: a survey
    Pharmacogenomics, 2002
    Co-Authors: Cedric Notredame
    Abstract:

    The assembly of a Multiple Sequence Alignment (MSA) has become one of the most common tasks when dealing with Sequence analysis. Unfortunately, the wide range of available methods and the differences in the results given by these methods makes it hard for a non-specialist to decide which program is best suited for a given purpose. In this review we briefly describe existing techniques and expose the potential strengths and weaknesses of the most widely used Multiple Alignment packages.

Kazutaka Katoh - One of the best experts on this subject based on the ideXlab platform.

  • MAFFT online service: Multiple Sequence Alignment, interactive Sequence choice and visualization
    Briefings in Bioinformatics, 2019
    Co-Authors: Kazutaka Katoh, John Rozewicki, Kazunori D Yamada
    Abstract:

    This article describes several features in the MAFFT online service for Multiple Sequence Alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological Sequences are available and the need for MSAs with large numbers of Sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are working on development of MAFFT toward these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine Sequence data sets and MSAs.

  • A simple method to control over-Alignment in the MAFFT Multiple Sequence Alignment program
    Bioinformatics, 2016
    Co-Authors: Kazutaka Katoh, Daron M Standley
    Abstract:

    Motivation: We present a new feature of the MAFFT Multiple Alignment program for suppressing over-Alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-Alignment is recently becoming greater, as low-quality or noisy Sequences are increasing in protein Sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of Sequences (or groups) in a single Multiple Sequence Alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for Multiple Sequence Alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.

  • MAFFT Multiple Sequence Alignment software version 7: Improvements in performance and usability
    Molecular Biology and Evolution, 2013
    Co-Authors: Kazutaka Katoh, Daron M Standley
    Abstract:

    We report a major update of the MAFFT Multiple Sequence Alignment program. This version has several new features, including options for adding unaligned Sequences into an existing Alignment, adjustment of direction in nucleotide Alignment, constrained Alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

  • Parallelization of the MAFFT Multiple Sequence Alignment program
    Bioinformatics, 2010
    Co-Authors: Kazutaka Katoh, Hiroyuki Toh
    Abstract:

    Summary: Multiple Sequence Alignment (MSA) is an important step in comparative Sequence analyses. Parallelization is a key technique for reducing the time required for large-scale Sequence analyses. The three calculation stages, all-to-all comparison, progressive Alignment and iterative refinement, of the MAFFT MSA program were parallelized using the POSIX Threads library. Two natural parallelization strategies (best-first and simple hill-climbing) were implemented for the iterative refinement stage. Based on comparisons of the objective scores and benchmark scores between the two approaches, we selected a simple hill-climbing approach as the default. Availability: The parallelized version of MAFFT is available at http://mafft.cbrc.jp/alignment/software/. This version currently supports the Linux operating system only. Contact: kazutaka.katoh@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.

  • Recent developments in the MAFFT Multiple Sequence Alignment program
    Briefings in Bioinformatics, 2008
    Co-Authors: Kazutaka Katoh, Hiroyuki Toh
    Abstract:

    The accuracy and scalability of Multiple Sequence Alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of Sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality Alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive Alignment and the latter improved the accuracy of ncRNA Alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

Desmond G Higgins - One of the best experts on this subject based on the ideXlab platform.

  • Protein Multiple Sequence Alignment benchmarking through secondary structure prediction
    Bioinformatics, 2017
    Co-Authors: Quan Le, Fabian Sievers, Desmond G Higgins
    Abstract:

    Motivation: Multiple Sequence Alignment (MSA) is commonly used to analyze sets of homologous protein or DNA Sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of 'true' Alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual Alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few Sequences or require manual Alignment which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen Sequences. PREFAB and HomFam both rely on using a small subset of Sequences of known structure and do not fairly test the quality of a full MSA. Results: In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure Alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include Sequences of known structure. SSPA, however, measures the quality of an entire Alignment, not just the accuracy on a handful of selected Sequences. It can be scaled to Alignments of any size but here we demonstrate its use on Alignments of either 200 or 1000 Sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA Alignment options and by including different levels of mis-Alignment into MSA, and examining the effects on the scores. Availability and implementation: QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz. Contact: quan.le@ucd.ie. Supplementary information: Supplementary data are available at Bioinformatics online.
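
    To make the idea of secondary structure prediction accuracy concrete, the sketch below computes a simple per-residue agreement between a predicted and an observed secondary structure string. It is a hedged illustration only: the three-letter alphabet (H, E, C), the function name `sspa`, and the plain fraction-of-matches definition are assumptions, and QuanTest's actual scoring pipeline may differ.

```python
def sspa(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state.

    Assumption: both strings use one letter per residue (H helix, E strand, C coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("prediction and reference must have the same length")
    if not observed:
        return 0.0
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical prediction for one Sequence of known structure.
accuracy = sspa("HHHECCCE", "HHHHCCCC")
print(accuracy)  # 0.75
```

    Under the abstract's assumption, a better Alignment fed to the predictor should push this fraction up for the Sequences whose structure is known.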

  • Sequence embedding for fast construction of guide trees for Multiple Sequence Alignment
    Algorithms for Molecular Biology, 2010
    Co-Authors: Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins
    Abstract:

    Background: The most widely used Multiple Sequence Alignment methods require Sequences to be clustered as an initial step. Most Sequence clustering methods require a full distance matrix to be computed between all pairs of Sequences. This requires memory and time proportional to N^2 for N Sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large Multiple Alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the Sequences in a space where the similarities within a set of Sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of Sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for Multiple Alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
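
    The embedding idea in this abstract can be sketched as follows: rather than filling an N x N distance matrix, each Sequence is described by its distances to a small number of reference ("seed") Sequences, and clustering then operates on those short vectors. The code below is a minimal sketch under assumptions of my own: the k-mer set distance and the "first n_seeds Sequences" seed choice are illustrative stand-ins, not the distance measure or seed selection used by the authors.

```python
from typing import List, Set


def kmers(seq: str, k: int = 2) -> Set[str]:
    """All overlapping k-mers of a Sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """1 minus the Jaccard similarity of the two k-mer sets (an assumed, cheap distance)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(seqs: List[str], n_seeds: int, k: int = 2) -> List[List[float]]:
    """Map each Sequence to its vector of distances to the seed Sequences.

    Cost is N * n_seeds distance evaluations instead of roughly N^2 / 2.
    """
    seeds = seqs[:n_seeds]  # assumption: the first Sequences act as seeds
    return [[kmer_distance(s, seed, k) for seed in seeds] for s in seqs]


sequences = ["ACGTAC", "ACGTAA", "TTGGCC", "TTGGCA"]
vectors = embed(sequences, n_seeds=2)
for seq, vec in zip(sequences, vectors):
    print(seq, vec)
```

    Similar Sequences end up with similar vectors, so any standard vector clustering method can build the guide tree without ever computing all pairwise distances.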

  • T-Coffee: a novel method for fast and accurate Multiple Sequence Alignment
    Journal of Molecular Biology, 2000
    Co-Authors: Cedric Notredame, Desmond G Higgins, Jaap Heringa
    Abstract:

    We describe a new method (T-Coffee) for Multiple Sequence Alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to Multiple Alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise Alignments between the Sequences. This provides us with a library of Alignment information that can be used to guide the progressive Alignment. Intermediate Alignments are then based not only on the Sequences to be aligned next but also on how all of the Sequences align with each other. This Alignment information can be derived from heterogeneous sources such as a mixture of Alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise Alignments to generate the library. The resulting Alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the Sequences in the tests.
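
    The way a library lets "all of the Sequences" inform each pairwise decision can be illustrated with a small consistency step: the weight of a residue pair is increased whenever a residue in a third Sequence is linked to both members of the pair. The sketch below is a simplified illustration, not the T-Coffee implementation; the data layout (`Library` as a dictionary keyed by residue pairs), the function name `extend_library`, and the use of the minimum of the two link weights as the contribution are assumptions stated here for clarity.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Residue = Tuple[str, int]                 # (Sequence name, residue position)
Library = Dict[FrozenSet[Residue], float]  # weight of each aligned residue pair


def extend_library(primary: Library) -> Library:
    """Add, for every pair (x, y), the support coming through a shared partner z."""
    neighbours = defaultdict(dict)
    for pair, weight in primary.items():
        x, y = tuple(pair)
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = dict(primary)
    for z, linked in neighbours.items():
        partners = list(linked.items())
        for i, (x, wx) in enumerate(partners):
            for y, wy in partners[i + 1:]:
                if x[0] == y[0]:
                    continue  # two residues of the same Sequence are never aligned
                key = frozenset((x, y))
                extended[key] = extended.get(key, 0.0) + min(wx, wy)
    return extended


# Toy primary library over one residue from each of three Sequences.
a1, b1, c1 = ("A", 1), ("B", 1), ("C", 1)
primary = {
    frozenset((a1, b1)): 88.0,
    frozenset((a1, c1)): 77.0,
    frozenset((b1, c1)): 100.0,
}
extended = extend_library(primary)
print(extended[frozenset((a1, b1))])  # 165.0
```

    Here the A/B pair gains 77 because both residues are also linked to the same residue of C, which is the sense in which intermediate Alignments draw on information from every Sequence.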

  • The CLUSTAL_X windows interface: flexible strategies for Multiple Sequence Alignment aided by quality analysis tools
    Nucleic Acids Research, 1997
    Co-Authors: Julie D Thompson, Frederica Plewniak, Francois Jeanmougin, Toby J Gibson, Desmond G Higgins
    Abstract:

    CLUSTAL X is a new windows interface for the widely-used progressive Multiple Sequence Alignment program CLUSTAL W. The new system is easy to use, providing an integrated system for performing Multiple Sequence and profile Alignments and analysing the results. CLUSTAL X displays the Sequence Alignment in a window on the screen. A versatile Sequence colouring scheme allows the user to highlight conserved features in the Alignment. Pull-down menus provide all the options required for traditional Multiple Sequence and profile Alignment. New features include: the ability to cut-and-paste Sequences to change the order of the Alignment, selection of a subset of the Sequences to be realigned, and selection of a sub-range of the Alignment to be realigned and inserted back into the original Alignment. Alignment quality analysis can be performed and low-scoring segments or exceptional residues can be highlighted. Quality analysis and reAlignment of selected residue ranges provide the user with a powerful tool to improve and refine difficult Alignments and to trap errors in input Sequences. CLUSTAL X has been compiled on SUN Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.

  • CLUSTAL W: improving the sensitivity of progressive Multiple Sequence Alignment through Sequence weighting, position-specific gap penalties and weight matrix choice
    Nucleic Acids Research, 1994
    Co-Authors: Julie D Thompson, Desmond G Higgins, Toby J Gibson
    Abstract:

    The sensitivity of the commonly used progressive Multiple Sequence Alignment method has been greatly improved for the Alignment of divergent protein Sequences. Firstly, individual weights are assigned to each Sequence in a partial Alignment in order to downweight near-duplicate Sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different Alignment stages according to the divergence of the Sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early Alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.
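
    The first modification, weighting Sequences so that near-duplicates count less, can be sketched with a tree-based scheme in which every branch length is shared equally among the Sequences below it. This is a hedged sketch in the spirit of the abstract rather than CLUSTAL W's code: the nested-list tree representation, the names `leaves` and `sequence_weights`, and the toy branch lengths are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

# A node is either a leaf name or a list of (child, branch_length) pairs.
Tree = Union[str, List[Tuple["Tree", float]]]


def leaves(tree: Tree) -> List[str]:
    """Names of all Sequences below a node."""
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaves(child)]


def sequence_weights(tree: Tree) -> Dict[str, float]:
    """Sum, along each root-to-leaf path, branch length divided by the leaves sharing it."""
    weights: Dict[str, float] = {name: 0.0 for name in leaves(tree)}

    def walk(node: Tree) -> None:
        if isinstance(node, str):
            return
        for child, length in node:
            below = leaves(child)
            for name in below:
                weights[name] += length / len(below)
            walk(child)

    walk(tree)
    return weights


# Two near-duplicates share a long internal branch; the outlier stands alone.
toy_tree = [([("dup1", 0.1), ("dup2", 0.1)], 0.4), ("outlier", 0.6)]
w = sequence_weights(toy_tree)
print(w)
```

    In this toy tree the two near-duplicates each receive about 0.3 while the divergent Sequence receives 0.6, so the duplicated information is not counted twice.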

Jaebum Kim - One of the best experts on this subject based on the ideXlab platform.

  • PSAR-Align: improving Multiple Sequence Alignment using probabilistic sampling
    Bioinformatics, 2014
    Co-Authors: Jaebum Kim
    Abstract:

    Summary: We developed PSAR-Align, a Multiple Sequence realignment tool that can refine a given Multiple Sequence Alignment based on suboptimal Alignments generated by probabilistic sampling. Our evaluation demonstrated that PSAR-Align is able to improve the results from various Multiple Sequence Alignment tools. Availability and implementation: The PSAR-Align source code (implemented mainly in C++) is freely available for download at

  • PSAR: measuring Multiple Sequence Alignment reliability by probabilistic sampling
    Nucleic Acids Research, 2011
    Co-Authors: Jaebum Kim
    Abstract:

    Multiple Sequence Alignment, which is of fundamental importance for comparative genomics, is a difficult problem and error-prone. Therefore, it is essential to measure the reliability of the Alignments and incorporate it into downstream analyses. We propose a new probabilistic sampling-based Alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between Alignment quality and guide tree uncertainty in progressive Alignment methods, we directly generate suboptimal Alignments from an input Multiple Sequence Alignment by a probabilistic sampling method, and compute the agreement of the input Alignment with the suboptimal Alignments as the Alignment reliability score. We construct the suboptimal Alignments by an approximate method that is based on pairwise comparisons between each single Sequence and the sub-Alignment of the input Alignment where the chosen Sequence is left out. By using simulation-based benchmarks, we find that our approach is superior to existing ones, supporting that the suboptimal Alignments are highly informative source for assessing Alignment reliability. We apply the PSAR method to the Alignments in the UCSC Genome Browser to measure the reliability of Alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation study.

  • PSAR: measuring Multiple Sequence Alignment reliability by probabilistic sampling
    Research in Computational Molecular Biology, 2011
    Co-Authors: Jaebum Kim
    Abstract:

    Multiple Sequence Alignment (MSA), which is of fundamental importance for comparative genomics, is a difficult problem and error-prone. Therefore, it is essential to measure the reliability of the Alignments and incorporate it into downstream analyses. Many studies have been conducted to find the extent, cause and effect of the Alignment errors [4], and to heuristically estimate the quality of Alignments without using the true Alignment, which is unknown [2]. However, it is still unclear whether the heuristically chosen measures are general enough to take into account all Alignment errors. In this paper, we present a new Alignment reliability score, called PSAR (Probabilistic Sampling-based Alignment Reliability) score.
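
    The core of the reliability score described in the two PSAR abstracts above, the agreement between an input Alignment and a set of sampled suboptimal Alignments, can be sketched per column as the fraction of aligned residue pairs that reappear in the samples. The code is a simplified illustration under assumptions: the samples are supplied by hand rather than drawn by probabilistic sampling, columns with no aligned pair are scored 0.0 by convention, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Pair = Tuple[Tuple[int, int], Tuple[int, int]]  # ((row, residue index), (row, residue index))


def aligned_pairs(rows: List[str]) -> List[Set[Pair]]:
    """For each column, the set of residue pairs that the column aligns."""
    counters = [0] * len(rows)
    columns = []
    for col in zip(*rows):
        present = []
        for r, ch in enumerate(col):
            if ch != "-":
                present.append((r, counters[r]))
                counters[r] += 1
        columns.append(set(combinations(present, 2)))
    return columns


def column_reliability(input_rows: List[str], samples: List[List[str]]) -> List[float]:
    """Per-column fraction of aligned pairs that are also aligned in the samples."""
    sample_pairs = [set().union(*aligned_pairs(s)) for s in samples]
    scores = []
    for pairs in aligned_pairs(input_rows):
        if not pairs or not sample_pairs:
            scores.append(0.0)  # convention assumed here for gap-only columns
            continue
        hits = sum(p in sp for p in pairs for sp in sample_pairs)
        scores.append(hits / (len(pairs) * len(sample_pairs)))
    return scores


input_alignment = ["ACG-", "A-GT"]
sampled = [["ACG-", "A-GT"], ["ACG", "AGT"]]  # hand-written stand-ins for sampled Alignments
reliability = column_reliability(input_alignment, sampled)
print(reliability)  # [1.0, 0.0, 0.5, 0.0]
```

    The third column scores 0.5 because only one of the two samples aligns the same pair of G residues, which is the kind of disagreement a reliability score is meant to expose.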

Hiroyuki Toh - One of the best experts on this subject based on the ideXlab platform.

  • Parallelization of the MAFFT Multiple Sequence Alignment program
    Bioinformatics, 2010
    Co-Authors: Kazutaka Katoh, Hiroyuki Toh
    Abstract:

    Summary: Multiple Sequence Alignment (MSA) is an important step in comparative Sequence analyses. Parallelization is a key technique for reducing the time required for large-scale Sequence analyses. The three calculation stages, all-to-all comparison, progressive Alignment and iterative refinement, of the MAFFT MSA program were parallelized using the POSIX Threads library. Two natural parallelization strategies (best-first and simple hill-climbing) were implemented for the iterative refinement stage. Based on comparisons of the objective scores and benchmark scores between the two approaches, we selected a simple hill-climbing approach as the default. Availability: The parallelized version of MAFFT is available at http://mafft.cbrc.jp/alignment/software/. This version currently supports the Linux operating system only. Contact: kazutaka.katoh@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.

  • Recent developments in the MAFFT Multiple Sequence Alignment program
    Briefings in Bioinformatics, 2008
    Co-Authors: Kazutaka Katoh, Hiroyuki Toh
    Abstract:

    The accuracy and scalability of Multiple Sequence Alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of Sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality Alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive Alignment and the latter improved the accuracy of ncRNA Alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

  • MAFFT version 5: Improvement in accuracy of Multiple Sequence Alignment
    Nucleic Acids Research, 2005
    Co-Authors: Kazutaka Katoh, Hiroyuki Toh, Kei Ichi Kuma, Takashi Miyata
    Abstract:

    The accuracy of the Multiple Sequence Alignment program MAFFT has been improved. The new version (5.3) of MAFFT offers new iterative refinement options, H-INS-i, F-INS-i and G-INS-i, in which pairwise Alignment information is incorporated into the objective function. These new options of MAFFT showed higher accuracy than currently available methods including TCoffee version 2 and CLUSTAL W in benchmark tests consisting of Alignments of >50 Sequences. Like the previously available options, the new options of MAFFT can handle hundreds of Sequences on a standard desktop computer. We also examined the effect of the number of homologues included in an Alignment. For a Multiple Alignment consisting of ∼8 Sequences with low similarity, the accuracy was improved (2–10 percentage points) when the Sequences were aligned together with dozens of their close homologues (E-value < 10^-5 to 10^-20) collected from a database. Such improvement was generally observed for most methods, but remarkably large for the new options of MAFFT proposed here. Thus, we made a Ruby script, mafftE.rb, which aligns the input Sequences together with their close homologues collected from SwissProt using NCBI-BLAST.

  • Improvement in the accuracy of Multiple Sequence Alignment program MAFFT
    Genome Informatics, 2005
    Co-Authors: Kazutaka Katoh, Kei Ichi Kuma, Takashi Miyata, Hiroyuki Toh
    Abstract:

    In 2002, we developed and released a rapid Multiple Sequence Alignment program MAFFT that was designed to handle a huge (up to approximately 5,000 Sequences) and long data (approximately 2,000 aa or approximately 5,000 nt) in a reasonable time on a standard desktop PC. As for the accuracy, however, the previous versions (v.4 and lower) of MAFFT were outperformed by ProbCons and TCoffee v.2, both of which were released in 2004, in several benchmark tests. Here we report a recent extension of MAFFT that aims to improve the accuracy with as little cost of calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i in this report). These options use a new objective function combining the weighted sum-of-pairs (WSP) score and a score similar to COFFEE derived from all pairwise Alignments. We discuss the improvement in accuracy brought by this extension, mainly using two benchmark tests released very recently, BAliBASE v.3 (for protein Alignments) and BRAliBASE (for RNA Alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than those of other methods successively released in 2004, although the difference among the most accurate methods (ProbCons, TCoffee v.2 and new options of MAFFT) was small. The advantage in accuracy of [GL]-INS-i became greater for the Alignments consisting of approximately 50-100 Sequences. By utilizing this feature of MAFFT, we also examined another possible approach to improve the accuracy by incorporating homolog information collected from database. The [GL]-INS-i options are applicable to aligning up to approximately 200 Sequences, although not applicable to thousands of Sequences because of time and space complexities.
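
    The weighted sum-of-pairs (WSP) part of the objective function mentioned in this abstract can be illustrated with a small scoring routine: every pair of rows in the Alignment is scored column by column, and the pairwise scores are summed with weights. This is a sketch under explicit assumptions, not MAFFT's objective function: the match/mismatch/gap values, the linear gap cost, and the use of a product of per-Sequence weights as the pair weight are simplifications chosen for readability, and the COFFEE-like term is not modelled.

```python
from itertools import combinations
from typing import List


def pair_score(a: str, b: str, match: float = 1.0,
               mismatch: float = -1.0, gap: float = -2.0) -> float:
    """Column-by-column score of two aligned rows; gap-gap columns are ignored."""
    total = 0.0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def weighted_sum_of_pairs(rows: List[str], weights: List[float]) -> float:
    """Sum of pairwise scores, each scaled by the product of the two row weights (assumed)."""
    return sum(
        weights[i] * weights[j] * pair_score(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    )


rows = ["AC-G", "ACTG", "A-TG"]
weights = [1.0, 0.5, 1.0]  # hypothetical Sequence weights
wsp = weighted_sum_of_pairs(rows, weights)
print(wsp)  # -1.0
```

    An iterative refinement procedure would repeatedly realign parts of the Alignment and keep a change only when a score of this kind improves.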