Sequence alignment

From Free net encyclopedia

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (usually denoted by dashes) so that wherever possible, columns contain identical or similar characters from the sequences involved:

 tcctctgcctctgccatcat---caaccccaaagt
 |||| ||| ||||| |||||   ||||||||||||
 tcctgtgcatctgcaatcatgggcaaccccaaagt

It is usually used to study the evolution of the sequences from a common ancestor, especially biological sequences such as protein sequences or DNA sequences. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment can also be used to study things like the evolution of languages and the similarity between texts.

The term sequence alignment may also refer to the process of constructing such alignment or finding significant alignments in a database of potentially unrelated sequences.

Contents

Pairwise alignment

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise (local) or global alignments of protein (amino acid) or DNA (nucleic acid) sequences.

Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples. This information is useful for answering a variety of biological questions. The most important application of pairwise alignment is identification of sequences of unknown structure or function. Another important use is the study of molecular evolution.

Two algorithms for pairwise alignment are the Needleman-Wunsch algorithm and the Smith Waterman algorithm.

Global alignment

A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences. As these sequences are also easily identified by local alignment methods global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling - see below) which prevent these methods from being useful.

Local alignment

Local alignment methods find related regions within sequences - in other words they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.

This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.

Significance of alignment

The typical assumption in the use of alignment is the mechanism of molecular evolution. DNA carries over genetic material from generation to generation, by virtue of its semi-conservative duplication mechanism. Changes in the material are introduced by occasional errors and mutations in the duplication, and by viruses and other mechanisms which sometimes move sub-sequences within the chromosome and between individuals. Consequently, an alignment between sequences indicates that the sequences evolved from a common ancestor which contained the matching subsequences. In the case of genetic sequences, it implies that their carriers evolved from a common ancestor.

Using assumptions about the probabilities of these change events, we can estimate the time when sequences diverged from a common ancestor or the time required for changing one sequence into another. However, there is disagreement about the value and nature of these probabilities for biological evolution. One school of thought assumes a simple, constant rate of change (an application of Occam's Razor) while the other school assumes short evolutionary periods when changes were extremely high.

Many genetic changes are evolutionarily non-significant (that is, neutral, neither positive or negative) in that they do not affect the selective advantage (the fitness) of the carrier. This is the basis of Kimura's Neutral theory. This is partly supported by examples such as the Globin protein sequence which shows normal functionality in-spite of several changes at various bases but loses functionality only when changes occur at critical sequence locations.

The actual biological meaning of any alignment can never be absolutely guaranteed. However, statistical methods can be used to assess the likelihood of finding an alignment between two regions (or sequences) by chance, given the size of the database and its composition.

Two important related issues for sequence alignment are:

  1. How should we choose the best alignment between two sequences (or regions of sequences)?
  2. How do we rank the alignments between a query and the many sequences in the database, according to their significance in the domain (such as biological significance)?

In studying evolution, the first question can be addressed by developing a model of how likely are certain changes between characters in the sequence. There are many ways to do this, none of which is generally superior. The models are derived empirically using related sequences, and are expressed as substitution matrices. These matrices, along with gap penalties, are used by the algorithms named below to evaluate alternative alignments between two sequences. The actual biological quality of the alignments then depends upon the evolutionary model used to generate the score.

The second question is purely statistical. Research has determined a few hard theoretical rules and many approximations. It is now generally accepted that the scores of alignments between random sequences follow the extreme value distribution. Pairwise alignment programs such as BLAST use simulation to estimate the parameters of this distribution given a particular query, database, substitution matrix and certain other parameters. Alignments can then be given a statistical significance value, allowing inference of possible relationships between sequences.

Structural alignment

One potential way to validate sequence alignments is to use structural alignments where possible. Because protein structure is more conserved through evolution than is sequence (as shown by Cyrus Chothia and Arthur Lesk in 1986), structural alignments can be more reliable over long evolutionary distances, when the sequences have diverged so much that simple sequence comparison cannot detect their similarity. However, although structural alignments are useful, they are completely artifactual. Structure-based sequence alignments are useful in identifying structurally-conserved regions among a family of closely- or distantly-related proteins when visualization software is available.

Multiple alignment

Multiple alignment is an extension of pairwise alignment to incorporate more than two sequences into an alignment. Multiple alignment methods try to align all of the sequences in a specified set. Alignments help in the identification of common regions between the sequences. There are several approaches to creating multiple sequence alignments, one of the most popular being the progressive alignment strategy used by the Clustal family of programs. Clustal is used in cladistics to build phylogenetic trees, and to build sequence profiles which are used by PSI-BLAST and Hidden Markov model- (HMM-) methods to search sequence databases for more distant relatives.

Multiple sequence alignment is computationally difficult and is classified as an NP-Hard problem.

Algorithms

Needleman-Wunsch

Pairwise global alignment only.

Smith-Waterman

Pairwise, local or global alignment.

Framesearch

This is an extension of Smith-Waterman, for pairwise alignment between a protein sequence and a nucleotide sequence.

Framesearch dynamically considers every possible single-nucleotide insertion or deletion (indel) to generate the translation that best matches the protein sequence. This is unlike six-frame-translated BLAST or Smith-Waterman, which generate all six possible translations of each nucleotide sequence and use them for protein search. This allows dramatically better alignments when the nucleotide sequence contains many indels, as is often the case with single-read expressed sequence tag sequences and very early versions of genomic sequences.

Unfortunately Framesearch is extremely CPU-intensive. Its practical use typically requires a large Computer cluster or specialized computer hardware optimized for running dynamic programming algorithms on a large scale. Several companies sell such systems, which use either custom integrated circuits or Field-programmable gate array technology.

Software

SSearch

Implements the standard Smith-Waterman algorithm. It is considerably slower than the more modern BLAST and FASTA methods. However, Smith-Waterman remains the gold standard for protein-protein or nucleotide-nucleotide pairwise alignment; its speed can be improved by using specialized hardware or a computer cluster.

BLAST (Basic Local Alignment Search Tool)

Main article: BLAST

This method uses a pre-computed hash table to serve as an index for short sequences. Given a query sequence, the sub-sequences are looked up in the index to reduce the amount of time and searching involved. Several parameters need to be provided to make this method faster or more accurate. Once patterns that match the search sequence are found, more accurate and intensive algorithms may be applied.

BLAST uses a pairwise local search and uses a number of methods to increase the speed of the original Smith-Waterman algorithm.

FASTA

Main article: FASTA

Pairwise local search. Much slower but more sensitive than BLAST. Recent versions of the Fasta suite include specialized algorithms for translated searches that can align low-quality nucleotide sequence to protein sequence despite large numbers of indels (translated Blast searches do not handle insertion or deletion errors very well).

Clustal

Main article: Clustal

Progressive multiple alignment method. Comes in several varieties (ClustalW, ClustalX etc.)

POY

POY employs an approach which attempts to simultaneously optimize both the phylogenetic tree and the sequence alignment.

External links

POA - ProbCons - T-Coffee

nl:Sequentie-alignering sv:Linjering vi:Bắt cặp trình tự zh:序列比對