Computational Biology & Genomicssuzanne Komili

Computational Biology & GenomicsSuzanne Komili

Section 2: Memory Storage, Recursion, Sequence AlignmentOctober 1, 2002

Outline:

How Computers Store Information
Sequence Alignment
Dot Matrix Analysis
Dynamic Programming
Scoring Matrices
BLAST

I) How computers store information

Binary = Number system in base 2

Base 10: 1,234 = 1x103 + 2x102 + 3x101 x 4x100

Base 2: 10110 = 1x24 + 0x23 + 1x22 + 1x21 + 0x20 = 16 + 4 + 2 = 22 (in decimal)

Each binary digit is referred to as a "bit", eight bits equal one "byte"

A 10-digit binary number can hold 210 = 1,024 different values

Example: What is the minimum amount of memory that can hold the Boston phonebook white pages? (Assume:700 pages, 400 names/page, 16 letters/name, 7 decimal digits/phone #)

Step 1: The 26 letters of the alphabet can be encoded in how many bits?

Step 2: How many bits to encode a 7 digit decimal number?

Step 3: How many numbers & letters are needed total?

II) Sequence Alignment

What is sequence alignment?

The procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the set of sequences.

In an optimal alignment, non-identical characters and gaps are placed to line up as many identical or similar characters as possible.

Why sequence alignment?

Sequences that are very much alike, or “similar” in the parlance of sequence analysis, probably have the same function/structure.

Understanding evolutionary events: mutations, insertions and deletions

Types of sequence alignment

Methods of sequence alignment

Dot matrix analysis

The dynamic programming algorithm

k-mer based methods (BLAST and FASTA)

III) Dot Matrix Analysis

Procedure:

A / G / C / T / A / G / G / A
G / X / X / X
A / X / X / X
C / X
T / X
A / X / X / X
G / X / X / X
G / X / X / X
C / X

Advantage:

Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods.

Disadvantage/Limitation:

Most dot matrix computer programs do not show an actual alignment. Does not return a score to indicate how ‘optimal’ a given alignment is.

IV) Dynamic Programming

Recursion (taken from CS50 notes)

The basic idea behind recursive algorithms is to break down a problem into smaller pieces that you either already know the answer to or can solve by applying the same algorithm to each piece and then joining together the results. Recursive algorithms break down problems into smaller versions of the same problem. Each recursive algorithm has two cases:

The base case

The base case is a problem that you either already know the answer to or can solve using some other method.

The recursive case

The recursive case is a problem that can be broken down into “smaller” pieces that can be solved individually. The solutions are then combined into the answer for the whole problem. Note that the recursive case must move towards the base by reducing the “size” of the problem.

Example: Factorials

Base case: 1! = 1

Recursive case: n! = n*(n-1)!

10! = 10*9! = 10*9*8! = … = 10*9*…*2*1! = 10*9*…*2*1 = 3,628,800

Background

Dynamic programming (DP) algorithms are a general class of algorithms typically applied to optimization problems.

For DP to be applicable, an optimization problem must have two key ingredients:

Optimal substructure – an optimal solution to the problem contains within it optimal solutions to sub-problems
Overlapping sub-problems – the pieces of larger problem have a sequential dependency.

i.e, you can find the best solution to the problem by breaking it into pieces and solving each small piece first, and then merging the solutions of all pieces.

DP works by first solving every sub-sub-problem just once, and saves its answer in a table, thereby avoiding the work of re-computing the answer every time the sub-sub-problem is encountered. Each intermediate answer is stored with a score, and DP finally chooses the sequence of solution that yields the highest score.

(From Post-Genome Informatics, M. Kanehisa, p. 68)

Both global and local types of alignments may be made by simple changes in the basic DP algorithm.

Alignments depend on the choice of a scoring system for comparing character pairs and penalty scores (e.g. PAM and BLOSUM matrixes – covered later)

Scoring functions – example:

 (match) = +2 or substitution matrix

 (mismatch) = -1 or substitution matrix

 (gap) = -3

Global Alignment (Needleman-Wunsch)

General goal is to obtain optimal global alignment between two sequences, allowing gaps.

From Durbin et al., p 19-20:

We construct a matrix F indexed by i and j, one index for each sequence, where the value F(i,j) is the score of the best alignment between the initial segment x1…iof x up to xiand the initial segment y1…jof y up to yj. … We begin by initializing F(0,0) = 0. We then proceed to fill the matrix from top left to bottom right. If F(i-1, j-1), F(i-1,j) and F(i, j-1)are known, it is possible to calculate F(i,j).

F(i,j) = max { F(i-1, j-1) + s(xi, yj);

F(i-1,j) – d;

F(i, j-1) – d.}

where s(a,b) is the likelihood score that residues a and b occur as an aligned pair, and d is the gap penalty.

Once you construct the matrix, you trace back the path that leads to F(n,m), which is by definition the best score for an alignment of x1…n to y1…m.

Example: Dynamic programming by hand. Align two sequences ACACTA and AGCACACA and find the optimal global alignment using the following scoring matrix:

 (match) = 2

 (mismatch) = -1

 (gap) = -3

- / A / C / A / C / T / A
-
A
G
C
A
C
A
C
A

Local alignment (Smith-Waterman)

Two changes from global alignment:

Possibility of taking the value 0 if all other options have value less than 0. This corresponds to starting a new alignment.
Alignments can end anywhere in the matrix, so instead of taking the value in the bottom right corner, F(n,m) for the best score, we look for the highest value of F(i,j) over the whole matrix and start the trace-back from there.

F(i,j) = max {0;

F(i-1, j-1) + s(xi, yj);

F(i-1,j) – d;F(i, j-1) – d.}

Example: Dynamic programming by hand. Align two sequences ACACTA and AGCACACA and find the optimal local alignment using the following scoring matrix:

 (match) = 2

 (mismatch) = -1

 (gap) = -3

- / A / C / A / C / T / A
-
A
G
C
A
C
A
C
A

Advantage:

Guaranteed in a mathematical sense to provide the optimal (very best or highest-scoring) alignment for a given set of scoring functions.

Disadvantages:

Slow due to the very large number of computational steps: O(n2)

Computer memory requirement also increases as the square of the sequence lengths.

Therefore, it is difficult to use the method for very long sequences.

V) Scoring Matrices

A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment of related sequences.

Not all substitutions are created equal:

Mutations tend to favor a given set of substitutions

Selections tend to favor a (non-identical) set of substitutions

Commonly used scoring matrices:

PAM (Percent Accepted Mutation) matrices

Ranks AA substitutions on how well they’re “accepted” by natural section, without changing the function of the protein
Derived from global alignments of closely related sequences (85% identical).

Based on a mutational model of evolution – matrices for greater evolutionary distances are extrapolated from those for lesser ones.
(Assumption: more distant changes are a reflection of the short-term changes occurring over and over again.)
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances. Created via matrix-multiplication of PAM1 (e.g., PAM50 = (PAM1)50 ).
Examples: D  E = 3; D  C = -5; D  L = -4

PAM250 BLOSUM62

BLOSUM (Blocks Amino Acid Substitution Matrices)

Based on a much larger data set than the PAM matrices
Derived from local, ungappedalignments ofbiochemically relatedsequences (sequences can be similar OR different)
All matrices are directly calculated; no extrapolations are used
The number after the matrix (BLOSUM62) refers to the upper limit ofsequence similarity in the alignments used to compile the matrices.

BLOSUM vs. PAM

BLOSUM: more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.
PAM: more tolerant of substitutions to or from hydrophilic residues
The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches
The BLOSUM matrices are more suitable for similarity searches in databases than PAM matrices
PAM incorporates evolutionary model in result (i.e. amount of divergence causing sequence differences)

VI) BLAST (Basic Local Alignment Search Tool)

Web server at

Has excellent information and tutorial

Database searching for similar sequences:

Typical application: Given a new sequence and a database of known sequences, try to find which sequences are related to the new sequence, with a measure of relatedness.

Most sequences will be completely unrelated to the query

Individual alignments needs not be perfect, can always find-tune later

Comparison must be fast – BLAST and FASTA are at least 50 times faster than dynamic programming

DNA vs. protein sequences – always translate DNA sequences that encode proteins into protein sequences before performing a database search.

BLAST and FASTA follow a heuristic (tried-and-true) method that always works to find related sequence in a database search but does not have the underlying guarantee of an optimal solution like the DP algorithm.

BLAST algorithm

Underlying idea: True match alignments are very likely to contain somewhere within them a short stretch of identities, or very high scoring matches.

K-mer “words”: create lists of “words” of length K (default is 3AA’s, 11 nucleotides) from every possible starting point, e.g.: ALKTFL has ‘words’ ALK, LKT, KTF, TFL

K-mer hashing: content-based indexing

For every k-mer, list every location in the database where it occurs. Process database once.
Hash the query into overlapping k-mers and look them (as well as “neighborhood words” above threshold T) up in the database – seeds.
BLAST2 – use lower threshold, connect “close” seeds into single long sequences

 Extend seeds until the cumulative score drops

High-scoring Segment Pair = locally maximal segment above a given threshold

Evaluate statistical significance of HSP

Bit score: independent of the scoring matrix used
E-value: “Expected number of High-Scoring Segment Pairs”; E-values of 0.1 or 0.05 are typically used as cutoffs in sequence database searches

Acknowledgement: This handout includes material written by Yonatan Grad, Doug Selinger, and Zhou Zhu.