Topics: Dynamic Programming & Pairwise Alignment of Sequences

Dynamic Programming

For problems that can be broken into subproblems (and those subproblems occur mulitple times), dynamic programming is often a good approach to compute the answer quickly. The textbook introduces the Manhattan Tourist Problem, which we will implement, and then use the framework developed for it to implement the Needleman-Wunsch algorithm for global pairwise sequence alignment.

Implementing the Manhattan Tourist:

The textbook introduces the Manhattan Tourist problem, where you have a grid of streets with landmarks, and you would like to see as many as possible, without retracing your steps. For example, you start at Columbus Circle (59th Street and 8th Avenue) and finish at 42nd Street and 3rd Avenue:

We can apply this approach to an arbitrarily labeled grid (assume the upper corner has coordinates (0,0) and each increases and the coordinates increase as you progress south and east:

The steps we outlined were:

  1. Store best path so far in a grid (array).
  2. Fill in the first row and column separately (since there's only one direction from which to reach those cells)
  3. Fill in the second row (using the first).
  4. Fill in the third row (using the second).
  5. Repeat until all rows are filled.
  6. The most number of landmarks you can visit is in the bottom right cell.

To do this, we will need:

Let's fill in the steps above (full version, including trace back below is at tourist.py. First, we need to set up the variables:
import numpy as np

#The number of landmarks going east and south from a location:
east = np.array( [[3,2,4,0],[3,2,4,2],[0,7,3,4],[3,3,0,2],[1,3,2,2]] )
south = np.array( [[1,0,2,4,3],[4,6,5,2,1],[4,4,5,2,1],[5,6,8,5,3]] )
size = 5
best = np.zeros( (size,size) )

#Check to make sure this matches figure:
print "east:", east
print "south:", south

Now, let's fill out first row and column of the grid:

so that the upper corner is 0, and the elements in the first column depend just on the elements above them, and similarly, the elements in the first row depend on those to their left:

best[0,0] = 0
for i in range(1,size):
     best[0,i] = best[0,i-1] + east[0,i-1]
     best[i,0] = best[i-1,0] + south[i-1,0]

#Check that it works:
print best     

To fill in the rest of the array, we use the larger of the total seen entering going east and entering going south:

#Fill in the rest of the array:
for i in range(1,size):
     for j in range(1,size):
          best[i,j] = max(best[i,j-1] + east[i,j-1], best[i-1,j] + south[i-1,j])

#Check that it works:
print best

Once we check that the above works, then we can add in the code that "reads off" the path from computatations. The easiest way to do this is to create a second matrix, traceback that keeps track of the direction from which the best path enters each element. To the top of the file, let's add code that creates a 2D array of dimensions (size,size) to hold the traceback. numpy arrays only take numbers, but when figuring out the path, it is easier to think of words "UP" and "LEFT" (then numbers. To make the code easier to read, we will create variables called UP = 1 and LEFT = -1 and use those:

traceback = np.zeros( (size,size) )
UP = 1
LEFT = -1

Now, when we compute the values for best, let's also keep track of where they came:

#Set up first row and first column to gap:
best[0,0] = 0
for i in range(1,size):
     best[0,i] = best[0,i-1] + east[0,i-1]
     best[i,0] = best[i-1,0] + south[i-1,0]
     traceback[i,0] = UP
     traceback[0,i] = LEFT

print traceback

#Fill in the rest of the array:
for i in range(1,size):
     for j in range(1,size):
          best[i,j] = max(best[i,j-1] + east[i,j-1], best[i-1,j] + south[i-1,j])
          if best[i,j-1] + east[i,j-1] > best[i-1,j] + south[i-1,j]:
               traceback[i,j] = LEFT
          else:
               traceback[i,j] = UP

print traceback

Once we have the traceback, we can use it to create the path. We'll start at the lower left corner (i.e. (i,j) = (size-1,size-1)) and then we write down the value we stored in the traceback for it, go to that point, write down its value, and repeat until we reach the origin:

path = ""
i = size-1
j = size-1
while i > 0 or j > 0:
     if traceback[i,j] == UP:
          path = "south " + path
          i = i - 1
     else:
          path = "east " + path
          j = j -1

print path

Check to make sure that your path is correct before continuing to sequence alignment.

Alignment of Sequences

The same underlying idea works for aligning sequences. Below, we will implement Needleman-Wunsch's algorithm for global alignment of sequences (a complete version of this program is in globalAlign.py.

Global Alignment of Sequences

Let's modify the previous program to implement the Smith-Waterman algorithm. We'll follow the same format as for the Manhattan Tourist problem. First, we will set up variables for: Here's the first code that does this:
import numpy as np

#Some test sequences:
seq1 = "TATATAGG"
seq2 = "TAGAG"

size1 = len(seq1)+1
size2 = len(seq2)+1
best = np.zeros( (size1,size2) )
traceback = np.zeros( (size1,size2) )

UP = 1
LEFT = -1
DIAG = 2

For alignment, we need to specify the gap penalty (delta) and the scoring function, (sigma(x,y)):

delta = 1

def sigma(x,y):
     if x == y:
          return 1
     else:
          return -1

Now that we have the preliminaries set up, let's fill in the first row and column of the matrices:

#Set up first row and first column to gap:
best[0,0] = 0
for i in range(1,size1):
     best[i,0] = best[i-1,0] - delta
     traceback[i,0] = UP

for i in range(1,size2):
     best[0,i] = best[0,i-1] - delta
     traceback[0,i] = LEFT

print best
print traceback

Before continuing, print out the matrices to make sure they are correct.

Now, as above, we use a nested loop to fill in the remaining entries. In addition to LEFT and UP from the Manhattan Tourist Problem, we also are allowed to travel diagonally (DIAG) which gives one more clause to the definitions:

#Fill in the rest of the array:
for i in range(1,size1):
     for j in range(1,size2):
          best[i,j] = max(best[i,j-1] - delta, \
                          best[i-1,j] - delta, \
                          best[i-1,j-1] + sigma(seq1[i-1],seq2[j-1]))
          if best[i-1,j-1] + sigma(seq1[i-1],seq2[j-1]) == best[i,j]:
               traceback[i,j] = DIAG
          elif best[i-1,j] - delta == best[i,j]:
               traceback[i,j] = UP
          else:
               traceback[i,j] = LEFT
print "best: ", best
print "traceback: ", traceback

Again, before continuing, print out the matrices to make sure they are correct.

Once we have found the path, we need to read off the alignment. As in the above case, we start in the lower right hand corner and work our way back to (0,0):

path1 = ""
path2 = ""
i = size1-1
j = size2-1
while i > 0 or j > 0:
     if traceback[i,j] == UP:
          path1 = seq1[i-1] + path1
          path2 = "-" + path2
          i = i - 1
     elif traceback[i,j] == LEFT:
          path1 = "-" + path1
          path2 = seq2[j-1] + path2
          j = j -1
     else:
          path1 = seq1[i-1] + path1
          path2 = seq2[j-1] + path2
          i -= 1
          j -= 1

print "An alignment is:"
print path1
print path2

Experiment with different sequences to verify that the alignment works.

Local Alignment of Sequences


(Image from Paul Reiners, IBM, 2008)

The Smith-Waterman Algorithm for local alignment is very similar to our local alignment code. It has the following changes:

Challenge

Lab Report

For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:

Target Date: 4 May 2016
Title: Lab 13: Sequence Alignment
Name & Email:

Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and intepretation of your results here.

Using Rosalind

This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.