Topics: Dynamic Programming & Pairwise Alignment of Sequences
We can apply this approach to an arbitrarily labeled grid (assume the upper corner has coordinates (0,0) and each increases and the coordinates increase as you progress south and east:
The steps we outlined were:
To do this, we will need:
import numpy as np #The number of landmarks going east and south from a location: east = np.array( [[3,2,4,0],[3,2,4,2],[0,7,3,4],[3,3,0,2],[1,3,2,2]] ) south = np.array( [[1,0,2,4,3],[4,6,5,2,1],[4,4,5,2,1],[5,6,8,5,3]] ) size = 5 best = np.zeros( (size,size) ) #Check to make sure this matches figure: print "east:", east print "south:", south
Now, let's fill out first row and column of the grid:
so that the upper corner is 0, and the elements in the first column depend just on the elements above them, and similarly, the elements in the first row depend on those to their left:
best[0,0] = 0 for i in range(1,size): best[0,i] = best[0,i-1] + east[0,i-1] best[i,0] = best[i-1,0] + south[i-1,0] #Check that it works: print best
To fill in the rest of the array, we use the larger of the total seen entering going east and entering going south:
#Fill in the rest of the array: for i in range(1,size): for j in range(1,size): best[i,j] = max(best[i,j-1] + east[i,j-1], best[i-1,j] + south[i-1,j]) #Check that it works: print best
Once we check that the above works, then we can add in the code that "reads off" the path from computatations. The easiest way to do this is to create a second matrix, traceback that keeps track of the direction from which the best path enters each element. To the top of the file, let's add code that creates a 2D array of dimensions (size,size) to hold the traceback. numpy arrays only take numbers, but when figuring out the path, it is easier to think of words "UP" and "LEFT" (then numbers. To make the code easier to read, we will create variables called UP = 1 and LEFT = -1 and use those:
traceback = np.zeros( (size,size) ) UP = 1 LEFT = -1
Now, when we compute the values for best, let's also keep track of where they came:
#Set up first row and first column to gap: best[0,0] = 0 for i in range(1,size): best[0,i] = best[0,i-1] + east[0,i-1] best[i,0] = best[i-1,0] + south[i-1,0] traceback[i,0] = UP traceback[0,i] = LEFT print traceback #Fill in the rest of the array: for i in range(1,size): for j in range(1,size): best[i,j] = max(best[i,j-1] + east[i,j-1], best[i-1,j] + south[i-1,j]) if best[i,j-1] + east[i,j-1] > best[i-1,j] + south[i-1,j]: traceback[i,j] = LEFT else: traceback[i,j] = UP print traceback
Once we have the traceback, we can use it to create the path. We'll start at the lower left corner (i.e. (i,j) = (size-1,size-1)) and then we write down the value we stored in the traceback for it, go to that point, write down its value, and repeat until we reach the origin:
path = "" i = size-1 j = size-1 while i > 0 or j > 0: if traceback[i,j] == UP: path = "south " + path i = i - 1 else: path = "east " + path j = j -1 print path
Check to make sure that your path is correct before continuing to sequence alignment.
The same underlying idea works for aligning sequences. Below, we will implement Needleman-Wunsch's algorithm for global alignment of sequences (a complete version of this program is in globalAlign.py.
import numpy as np #Some test sequences: seq1 = "TATATAGG" seq2 = "TAGAG" size1 = len(seq1)+1 size2 = len(seq2)+1 best = np.zeros( (size1,size2) ) traceback = np.zeros( (size1,size2) ) UP = 1 LEFT = -1 DIAG = 2
For alignment, we need to specify the gap penalty (delta) and the scoring function, (sigma(x,y)):
delta = 1 def sigma(x,y): if x == y: return 1 else: return -1
Now that we have the preliminaries set up, let's fill in the first row and column of the matrices:
#Set up first row and first column to gap: best[0,0] = 0 for i in range(1,size1): best[i,0] = best[i-1,0] - delta traceback[i,0] = UP for i in range(1,size2): best[0,i] = best[0,i-1] - delta traceback[0,i] = LEFT print best print traceback
Before continuing, print out the matrices to make sure they are correct.
Now, as above, we use a nested loop to fill in the remaining entries. In addition to LEFT and UP from the Manhattan Tourist Problem, we also are allowed to travel diagonally (DIAG) which gives one more clause to the definitions:
#Fill in the rest of the array: for i in range(1,size1): for j in range(1,size2): best[i,j] = max(best[i,j-1] - delta, \ best[i-1,j] - delta, \ best[i-1,j-1] + sigma(seq1[i-1],seq2[j-1])) if best[i-1,j-1] + sigma(seq1[i-1],seq2[j-1]) == best[i,j]: traceback[i,j] = DIAG elif best[i-1,j] - delta == best[i,j]: traceback[i,j] = UP else: traceback[i,j] = LEFT print "best: ", best print "traceback: ", traceback
Again, before continuing, print out the matrices to make sure they are correct.
Once we have found the path, we need to read off the alignment. As in the above case, we start in the lower right hand corner and work our way back to (0,0):
path1 = "" path2 = "" i = size1-1 j = size2-1 while i > 0 or j > 0: if traceback[i,j] == UP: path1 = seq1[i-1] + path1 path2 = "-" + path2 i = i - 1 elif traceback[i,j] == LEFT: path1 = "-" + path1 path2 = seq2[j-1] + path2 j = j -1 else: path1 = seq1[i-1] + path1 path2 = seq2[j-1] + path2 i -= 1 j -= 1 print "An alignment is:" print path1 print path2
Experiment with different sequences to verify that the alignment works.
(Image from Paul Reiners, IBM, 2008)
The Smith-Waterman Algorithm for local alignment is very similar to our local alignment code. It has the following changes:
For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:
Target Date: 4 May 2016
Title: Lab 13: Sequence Alignment
Name & Email:
Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and intepretation of your results here.
This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.