Topics: Implementing Fitch's Algorithm for the Small Parsimony Problem
Our algorithm will:
Our algorithm operates in two passes:
Trees will be stored as an adjacency list where the underlying structure is a dictionary. Entries of the dictionary are of the form:
nodeName : [parentName, label, [child1,child2,...]]
If our tree is called, t, then we can access the value stored for the "root" by:
print t["root"]which will print out a list with the parent's name (an empty string in the case of the root), its label (will start out as an empty string for non-leaf nodes), and its children. A possible output would be:
["","", ["i1", "a"]]where "i1" and "a" are names of other nodes in the tree. The file smallPars.py contains a tree (as well as the code developed in this lab):
tree = {"root" : ["","",["i1","i2"]], "i1": ["root","", ["a", "i3"]], "i2": ["root","", ["b", "c"]], "i3": ["i1","", ["d","e"]], "a": ["i1","T A A A A", []], "b": ["i2","A A A A G", []], "c": ["i2","C A A A T", []], "d": ["i3","A A A A A", []], "e": ["i3","T A A A A", []]}
Note that we're separating the labels for each basepair by space. This is to make it easier to tell which label is which when we have multiple possibilties during the passes of the algorithm.
Try printing out different nodes in the tree to make sure that there's no typos.
We're going to build up the program piece-by-piece. The first piece that needs to be built is a function that takes the labels of two siblings and returns a possible labeling for their parent. Since we will repeat this with every pair of siblings in the trees, we will take the code that labels the parent from a child and make a function. As a function, we can test it separately to make sure it works. For example, if the labels of siblings are: A T A T G and A A T T G, then the function should compare position by position and return A AT AT T G.
Let's start by creating a function that takes two inputs and returns a test message (we will alternate between testing and adding to the function):
def newLabel(l1,l2): """ Takes two labels and returns the label of their parent. """ return "testing newLabel()"
Try calling the function several times with different inputs to make sure there are no typographical errors.
Next we need to split up the labels into separate positions and store in a list. Let's also create a new list to store the labels as we create them
#Split up the labels into separate positions: b1 = l1.split(" ") b2 = l2.split(" ") #Create a new list to store the parent labels: newL = []
To make sure that this works, change the return statement to be
return b1Also try with b2. Once you've removed all the errors, go on to building the labels for the parents.
We want to compare the labels for each position, so, we'll use a for loop. If the two sequences agree at a position, then we'll add it directly to newL.
for i in range(len(b1)): if b1[i] == b2[i]: newL.append(b1[i])But what if they don't? Then the else clause takes the intersection of the corresponding sets. If the intersection is empty, the union is the label stored in newL. If the intersection is not empty, store it as the label in newL:
else: s1 = set(b1[i]) s2 = set(b2[i]) print "s1:", s1 print "s2:", s2 intersect = s1.intersection(s2) if len(intersect) == 0: union = s1.union(s2) newL.append("".join(union)) else: newL.append("".join(intersect))The function returns the label for the parent as a string:
return " ".join(newL)
Try running this function on several pairs to verify that it gives the correct label for the parent.
Now that we have a function that will produce labels for the parents of pairs of nodes, we can use it to make a first pass through the tree. Again, let's break it into pieces. This time, we'll have cases based on whether the node is a leaf or not: firstPass(t,n)
def firstPass(t, n): #Assumes all leaves have labels #Check if there's a label for the node if t[n][1] != "": return t[n][1] else: #If not, get the label of it's kids: #Assumes binary trees kid1 = t[n][2][0] kid2 = t[n][2][1] newL = newLabel(firstPass(t,kid1),firstPass(t,kid2)) t[n][1] = newL return newL
Let's write the function that takes a parent (with a proper labeling) and a child, and chooses a new labeling for the child that minimizes conflict. We will call this function, forceLabels(t, labels, kid) where labels is the labeling of the parent:
def forceLabels(t, labels, kid): kLabels = t[kid][1].split(" ") for i in range(len(labels)): if kLabels[i] != labels[i]: s1 = set(kLabels[i]) s2 = set(labels[i]) intersect = s1.intersection(s2) if len(intersect) == 0: #No overlap, so, just choose one kLabels[i] = kLabels[i][:1] else: #Overlap, choose one to use as label: kLabels[i] = intersect.pop() t[kid][1] = " ".join(kLabels)Now, we can use that in computing the second pass:
def secondPass(t, n): #Assumes all leaves have labels #Check if there's a label for the node score = 0 labels = t[n][1].split(" ") if n == "root": for i in range(len(labels)): #For root, if more than one label, use just the first one: labels[i] = labels[i][:1] t[n][1] = " ".join(labels) if len(t[n][2]) > 0: #There's children, so, not a leaf #Assumes binary trees kid1 = t[n][2][0] kid2 = t[n][2][1] #Force the labels for the kids: x1 = forceLabels(t, labels, kid1) x2 = forceLabels(t, labels, kid2) print "# mismatches between parent & kids:", x1,x2 #Repeat for the kids secondPass(t, kid1) secondPass(t, kid2)
For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:
Target Date: 11 May 2016
Title: Lab 14: Parsimony Scores
Name & Email:
Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and intepretation of your results here.
This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.