Lab 14, RGGS 670: Algorithmic Approaches to Biological Data, Spring 2016

Fitch's Algorithm

We will implement Fitch's algorithm for scoring trees. To simplify our implementation, we will assume that the tree is already stored in an adjacency list, following the format we used previously of each key has a value of a list of the parent, the label, and the children (if any).

Our algorithm will:

Input: tree implemented with an adjacency list and parents, keep labels at the node.
Output: the parsimony score of the tree.

Our algorithm operates in two passes:

First Pass: Starting at the leaves, label the internal leaves (with possible multiple labels).
Second Pass: Starting at the root, choose a labeling, then work towards the leaves minimizing the conflicts.

Trees will be stored as an adjacency list where the underlying structure is a dictionary. Entries of the dictionary are of the form:

		nodeName : [parentName, label, [child1,child2,...]]

If our tree is called, t, then we can access the value stored for the "root" by:

		print t["root"]

which will print out a list with the parent's name (an empty string in the case of the root), its label (will start out as an empty string for non-leaf nodes), and its children. A possible output would be:

		["","", ["i1", "a"]]

where "i1" and "a" are names of other nodes in the tree. The file smallPars.py contains a tree (as well as the code developed in this lab):

tree = {"root" : ["","",["i1","i2"]],
        "i1": ["root","", ["a", "i3"]], 
        "i2": ["root","", ["b", "c"]],
        "i3": ["i1","", ["d","e"]],
        "a": ["i1","T A A A A", []],
        "b": ["i2","A A A A G", []],
        "c": ["i2","C A A A T", []],
        "d": ["i3","A A A A A", []],
        "e": ["i3","T A A A A", []]}

Note that we're separating the labels for each basepair by space. This is to make it easier to tell which label is which when we have multiple possibilties during the passes of the algorithm.

Try printing out different nodes in the tree to make sure that there's no typos.

We're going to build up the program piece-by-piece. The first piece that needs to be built is a function that takes the labels of two siblings and returns a possible labeling for their parent. Since we will repeat this with every pair of siblings in the trees, we will take the code that labels the parent from a child and make a function. As a function, we can test it separately to make sure it works. For example, if the labels of siblings are: A T A T G and A A T T G, then the function should compare position by position and return A AT AT T G.

Let's start by creating a function that takes two inputs and returns a test message (we will alternate between testing and adding to the function):

def newLabel(l1,l2):
	 """
	 Takes two labels and returns the label of their parent.
	 """
	 
	 return "testing newLabel()"

Try calling the function several times with different inputs to make sure there are no typographical errors.

Next we need to split up the labels into separate positions and store in a list. Let's also create a new list to store the labels as we create them

     #Split up the labels into separate positions:
     b1 = l1.split(" ")
     b2 = l2.split(" ")
     #Create a new list to store the parent labels:
     newL = []

To make sure that this works, change the return statement to be

     return b1

Also try with b2. Once you've removed all the errors, go on to building the labels for the parents.

We want to compare the labels for each position, so, we'll use a for loop. If the two sequences agree at a position, then we'll add it directly to newL.

	
     for i in range(len(b1)):
          if b1[i] == b2[i]:
               newL.append(b1[i])

But what if they don't? Then the else clause takes the intersection of the corresponding sets. If the intersection is empty, the union is the label stored in newL. If the intersection is not empty, store it as the label in newL:

	
          else:
               s1 = set(b1[i])
               s2 = set(b2[i])
               print "s1:", s1
               print "s2:", s2
               intersect = s1.intersection(s2)
               if len(intersect) == 0:
                    union = s1.union(s2)
                    newL.append("".join(union))
               else:
                    newL.append("".join(intersect))

The function returns the label for the parent as a string:

     return " ".join(newL)

Try running this function on several pairs to verify that it gives the correct label for the parent.

First Pass

Now that we have a function that will produce labels for the parents of pairs of nodes, we can use it to make a first pass through the tree. Again, let's break it into pieces. This time, we'll have cases based on whether the node is a leaf or not: firstPass(t,n)

If n is a leaf of t: return it's label. Else: run firstPass on its kids Use newLabel to compute a new label for the parent and return it.

Think about how to write this, and then look at the code below:

def firstPass(t, n):
     #Assumes all leaves have labels
     #Check if there's a label for the node
     if t[n][1] != "":
          return t[n][1]
     else:
          #If not, get the label of it's kids:
          #Assumes binary trees
          kid1 = t[n][2][0]
          kid2 = t[n][2][1]
          newL = newLabel(firstPass(t,kid1),firstPass(t,kid2))
          t[n][1] = newL
          return newL

Second Pass

The pseudocode for the second pass is similar to the first pass, but we start at the root and work towards the leaves: secondPass(t,n) If n is the root of t: choose a label for each basepair from the current labeling. Else: Force a labeling on each kid (minimizing conflict with the parent). Run secondPass on its kids

Let's write the function that takes a parent (with a proper labeling) and a child, and chooses a new labeling for the child that minimizes conflict. We will call this function, forceLabels(t, labels, kid) where labels is the labeling of the parent:

def forceLabels(t, labels, kid):
     kLabels = t[kid][1].split(" ")
     for i in range(len(labels)):
          if kLabels[i] != labels[i]:
               s1 = set(kLabels[i])
               s2 = set(labels[i])
               intersect = s1.intersection(s2)
               if len(intersect) == 0:
                    #No overlap, so, just choose one
                    kLabels[i] = kLabels[i][:1]
               else:
                    #Overlap, choose one to use as label:
                    kLabels[i] = intersect.pop()
     t[kid][1] = " ".join(kLabels)

Now, we can use that in computing the second pass:

def secondPass(t, n):
     #Assumes all leaves have labels
     #Check if there's a label for the node
     score = 0
     labels = t[n][1].split(" ")
     if n == "root":
          for i in range(len(labels)):
               #For root, if more than one label, use just the first one:
               labels[i] = labels[i][:1]
          t[n][1] = " ".join(labels)
     if len(t[n][2]) > 0:
          #There's children, so, not a leaf
          #Assumes binary trees
          kid1 = t[n][2][0]
          kid2 = t[n][2][1]          
          #Force the labels for the kids:
          x1 = forceLabels(t, labels, kid1)
          x2 = forceLabels(t, labels, kid2)
          print "# mismatches between parent & kids:", x1,x2
          #Repeat for the kids
          secondPass(t, kid1)
          secondPass(t, kid2)

Challenge

The current program labels the internal nodes in a minimal fashion, but does not compute the final score of the tree. Extend your program to print out the parsimony score of the tree.

(Hint: Keep a running total of the score as you compute it in forceLabeling and in secondPass and return that number.)