Topics: Implementing Fitch's Algorithm for the Small Parsimony Problem

Fitch's Algorithm

We will implement Fitch's algorithm for scoring trees. To simplify our implementation, we will assume that the tree is already stored in an adjacency list, following the format we used previously of each key has a value of a list of the parent, the label, and the children (if any).

Our algorithm will:

Our algorithm operates in two passes:

Trees will be stored as an adjacency list where the underlying structure is a dictionary. Entries of the dictionary are of the form:

		nodeName : [parentName, label, [child1,child2,...]]

If our tree is called, t, then we can access the value stored for the "root" by:

		print t["root"]
which will print out a list with the parent's name (an empty string in the case of the root), its label (will start out as an empty string for non-leaf nodes), and its children. A possible output would be:
		["","", ["i1", "a"]]
where "i1" and "a" are names of other nodes in the tree. The file contains a tree (as well as the code developed in this lab):
tree = {"root" : ["","",["i1","i2"]],
        "i1": ["root","", ["a", "i3"]], 
        "i2": ["root","", ["b", "c"]],
        "i3": ["i1","", ["d","e"]],
        "a": ["i1","T A A A A", []],
        "b": ["i2","A A A A G", []],
        "c": ["i2","C A A A T", []],
        "d": ["i3","A A A A A", []],
        "e": ["i3","T A A A A", []]}

Note that we're separating the labels for each basepair by space. This is to make it easier to tell which label is which when we have multiple possibilties during the passes of the algorithm.

Try printing out different nodes in the tree to make sure that there's no typos.

We're going to build up the program piece-by-piece. The first piece that needs to be built is a function that takes the labels of two siblings and returns a possible labeling for their parent. Since we will repeat this with every pair of siblings in the trees, we will take the code that labels the parent from a child and make a function. As a function, we can test it separately to make sure it works. For example, if the labels of siblings are: A T A T G and A A T T G, then the function should compare position by position and return A AT AT T G.

Let's start by creating a function that takes two inputs and returns a test message (we will alternate between testing and adding to the function):

def newLabel(l1,l2):
	 Takes two labels and returns the label of their parent.
	 return "testing newLabel()"

Try calling the function several times with different inputs to make sure there are no typographical errors.

Next we need to split up the labels into separate positions and store in a list. Let's also create a new list to store the labels as we create them

     #Split up the labels into separate positions:
     b1 = l1.split(" ")
     b2 = l2.split(" ")
     #Create a new list to store the parent labels:
     newL = []

To make sure that this works, change the return statement to be

     return b1
Also try with b2. Once you've removed all the errors, go on to building the labels for the parents.

We want to compare the labels for each position, so, we'll use a for loop. If the two sequences agree at a position, then we'll add it directly to newL.

     for i in range(len(b1)):
          if b1[i] == b2[i]:
But what if they don't? Then the else clause takes the intersection of the corresponding sets. If the intersection is empty, the union is the label stored in newL. If the intersection is not empty, store it as the label in newL:
               s1 = set(b1[i])
               s2 = set(b2[i])
               print "s1:", s1
               print "s2:", s2
               intersect = s1.intersection(s2)
               if len(intersect) == 0:
                    union = s1.union(s2)
The function returns the label for the parent as a string:
     return " ".join(newL)

Try running this function on several pairs to verify that it gives the correct label for the parent.

First Pass

Now that we have a function that will produce labels for the parents of pairs of nodes, we can use it to make a first pass through the tree. Again, let's break it into pieces. This time, we'll have cases based on whether the node is a leaf or not: firstPass(t,n)

  1. If n is a leaf of t: return it's label.
  2. Else: run firstPass on its kids
  3. Use newLabel to compute a new label for the parent and return it.
Think about how to write this, and then look at the code below:
def firstPass(t, n):
     #Assumes all leaves have labels
     #Check if there's a label for the node
     if t[n][1] != "":
          return t[n][1]
          #If not, get the label of it's kids:
          #Assumes binary trees
          kid1 = t[n][2][0]
          kid2 = t[n][2][1]
          newL = newLabel(firstPass(t,kid1),firstPass(t,kid2))
          t[n][1] = newL
          return newL

Second Pass

The pseudocode for the second pass is similar to the first pass, but we start at the root and work towards the leaves: secondPass(t,n)
  1. If n is the root of t:
  2.     choose a label for each basepair from the current labeling.
  3. Else:
  4.     Force a labeling on each kid (minimizing conflict with the parent).
  5.     Run secondPass on its kids

Let's write the function that takes a parent (with a proper labeling) and a child, and chooses a new labeling for the child that minimizes conflict. We will call this function, forceLabels(t, labels, kid) where labels is the labeling of the parent:

def forceLabels(t, labels, kid):
     kLabels = t[kid][1].split(" ")
     for i in range(len(labels)):
          if kLabels[i] != labels[i]:
               s1 = set(kLabels[i])
               s2 = set(labels[i])
               intersect = s1.intersection(s2)
               if len(intersect) == 0:
                    #No overlap, so, just choose one
                    kLabels[i] = kLabels[i][:1]
                    #Overlap, choose one to use as label:
                    kLabels[i] = intersect.pop()
     t[kid][1] = " ".join(kLabels)
Now, we can use that in computing the second pass:
def secondPass(t, n):
     #Assumes all leaves have labels
     #Check if there's a label for the node
     score = 0
     labels = t[n][1].split(" ")
     if n == "root":
          for i in range(len(labels)):
               #For root, if more than one label, use just the first one:
               labels[i] = labels[i][:1]
          t[n][1] = " ".join(labels)
     if len(t[n][2]) > 0:
          #There's children, so, not a leaf
          #Assumes binary trees
          kid1 = t[n][2][0]
          kid2 = t[n][2][1]          
          #Force the labels for the kids:
          x1 = forceLabels(t, labels, kid1)
          x2 = forceLabels(t, labels, kid2)
          print "# mismatches between parent & kids:", x1,x2
          #Repeat for the kids
          secondPass(t, kid1)
          secondPass(t, kid2)


  • The current program labels the internal nodes in a minimal fashion, but does not compute the final score of the tree. Extend your program to print out the parsimony score of the tree.

    (Hint: Keep a running total of the score as you compute it in forceLabeling and in secondPass and return that number.)
  • Lab Report

    For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:

    Target Date: 11 May 2016
    Title: Lab 14: Parsimony Scores
    Name & Email:

    Purpose: Give summary of what was done in this lab.
    Procedure: Describe step-by-step what you did (include programs or program outlines).
    Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
    Discussion: Give a short explanation and intepretation of your results here.

    Using Rosalind

    This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.