Lab 15, RGGS 670: Algorithmic Approaches to Biological Data, Spring 2016

Sampling Treespace

In today's lab, we will be a naive tree search program that takes as input a DNA sequences and will search for the most parsimonious tree. There are many sophisticated programs to do this; our programs are very simple but have the same underlying concepts of the more complicated ones. We will first score trees, sampled at random in the space, and then search the space using the NNI tree rearrangement operation. For example, below are all 945 possible trees on 7 leaves scored by sequences from army ants. The optimal tree is in the middle of the image:

We will first sample the space by computing the parsimony score for 1000 random trees. Our basic approach is:

def treeSample(sequences):

Input: A set of n sequences of length k
Output: The best scoring tree we could find after sampling 1000 trees.

For 1000 steps:
Choose randomly a tree.
Score the tree.
If better score than bestScore,
choose it to be the current tree
Return bestTree and score.

Random Starting Trees

The first thing needed for our treeSample() (and later the treeSearch()) function is to construct a random starting tree from the sequences. It will take a list of sequences and create a tree from it. We will use the same structure for storing trees as in the previous lab. Recall, for each node of the tree, we stored:

		nodeName : [parentName, label, [child1,child2,...]]

in a dictionary. As before, we will assume that there's a node named "root" in our dictionary and that all leaves (tips) have an empty list for children (since they have no children).

To build up a random tree, we will make a list of nodes that are lacking parents (parentless). While there are parent-less nodes, we will choose two at random to be children of a new node and place that new node in the list of parent-less nodes. Each time we do this we reduce the number of parent-less nodes by one. For example, assume we start with the list of species in our tree dictionary and nodes without parents as:

	tree = {"ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]]}
	parentless = ["ape", "baboon", "chimp", "dog", "elephant", "fox"]

Our random build approach is to take two at random, say "dog" and "fox", and make them children of a new node "i1":

	tree = {"i1" : ["", "", ["dog", "fox"]],
	        "ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]] }

Since "dog" and "fox" now have a parent, we remove them from the parentless list, and add in "i1" since it does not have a parent:

	parentless = ["ape", "baboon", "chimp", "elephant", "i1"]

Since there are still nodes without parents, we repeat the process, choosing two more at random. Say, we choose "baboon" and "i1". Then our tree dictionary and parent-less list would be:

	tree = {"i2" : ["", "", ["baboon", "i1"]],
	        "i1" : ["", "", ["dog", "fox"]],
	        "ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]] }
	parentless = ["ape", "chimp", "elephant", "i2"]

We continue since there are more nodes without parents. Say, we choose at random: "ape" and "chimp", we then have:

	tree = {"i3" : ["", "", ["ape", "chimp"]],
	        "i2" : ["", "", ["baboon", "i1"]],
	        "i1" : ["", "", ["dog", "fox"]],
	        "ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]] }
	parentless = ["elephant", "i2", "i3"]

And again (with random choices "i2" and "i3"):

	tree = {"i4" : ["", "", ["i2", "i3"]],
	        "i3" : ["", "", ["ape", "chimp"]],
	        "i2" : ["", "", ["baboon", "i1"]],
	        "i1" : ["", "", ["dog", "fox"]],
	        "ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]] }
	parentless = ["elephant", "i4"]

Now, since there's only two left, our "random" choice will be "elephant" and "i4". Instead of giving it an arbitrary internal node name, we will call it "root" since it is not possible for it to have a parent:

	tree = {"root" : ["", "", ["elephant", "i4"]],
	        "i4" : ["", "", ["i2", "i3"]],
	        "i3" : ["", "", ["ape", "chimp"]],
	        "i2" : ["", "", ["baboon", "i1"]],
	        "i1" : ["", "", ["dog", "fox"]],
	        "ape" : ["","",[]], 
	        "baboon" : ["","",[]], 
	        "chimp" : ["","",[]], 
	        "dog" : ["","",[]], 
	        "elefant" : ["","",[]], 
	        "fox" : ["","",[]] }

def buildRandomTree(sequenceList):

Input: A set of n sequences of length k.
Output: A tree dictionary with the sequences as the leaves.

Put all leaves into a dictionary.
Also create a "To Do" list, parentless, of nodes without parents.
While len(parentless) > 2:
Make a new internal node and choose two from parentless to be its children.
Remove children from parentless.
Add new node to parentless.
Join the remaining nodes in parentless to a new node called root.

Let's implement this in Python. Copy the outline to an editor, and make each of these lines into a comment. Next, fill in the easiest parts (one possible way to do this in treeSearch.py). Here is a very simple list of sequences to use for testing your function:

seqList = [("ape","A A A A A"),
           ("baboon","A A A A G"),
           ("chimp","C A A A A"),
           ("dog","T G G G G"),
           ("elephant","T G A C A"),
           ("fox","A A G G C")]

Trees in Newick Format

To make sure everything is working, we will add a function that convert trees into Newick format, so, we see the tree format more easily. In parsimonyScore.py, there is a convertNewick() that returns a string with the tree in Newick format. It does so, by starting at the root, adding a left parenthesis, its left child, then a comma, followed by its right child, and a right parenthesis. If the left or right child is not a leaf, it calls itself to compute the string:

def convertNewick(t,n):

Input: A tree dictionary, t, and node, n.
Output: A string comtaining the subtree at node n in Newick format.

If n is a leaf of t:
return n
Else:
left = convertNewick(t, left child of n)
right = convertNewick(t, right child of n)
s = "("+left+","+right+")"
return s

Try changing this pseudocde into Python (full code in treeSearch.py) and running on a tree created from your buildRandomTree() function.

Scoring Trees

In the last lab, we built functions to compute the parsimony score of a tree with respect to sequences (available as parsimonyScore.py). We can use those functions by importing the file into our program for this lab.

import parsimonyScore as ps

t = #Fill in with a tree#
score = ps.scoreTree(t)

We'll use the scoring from previous lab, to figure out the scores of the new tree. If they're better than the current best score, we'll update the best score and the best tree, bestSoFar.

Putting It Altogether

Now that we can build random trees and score them (from the previous lab), we can sample the treespace to find optimal trees. Let's fill in the outline from above. We start by making our outline into comments:

def treeSample(sequences):
    #For 1000 steps:
	    #Score the tree.
	    #If better score than bestScore, 
	         #choose it to be the current tree
    #Return best tree and score.

We need to start off our best score and tree (so we have something to compare to in the loop). We will use the parsimony scoring function from before, so, will import that file at the start:

import parsimonyScore as ps

def treeSample(sequences):
    bestTree = buildRandomTree(sequences)
    bestScore = ps.scoreTree(bestTree)
    #For 1000 steps:
    for i in range(1000):
        #Choose randomly a tree.
        newTree = buildRandomTree(sequences)
        #Score the tree.        
        newScore = ps.scoreTree(newTree)
        
        #We'll add in a print to see what's going on:
	print convertNewick(newTree, newScore)
		
	#If better score than bestScore, 
	if newScore < bestScore:
	    #choose it to be the current tree
	    bestScore = newScore
	    bestTree = newTree
    #Return best tree and score.
    return bestTree, bestScore

We can run this on our sample list of sequences to make sure it works:

    #Sample list of (species, sequence) pairs:    
    seqList = [("ape","A A A A A"),
               ("baboon","A A A A G"),
               ("chimp","C A A A A"),
               ("dog","T G G G G"),
               ("elephant","T G A C A"),
               ("fox","A A G G C")]
    
    #Test the tree sampling function:
    bestTree, bestScore = treeSample(seqList)
    print "\nBest from sample:", convertNewick(bestTree,"root"),":", bestScore

Searching Treespace

The idea for our search, is to start at a random tree. We score that tree to get a starting value. We then make a list of its neighbors, and choose the one that scores better than your current tree. We continue until we have visited 1000 tree neighborhoods or until you're stuck (i.e. local optima).

Recall from lecture that there are three common ways to explore treespace:

Original Tree Nearest Neighbor Interchange Subtree Prune & Regraph Tree Bisection & Reconnection


Original Tree	Nearest Neighbor Interchange	Subtree Prune & Regraph	Tree Bisection & Reconnection

Since SPR and TBR neighborhoods are larger, they are the most commonly used to explore treespace. We will use the NNI rearrangement since it is simpler to implement.

Here's our basic approach:

def treeSearch(sequences):

Input: A set of n sequences of length k
Output: The best scoring tree we could find after 1000 steps through the NNI search space.

Choose randomly a tree to be bestSoFar.
Score the tree, bestSoFar.
For 1000 steps:
Make a list of all the NNI neighbors of bestSoFar
if any of the NNI neighbors of bestSoFar have better score,
choose it to be the current tree
else:
break (stuck in local minima)
Print number of steps, tree, and score.

Let's fill in the pieces (making functions that we can call for the tasks) and then we will return to put all this together to make a running search program.

Neighbors of Trees

We're in the rooted case, so, we'll take each internal node, and exchange the place of its sibling with each of its children to give two new trees.

def nniNeighbor(t, node):

Input: A tree, t, and an internal node, node.
Output: Two trees that result from switching the nodes' sibling with its children.

Let sib be the sibling of node.
Let kid1, kid2 be the children of node.
Make two (deep) copies of t: tree1 and tree2.
Switch sib and kid1 in tree1.
Switch sib and kid2 in tree2.
Return tree1 and tree2.

While the idea is straightforward, the actual code to do this is quite tedious, as you have to keep track of who is the parent of whom and then update all the respective parents and children. Similarly, the function that finds the optimal NNI neighbor is conceptually easy, but quite tedious to code. Both have been written for you in treeSearch.py:

nniNeighbor(t,node): returns the two trees that neighbor t around node.
maxNeighbor(t): takes a tree and returns the best scoring tree among its NNI neighbors.

Challenges

Try the treeSample() program on the following data sets:
- flat.py: 10 random builds of your on a set of identical sequences that has a flat landscape. You can import the file (or copy over to your program).
  Note: all trees will have the exact same score, so all are valid optima.
- comp.py: a set of compatible sequences that has a very smooth landscape.
- rugged.py: a set of sequences that has a more rugged landscape.
(Optional:) Implement the treeSearch() function from the outline above, using the functions written in this lab.
(Optional:) Test your treeSearch() on the following datasets:
- flat.py
- comp.py.
  Note: the search should stop fairly quickly. There is exactly one global optimum. It has the possibility of being trapped on "plateaus" (since we're using NNI tree rearrangements) but more likely that most of the 10 runs will reach the global optima.
- rugged.py.
  Note: there are several local optima, and it is likely that the 10 runs will reach the different optima.