Topics: Searching for Optimal Trees
In today's lab, we will be a naive tree search program that takes as input a DNA sequences and will search for the most parsimonious tree. There are many sophisticated programs to do this; our programs are very simple but have the same underlying concepts of the more complicated ones. We will first score trees, sampled at random in the space, and then search the space using the NNI tree rearrangement operation. For example, below are all 945 possible trees on 7 leaves scored by sequences from army ants. The optimal tree is in the middle of the image:
We will first sample the space by computing the parsimony score for 1000 random trees. Our basic approach is:
def treeSample(sequences):
The first thing needed for our treeSample() (and later the treeSearch()) function is to construct a random starting tree from the sequences. It will take a list of sequences and create a tree from it. We will use the same structure for storing trees as in the previous lab. Recall, for each node of the tree, we stored:
nodeName : [parentName, label, [child1,child2,...]]in a dictionary. As before, we will assume that there's a node named "root" in our dictionary and that all leaves (tips) have an empty list for children (since they have no children).
To build up a random tree, we will make a list of nodes that are lacking parents (parentless). While there are parent-less nodes, we will choose two at random to be children of a new node and place that new node in the list of parent-less nodes. Each time we do this we reduce the number of parent-less nodes by one. For example, assume we start with the list of species in our tree dictionary and nodes without parents as:
tree = {"ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]]} parentless = ["ape", "baboon", "chimp", "dog", "elephant", "fox"]Our random build approach is to take two at random, say "dog" and "fox", and make them children of a new node "i1":
tree = {"i1" : ["", "", ["dog", "fox"]], "ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]] }Since "dog" and "fox" now have a parent, we remove them from the parentless list, and add in "i1" since it does not have a parent:
parentless = ["ape", "baboon", "chimp", "elephant", "i1"]
Since there are still nodes without parents, we repeat the process, choosing two more at random. Say, we choose "baboon" and "i1". Then our tree dictionary and parent-less list would be:
tree = {"i2" : ["", "", ["baboon", "i1"]], "i1" : ["", "", ["dog", "fox"]], "ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]] } parentless = ["ape", "chimp", "elephant", "i2"]
We continue since there are more nodes without parents. Say, we choose at random: "ape" and "chimp", we then have:
tree = {"i3" : ["", "", ["ape", "chimp"]], "i2" : ["", "", ["baboon", "i1"]], "i1" : ["", "", ["dog", "fox"]], "ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]] } parentless = ["elephant", "i2", "i3"]And again (with random choices "i2" and "i3"):
tree = {"i4" : ["", "", ["i2", "i3"]], "i3" : ["", "", ["ape", "chimp"]], "i2" : ["", "", ["baboon", "i1"]], "i1" : ["", "", ["dog", "fox"]], "ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]] } parentless = ["elephant", "i4"]Now, since there's only two left, our "random" choice will be "elephant" and "i4". Instead of giving it an arbitrary internal node name, we will call it "root" since it is not possible for it to have a parent:
tree = {"root" : ["", "", ["elephant", "i4"]], "i4" : ["", "", ["i2", "i3"]], "i3" : ["", "", ["ape", "chimp"]], "i2" : ["", "", ["baboon", "i1"]], "i1" : ["", "", ["dog", "fox"]], "ape" : ["","",[]], "baboon" : ["","",[]], "chimp" : ["","",[]], "dog" : ["","",[]], "elefant" : ["","",[]], "fox" : ["","",[]] }
def buildRandomTree(sequenceList):
seqList = [("ape","A A A A A"), ("baboon","A A A A G"), ("chimp","C A A A A"), ("dog","T G G G G"), ("elephant","T G A C A"), ("fox","A A G G C")]
To make sure everything is working, we will add a function that convert trees into Newick format, so, we see the tree format more easily. In parsimonyScore.py, there is a convertNewick() that returns a string with the tree in Newick format. It does so, by starting at the root, adding a left parenthesis, its left child, then a comma, followed by its right child, and a right parenthesis. If the left or right child is not a leaf, it calls itself to compute the string:
def convertNewick(t,n):
Try changing this pseudocde into Python (full code in treeSearch.py) and running on a tree created from your buildRandomTree() function.
In the last lab, we built functions to compute the parsimony score of a tree with respect to sequences (available as parsimonyScore.py). We can use those functions by importing the file into our program for this lab.
import parsimonyScore as ps t = #Fill in with a tree# score = ps.scoreTree(t)
We'll use the scoring from previous lab, to figure out the scores of the new tree. If they're better than the current best score, we'll update the best score and the best tree, bestSoFar.
Now that we can build random trees and score them (from the previous lab), we can sample the treespace to find optimal trees. Let's fill in the outline from above. We start by making our outline into comments:
def treeSample(sequences): #For 1000 steps: #Score the tree. #If better score than bestScore, #choose it to be the current tree #Return best tree and score.
We need to start off our best score and tree (so we have something to compare to in the loop). We will use the parsimony scoring function from before, so, will import that file at the start:
import parsimonyScore as ps def treeSample(sequences): bestTree = buildRandomTree(sequences) bestScore = ps.scoreTree(bestTree) #For 1000 steps: for i in range(1000): #Choose randomly a tree. newTree = buildRandomTree(sequences) #Score the tree. newScore = ps.scoreTree(newTree) #We'll add in a print to see what's going on: print convertNewick(newTree, newScore) #If better score than bestScore, if newScore < bestScore: #choose it to be the current tree bestScore = newScore bestTree = newTree #Return best tree and score. return bestTree, bestScore
We can run this on our sample list of sequences to make sure it works:
#Sample list of (species, sequence) pairs: seqList = [("ape","A A A A A"), ("baboon","A A A A G"), ("chimp","C A A A A"), ("dog","T G G G G"), ("elephant","T G A C A"), ("fox","A A G G C")] #Test the tree sampling function: bestTree, bestScore = treeSample(seqList) print "\nBest from sample:", convertNewick(bestTree,"root"),":", bestScore
The idea for our search, is to start at a random tree. We score that tree to get a starting value. We then make a list of its neighbors, and choose the one that scores better than your current tree. We continue until we have visited 1000 tree neighborhoods or until you're stuck (i.e. local optima).
Recall from lecture that there are three common ways to explore treespace:
Original Tree | Nearest Neighbor Interchange | Subtree Prune & Regraph | Tree Bisection & Reconnection |
---|
Since SPR and TBR neighborhoods are larger, they are the most commonly used to explore treespace. We will use the NNI rearrangement since it is simpler to implement.
Here's our basic approach:
def treeSearch(sequences):
We're in the rooted case, so, we'll take each internal node, and exchange the place of its sibling with each of its children to give two new trees.
def nniNeighbor(t, node):
While the idea is straightforward, the actual code to do this is quite tedious, as you have to keep track of who is the parent of whom and then update all the respective parents and children. Similarly, the function that finds the optimal NNI neighbor is conceptually easy, but quite tedious to code. Both have been written for you in treeSearch.py:
For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:
Target Date: 11 May 2016
Title: Lab 15: Tree Search
Name & Email:
Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.
This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.