Topics: Assembling Short Reads: Overlap Graphs & de Bruijn Graphs
How do you assemble a genome from reads?
Let's rephrase as a computational question:
That says the same thing but in language a computer scientist could translate into the skeleton of a function:
def assemble(k_mers): """ Takes k_mers and returns a sequence """ sequence = [] ###More stuff goes here!### return sequence
The function currently doesn't do much (just returns an empty list). Before we fill in the details, let's write some code that will get input from a file (since it will get tedious typing in the reads while we're debugging the code). The function should take a file name as an input and return a list of the reads in the file:
def getReadsFromFile(infile): f = open(infile) #Depending on the file, could have extra newlines at end, strip off: data = f.read().rstrip() return data.split("\n")
Test to make sure it's working with simpleReads.txt and textbookReads.txt (i.e. try: print getReadsFromFile("simpleReads.txt") which should return a list of the 4 reads in the file).
What's next? Here's the steps we'll need to fill in the assemble() function (based on algorithms described in Chapter 3 of the book or Compeau, Pevzner, & Tesler 2011). The function should:
We'll use adjacency lists and dictionaries to implement the de Bruijn graphs. Recall that for a de Bruijn graph:
The entries in our dictionary have the following structure:
(prefix,[(read1,suffix1,count1),(read2,suffix2,count2),...])If there are outgoing edges (that is there's a suffix that never occurs as a prefix), the last part of the tuple is an empty list.
Here is an example of a very simple graph (basically just a path):
Since some nodes have the same labels, they're distinguished by color (we'll use numbers when needed since they're easier to represent in the computer).
The labels of the vertices are the suffix/prefix of reads, so, those will be the keys to the dictionary storing the adjacency list. For each key, we need to keep information about its outgoing edges. For example, the read TAA would generate two dictionary entries:
"TA" : ["AA"] "AA" : []where TA and AA are the keys for values are (possibly empty) lists of tuples. To simplify things, we will just keep the label of the destintation node (since combining the source and destination labels gives back the read). What happens if the next read is AAT? That corresponds to an edge AA → AT. The prefix, AA is already in the dictionary. For AA, we add an outgoing edge to AT. For AT, we add a new entry with no outgoing edges to give a dictionary:
{"TA" : ["AA"], "AA" : ["AT"], "AT" : [] }What happens when we add the read ATG? It's prefix is AT, and it's suffix is TG. Since AT is already in the dictionary, we add to its existing entry. TG is not, so, we create a new entry:
{"TA" : ["AA"], "AA" : ["AT"], "AT" : ["TG"], "TG" : [] }Let's say we see the read ATG again. Since we care about how many times reads are seen, we'll add it in as a separate entry (unlike most graphs where we only allow one edge between any pair of vertices):
{"TA" : ["AA"], "AA" : ["AT"], "AT" : ["TG","TG"], "TG" : [] }Why do this? Each of these edges gives a possible "path" through the completed sequence. Keeping these entries separate will make finding complete sequences easier.
Rewriting the steps we did to create the new entries (and adding a few to set things up properly):
createDeBruijnGraph(reads):
Now that it's in pseudocode, it's straightforward to translate this into Python. First, copy the outline into an editor, and try to fill in as many steps as possible before checking the code in deBruijn.py.
A cycle that traverses each edge of a graph exactly once is called an Eulerian cycle, and we say that a graph containing such a cycle is Eulerian. When graphs are balanced (all nodes have the same number of incoming and outgoing edges), then there's an Eulerian cycle. Our graphs are almost balanced: the starting and the ending nodes are not balanced. This can be solved by just connecting them. It's a bit tedious but the idea is to find the node with no outgoing edges and then find the one with no incoming edges, and link them (see balancedGraph() in deBruijn.py for details).
We've completed the hard work of setting up the (balanced) de Bruijn graph. Let's use it to find possibilities for the reconstructed sequence. The following algorithm constructs an Eulerian cycle in an arbitrary directed graph:
eulerianCycle(graph)
Form a cycle Cycle by randomly walking in Graph (don't visit the same edge twice!)
We already have the graph structure set up, so, let's make a copy of it, unexplored, that we can use to cross off edges when we crossed them.
Our algorithm is going to start at a random place and keep walking until stuck. For example, say we have traversed the green highlighted edges as numbered below:
When Leo the ant (yes, that's what the book calls him) reaches the blue node, he can follow edge 5 (only unvisited edge leaving the blue node) and then go on to edge If it's visited every edge, then we're done. But what if it hasn't:
Instead of tossing out all the work we've done and starting from scratch, let's build on what we have, namely let's keep the green cycle above and find some node in the cycle that has an unvisited edge:
We can restart the walk at the red node above and visit a new cycle and then glue it together with our previous work (the cycle colored in green). We can keep repeating this process until the all the edges are visiting.
Let's write the code for this. First set up a function that takes a graph as input and returns an empty list for the cycle:
def eulerianCycle(graph): cycle = [] return cycleTest this to make sure there's no typos, and then let's add in more. We need a copy of the graph that we can modify to keep track of the edges left to explore. To make a copy of a dictionary, use the copy() function:
def eulerianCycle(graph): """ Form a cycle Cycle by randomly walking in Graph (don't visit the same edge twice!) """ #Put all edges into the unexplored edges: unexplored = graph.copy() cycle = [] return cycleTest that, and let's choose a random graph to start the cycle. For dictionaries, we can remove something from the dictionary with popitem() (if a node has multiple neighbors, we'll put the ones we don't use back).
def eulerianCycle(graph): """ Form a cycle Cycle by randomly walking in Graph (don't visit the same edge twice!) """ #Put all edges into the unexplored edges: unexplored = graph.copy() #Grab an edge from graph to start off the cycle: key, value = unexplored.popitem() #Use that as the start of our cycle: cycle = [key,value[0]] #Add back to the dictionary if there's > 1 outgoing edges if len(value) > 1: unexplored[key] = value[1:] print "The graph:", graph print "The unexplored edges:, unexplored print "The cycle", cycle return cycleSince dictionaries are stored in arbitrary order, different runs of the program might give you different edges chosen for your cycle. Test and make sure you get the unexplored graph has one less edge (namely the one in cycle) than graph.
How do we extend the cycle? Let's look again at Figure 4.23:
The next edge extends out from the last node in the cycle. For us, that's stored in cycle[-1]. So, let's check if it has any outgoing edges, and if so, use it to extend the cycle (after making sure one step forward works, we'll add it to a loop to repeat it):
#Check if you can go extend the cycle: if cycle[-1] in unexplored: neighbors = unexplored.pop(cycle[-1]) if len(neighbors) > 0: if len(neighbors) > 1: #Put back the remaining unvisited edges: unexplored[cycle[-1]] = neighbors[1:] #Add to cycle cycle.append(neighbors[0])
What if the last node in the cycle has no unexplored outgoing edges? Then, we need to look at other nodes in the cycle and find one that does:
In code, this would be:
#Select a node i in cycle with still unexplored edges. else: for i in range(len(cycle)): print i, cycle if cycle[i] in unexplored: neighbors = unexplored.pop(cycle[i]) if len(neighbors) > 0: if len(neighbors) > 1: #Put back the remaining unvisited edges: unexplored[cycle[-1]] = neighbors[1:] #Reorder cycle to put i at the end: cycle = cycle[:i] + cycle[i:] #Add to cycle cycle.append(neighbors[0]) break
That's all the pieces, we now need to put the if-else statement that looks for the next node in a while loop:
#While there are unexplored edges in graph: while unexplored: #Check if you can go extend the cycle: if cycle[-1] in unexplored: ... #Select a node i in cycle with still unexplored edges. else: ...Inside the loop, we're removing an edge from unexplored every iteration, so, the structure is getting smaller and eventually while unexplored will be false (so it will stop!).
Here's all the pieces together (also in deBruijn.py)
def eulerianCycle(graph): """ Form a cycle Cycle by randomly walking in Graph (don't visit the same edge twice!) """ #Put all edges into the unexplored edges: unexplored = graph.copy() #Grab an edge from graph to start off the cycle: key, value = unexplored.popitem() #Use that as the start of our cycle: cycle = [key,value[0]] #Add back to the dictionary if there's > 1 outgoing edges if len(value) > 1: unexplored[key] = value[1:] #While there are unexplored edges in graph: while unexplored: #Check if you can go extend the cycle: if cycle[-1] in unexplored: neighbors = unexplored.pop(cycle[-1]) if len(neighbors) > 0: if len(neighbors) > 1: #Put back the remaining unvisited edges: unexplored[cycle[-1]] = neighbors[1:] #Add to cycle cycle.append(neighbors[0]) #Select a node i in cycle with still unexplored edges. else: for i in range(len(cycle)): print i, cycle if cycle[i] in unexplored: neighbors = unexplored.pop(cycle[i]) print "neighbors", neighbors if len(neighbors) > 0: #Reorder cycle to put i at the end: cycle = cycle[:i] + cycle[i:] if len(neighbors) > 1: #Put back the remaining unvisited edges: unexplored[cycle[-1]] = neighbors[1:] #Add to cycle cycle.append(neighbors[0]) break return cycle
Try running the code on several different sets of reads to make sure it works.
Adding in the functions we wrote, we have a program that takes reads and prints out the cycle (still to do: convert the cycle into a finished sequence):
def assemble(k_mers): """ Takes k_mers and returns a sequence """ g = createDeBruijnGraph(reads) print "The graph is: ", g balanceGraph(g) print "The balanced graph is: ", g print "\n\nBuilding up the Eulerian cycle" c = eulerianCycle(g) print "\n\nFound the cycle:", c #Missing step: convert the cycle c into the sequence: return c
Composition(3,"TATGGGGTGC") = {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}Note that the k-mers can appear in any order.
Generate the k-mer composition of a string.
For example, with the data:
ACCGA CCGAA CGAAG GAAGC AAGCTThe output should be:
ACCGAAGCT
For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:
Target Date: 20 April 2016
Title: Lab 11: Assembling Reads
Name & Email:
Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.
This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.