Lab 4, RGGS 670: Algorithmic Approaches to Biological Data, Spring 2016

Useful String Methods

In lecture, we introduced some very useful methods for strings:

s.count(pattern): returns the number of times the pattern string occurs in string s.
s.find(pattern): returns the location (index) of the first place the pattern string occurs in string s, or -1 if it does not occur.
s.replace(old,new): returns a copy of s in which all occurrences of the old string are replaced with the new string (s itself is unchanged).
indices: s[i] is the character at index i of the string s, counting from 0.
slicing: s[start:stop:step] is the substring of s beginning at index start and going up by step, up to but not including stop.
s[-1::-1] or s[::-1] returns the reverse of the string.
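
A quick way to check your understanding of these methods is to try them on a short string. The sketch below uses a made-up sequence (the string and variable names are ours, chosen only for illustration):

```python
s = "GATTACAGATT"   # hypothetical example sequence

n_gat = s.count("GAT")        # number of non-overlapping occurrences of "GAT"
first_tac = s.find("TAC")     # index of the first "TAC"
no_match = s.find("CCC")      # find() gives -1 when the pattern is absent
as_rna = s.replace("T", "U")  # a new string; s itself is not changed
first_base = s[0]             # indices start at 0
every_other = s[2:8:2]        # characters at indices 2, 4, 6
backwards = s[::-1]           # the reversed string

print(as_rna)
print(backwards)
```

Here n_gat is 2, first_tac is 3, every_other is "TAA", and the two printed lines are GAUUACAGAUU and TTAGACATTAG.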

Does Optimal Living Temperature Affect GC Content?

Let's use these methods to address the following hypothesis:

Since C and G form 3 hydrogen bonds while A and T form 2 hydrogen bonds, the stronger C-G bonds will occur more frequently in thermophiles (organisms that thrive in extremely hot environments).

This is a much debated question (see Zeldovich et al., 2007 and Li et al., 2014) whose answer varies depending on the region. Since the whole genome files are quite large, let's run a small test using FASTA sequence files for small regions of 2 thermophiles and 2 mesophiles (downloaded from EnsemblBacteria and stored as a zip file).

Download and open the zip file. There should be 4 FASTA sequence files inside.
Open up one of the FASTA files and look at the format. How can you tell where the sequence is versus comments or labels? Is the sequence on a single line or multiple lines?
Write a program that will compute the percentage of base pairs that are either a 'G' or a 'C' in each file. (Hint: it will be a very short program!).
How would you modify your program to handle multiple sequences in one file (see Rosalind, #8)?
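
If you get stuck, here is one possible shape for the pieces, not the only way to do it. It is a minimal sketch that works on text already held in a string; reading each downloaded file is left to you. The function names gc_percent and read_fasta and the two example records are our own inventions, and we assume that header lines start with '>' and that every other non-blank line is sequence. Single-argument print() is used so the sketch also runs under Python 3.

```python
def gc_percent(seq):
    """Return the percentage of characters in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # Any other character (A, T, N, ...) counts toward the length only.
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def read_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            name = line[1:]          # a new record begins
            records[name] = ""
        elif name is not None:
            records[name] += line    # sequences may span several lines
    return records


# Made-up records, only to show the two functions working together.
example = """>seq1 hypothetical record
ATGC
GGCC
>seq2 hypothetical record
ATAT
"""

for name, seq in sorted(read_fasta(example).items()):
    print("{0}: {1:.1f}% GC".format(name, gc_percent(seq)))
```

Keeping the FASTA reading separate from the counting means the same gc_percent works whether a file holds one sequence or many.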

Regular Expressions

Regular expressions are extremely powerful for finding patterns that vary in length. For example, to find every run of 4 to 100 consecutive A's and T's (in any order), we can use a regular expression:

import re
dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG" 
runs = re.findall("[AT]{4,100}", dna) 
print runs
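
Note that the character class [AT] matches a single A or a single T, so the pattern above finds mixed runs such as ATTATAT rather than strict ATATAT repeats. The sketch below contrasts the two; the non-capturing group (?:AT) and the open-ended quantifier {2,} are standard re syntax, while the variable names and the threshold of 2 repeats are our own choices (this short sequence has no longer AT repeat).

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

# Runs of 4 or more characters, each of which is an A or a T.
mixed_runs = re.findall(r"[AT]{4,}", dna)

# The two-letter unit "AT" repeated 2 or more times in a row.
at_repeats = re.findall(r"(?:AT){2,}", dna)

print(mixed_runs)
print(at_repeats)
```

For this sequence, mixed_runs is ['ATTATAT', 'TTATA'] and at_repeats is ['ATAT'].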

To check whether a pattern occurs anywhere in a string, we can use:

if re.search(r"GC[ATGC]GC", dna):
    print "restriction site found!"

(more examples from Python for Biologists).

The re library is distributed with Python. We will use two useful functions in the library:

re.search(pattern, string) returns a match object if the pattern occurs in the string, and None otherwise (so it can be used in an if statement). We often put an "r" before the pattern to indicate that we want the "raw" string (i.e. don't translate escape sequences such as '\n', but keep the backslash and the letter as raw characters).
re.findall(pattern, string) finds all occurrences of pattern in the string. It returns a list of the matching patterns.
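
To see what these two points mean in practice, the small sketch below checks what re.search() hands back for a pattern that occurs and one that does not, and what the "r" prefix changes. The pattern GGGG and the variable names are ours, picked only because GGGG is absent from the example sequence.

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

found = re.search(r"GC[ATGC]GC", dna)   # a match object: the pattern occurs
missing = re.search(r"GGGG", dna)       # None: the pattern does not occur

has_site = found is not None
has_gggg = missing is not None

# Without the r prefix, "\n" is one newline character;
# with it, r"\n" is two characters: a backslash and the letter n.
plain_length = len("\n")
raw_length = len(r"\n")
```

Here has_site is True, has_gggg is False, plain_length is 1, and raw_length is 2.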

We often want more information than just whether a pattern occurred. To find out the starting (and stopping) location of a match, we can use the match object that re.search() returns. Its most useful methods are:

group(): returns the string matched by the regular expression
start(): returns the starting position of the match
end(): returns the ending position of the match

(see Python regex tutorial for more details).

From our example above, we could store the match object:

m = re.search(r"GC[ATGC]GC", dna)
print "The matching string is", m.group()
print "Match starts at", m.start()
print "Match ends at", m.end()
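
re.search() reports only the first match. If you want the location of every match, one option is re.finditer(), which is part of the re library but was not covered in lecture; it yields one match object per non-overlapping match. A minimal sketch, reusing the A/T pattern from earlier (the list name hits is ours):

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

hits = []
for m in re.finditer(r"[AT]{4,100}", dna):
    # Each m has the same group(), start() and end() methods described above.
    hits.append((m.start(), m.end(), m.group()))

for start, end, text in hits:
    print("{0} found at {1}-{2}".format(text, start, end))
```

As with slicing, end() is one past the last matched character, so dna[start:end] gives back the matched string.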

These are more general (and more powerful) tools than the string methods above. In many cases, either can be used. For finding approximate matches or matches of varying lengths, using regular expressions is much easier. Today's lab has some simple uses of regular expressions. We will go into more depth when we dive into the genome assembly in the second half of the course.

Seeing the Trees in the Nexus File

Use regular expressions to find all trees in a nexus file. Nexus files have a very strict format, and you can assume that all trees are preceded by the word TREE, followed by their name and an equals sign ('='). The tree, in Newick format, is given, followed by a semi-colon (;). Trees can span multiple lines. A sample file:

#NEXUS
BEGIN TAXA;
  TAXLABELS A B C;
END;

BEGIN TREES;
  TREE tree1 = ((A,B),C);
  TREE tree2 = (A,(B,C));
  TREE tree3 = ((A,C),A);
END;

Write a program that takes as input a nexus file and prints to the screen all trees contained in the file. Your program should print the name of each tree, followed by the equals sign and the tree itself. You should not print the preceding "TREE" or ending semi-colon (although both are very useful for locating trees in the file). For example, given the nexus file above, your program would print:

Welcome to my tree-hunting program!
Please enter the name of your nexus file:  sample.nex

Trees in your file:
tree1 = ((A,B),C)
tree2 = (A,(B,C))
tree3 = ((A,C),A)

Hints:

Write down an outline of what your program will do.
Next, think through what it will do on several examples. Will it work on the example above?
What happens if there are multiple trees in the file? Will it find them all? How does it go on to the next?
Write the program in small pieces: first, read in the file and just print it back out (to make sure the reading is working correctly).
Next, add in the other pieces you've sketched out, testing each part as you go (it sounds like it will take longer, but it's quicker to eliminate bugs one-by-one than a whole swarm of them).

Note: you can implement this program with either regular expressions or find()-- use whichever makes more sense to you.
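
If you choose the regular expression route, the core of the search might look something like the sketch below. It is one possible approach under several assumptions of ours: the keyword is written in capitals as TREE, tree names contain only letters, digits and underscores, and the Newick strings contain no quoted labels with spaces (all whitespace inside a tree is removed so that multi-line trees print on one line). The function name find_trees is hypothetical, and reading the file and prompting the user are left to you.

```python
import re

def find_trees(nexus_text):
    """Return a list of 'name = newick' strings for each TREE statement."""
    # TREE, a name, an equals sign, then everything up to the next semi-colon.
    # [^;] also matches newlines, so trees that span several lines are found.
    # "BEGIN TREES;" is skipped because TREE must be followed by whitespace.
    pattern = r"\bTREE\s+(\w+)\s*=\s*([^;]+);"
    trees = []
    for name, newick in re.findall(pattern, nexus_text):
        compact = "".join(newick.split())   # drop spaces and line breaks
        trees.append("{0} = {1}".format(name, compact))
    return trees


# The sample file from above, with tree2 split over two lines.
sample = """#NEXUS
BEGIN TAXA;
  TAXLABELS A B C;
END;

BEGIN TREES;
  TREE tree1 = ((A,B),C);
  TREE tree2 = (A,
                (B,C));
  TREE tree3 = ((A,C),A);
END;
"""

for tree in find_trees(sample):
    print(tree)
```

Because the pattern has two groups in parentheses, re.findall() returns a list of (name, tree) pairs rather than a list of whole matches.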

Lab Report

For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:

Target Date: 22 February 2016
Title: Lab 4:
Name & Email:

Purpose: Give a summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.