Topics: Useful string methods, Counting Bases, Regular Expressions, Finding motifs in sequences.

Useful String Methods

In lecture, we introduced some very useful methods for strings:

Does Optimal Living Temperature Affect GC Content?

Let's use these methods to address the following hypothesis:
Since C and G pair with 3 hydrogen bonds while A and T pair with 2 hydrogen bonds, the stronger C-G pairs will occur more frequently in thermophiles (organisms that thrive in extremely hot environments).
This is a much debated question (see Zeldovich et al., 2007 and Li et al., 2014) whose answer varies depending on the region. Since the whole genome files are quite large, let's run a small test using FASTA sequence files for small regions of 2 thermophiles and 2 mesophiles (downloaded from EnsemblBacteria and stored as a zip file).
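
As a starting point, here is a minimal sketch of how GC content could be computed with the string methods count() and upper(). The function names gc_content and read_fasta_sequence are our own choices, and the two sequences are made up for illustration (they are not taken from the downloaded files):

```python
# Toy sequences invented for this sketch; they are NOT real genome data
toy_sequences = {
    "toy_thermophile": "GGCCGCATGCGGCC",
    "toy_mesophile": "ATTAGCATTTAAGC",
}

def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()  # count() is case-sensitive, so normalize first
    if len(seq) == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))

def read_fasta_sequence(text):
    """Join the sequence lines of single-record FASTA text, skipping the '>' header."""
    lines = text.strip().split("\n")
    return "".join(line.strip() for line in lines if not line.startswith(">"))

for name in sorted(toy_sequences):
    print("%s: GC content = %.2f" % (name, gc_content(toy_sequences[name])))
```

For the real test, the text of each FASTA file would be passed through read_fasta_sequence() before calling gc_content(). This sketch assumes one sequence per file.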

Regular Expressions

Regular expressions are extremely powerful for finding patterns that vary in length. For example, to find every run of 4 or more consecutive bases that are each A or T, we can use a regular expression:

import re
dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"
# findall() returns a list of every non-overlapping match
runs = re.findall(r"[AT]{4,100}", dna)
print(runs)
To test whether a pattern occurs anywhere in a string, we can use re.search():
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
(more examples from Python for Biologists).
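
Note that a character class such as [AT] matches any single A or T, so [AT]{4,} finds AT-rich runs rather than strict repeats of the two-letter unit AT. A small sketch of the difference follows; the second pattern, (?:AT){2,}, is our own addition for comparison:

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

# [AT]{4,} : 4 or more characters, each of which is an A or a T
at_rich_runs = re.findall(r"[AT]{4,}", dna)

# (?:AT){2,} : the two-letter unit "AT" repeated 2 or more times in a row
at_repeats = re.findall(r"(?:AT){2,}", dna)

print(at_rich_runs)  # ['ATTATAT', 'TTATA']
print(at_repeats)    # ['ATAT']
```

The (?: ... ) form groups the unit without capturing it, so findall() still returns the whole matched string.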

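The re library is distributed with Python. We will use two useful functions in the library: re.findall(), which returns a list of all non-overlapping matches, and re.search(), which returns a match object for the first match (or None if the pattern does not occur).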

We often want more information than just whether a pattern occurred. To find out the starting (and stopping) location, we can use the match object that re.search() returns. Its most useful methods are group(), start(), and end()

(see Python regex tutorial for more details).

From our example above, we could store the match object:

m = re.search(r"GC[ATGC]GC", dna)
print("The matching string is " + m.group())
print("Match starts at %d" % m.start())
print("Match ends at %d" % m.end())
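
re.search() reports only the first match. If we want the location of every match, one option is re.finditer(), which is also part of the re library and yields a match object for each non-overlapping match. A minimal sketch (the variable name site_locations is our own):

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

# Collect (matched text, start, end) for every non-overlapping match
site_locations = []
for m in re.finditer(r"GC[ATGC]GC", dna):
    site_locations.append((m.group(), m.start(), m.end()))
    print("%s found at %d to %d" % (m.group(), m.start(), m.end()))

# re.search() returns None when there is no match, so test before using it
missing = re.search(r"GGGGGG", dna)
if missing is None:
    print("GGGGGG does not occur")
```

As with string slices, end() is one past the last matched character, so dna[m.start():m.end()] gives back the matched text.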

These are more general (and more powerful) tools than the string methods above. In many cases, either can be used. For finding approximate matches or matches of varying lengths, using regular expressions is much easier. Today's lab has some simple uses of regular expressions. We will go into more depth when we dive into the genome assembly in the second half of the course.
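
To make the comparison concrete, here is a small sketch that locates the same site with the string method find() and with re.search(). The variable names are our own:

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

# Fixed pattern: find() returns the index of the first occurrence, or -1
find_index = dna.find("GCTGC")

# The same fixed pattern as a regular expression
regex_index = re.search(r"GCTGC", dna).start()

# Variable pattern: any base in the middle position. With find() this would
# take four separate searches (GCAGC, GCTGC, GCGGC, GCCGC); one regex covers all
variable_index = re.search(r"GC[ATGC]GC", dna).start()

print("find(): %d, re.search(): %d, variable pattern: %d"
      % (find_index, regex_index, variable_index))
```

For a fixed string, find() is the simpler tool. Once the pattern has any flexibility, the regular expression is shorter and easier to check.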

Seeing the Trees in the Nexus File

Use regular expressions to find all trees in a nexus file. Nexus files have a very strict format, and you can assume that each tree is introduced by the word TREE, followed by its name and an equals sign ('='). The tree itself then follows in Newick format and ends with a semi-colon (;). Trees can span multiple lines. A sample file:
#NEXUS
BEGIN TAXA;
  TAXLABELS A B C;
END;

BEGIN TREES;
  TREE tree1 = ((A,B),C);
  TREE tree2 = (A,(B,C));
  TREE tree3 = ((A,C),A);
END;

Write a program that takes a nexus file as input and prints to the screen all trees contained in the file. Your program should print the name of each tree, followed by the equals sign and the tree itself. You should not print the preceding "TREE" or the ending semi-colon (although both are very useful for locating trees in the file). For example, given the nexus file above, your program would print:

Welcome to my tree-hunting program!
Please enter the name of your nexus file:  sample.nex

Trees in your file:
tree1 = ((A,B),C)
tree2 = (A,(B,C))
tree3 = ((A,C),A)

Hints:

  1. Write down an outline of what your program will do.
  2. Next, think through what it will do on several examples. Will it work on the example above?
  3. What happens if there are multiple trees in the file? Will it find them all? How does it go on to the next?
  4. Write the program in small pieces: first, read in the file and just print it back out (to make sure the reading is working correctly).
  5. Next, add in the other pieces you've sketched out, testing each part as you go (it sounds like it will take longer, but it's quicker to eliminate bugs one-by-one than a whole swarm of them).

Note: you can implement this program with either regular expressions or find(); use whichever makes more sense to you.
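
One possible way to approach the pattern-matching step with regular expressions is sketched below. It is only a sketch under the assumptions stated in the problem (upper-case TREE, a name, '=', then the tree up to the next semi-colon). To keep it runnable without a file, the sample above is held in a string, with tree2 split across two lines to show the multi-line case. The names nexus_text, tree_pattern and trees are our own:

```python
import re

# Sample file contents held in a string; tree2 is deliberately split
# across two lines to exercise the multi-line case
nexus_text = """#NEXUS
BEGIN TAXA;
  TAXLABELS A B C;
END;

BEGIN TREES;
  TREE tree1 = ((A,B),C);
  TREE tree2 = (A,
      (B,C));
  TREE tree3 = ((A,C),A);
END;
"""

# TREE, a name, an equals sign, then everything up to the next semi-colon.
# [^;]+ also matches newlines, so trees spanning several lines are captured.
# \s+ after TREE keeps the pattern from matching inside "BEGIN TREES;".
tree_pattern = re.compile(r"\bTREE\s+(\S+)\s*=\s*([^;]+);")

trees = []
for name, newick in tree_pattern.findall(nexus_text):
    # Remove whitespace (including line breaks) inside the tree
    trees.append("%s = %s" % (name, "".join(newick.split())))

print("Trees in your file:")
for tree in trees:
    print(tree)
```

To run this on a real file, the string would be replaced by the contents of the file the user names, for example with open(filename).read(). Stripping all whitespace assumes that taxon names contain no spaces.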

Lab Report

For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:

Target Date: 22 February 2016
Title: Lab 4:
Name & Email:

Purpose: Give a summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.

Using Rosalind

This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.