Topics: Useful string methods, Counting Bases, Regular Expressions, Finding motifs in sequences.
In lecture, we introduced some very useful methods for strings:
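As a quick refresher, the sketch below exercises a few built-in string methods that are handy for sequence work. The selection (upper(), count(), find(), replace()) and the example sequence are illustrative assumptions, not necessarily the exact list covered in lecture.

```python
# Illustrative sequence (made up for this example).
dna = "atgcgcgtatagc"

# upper() returns a new string with every letter capitalized.
dna = dna.upper()

# count() returns the number of non-overlapping occurrences of a substring.
num_g = dna.count("G")
num_c = dna.count("C")

# find() returns the index of the first occurrence, or -1 if it is absent.
first_tata = dna.find("TATA")
missing = dna.find("AAAA")

# replace() returns a new string with every occurrence substituted.
rna = dna.replace("T", "U")

print("G: %d, C: %d" % (num_g, num_c))
print("TATA found at index %d; AAAA gives %d" % (first_tata, missing))
print(rna)
```

Strings are immutable, so each of these methods returns a new value and leaves the original string unchanged.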
One hypothesis is that, since C and G form 3 hydrogen bonds while A and T form only 2, the stronger C-G pairs will occur more frequently in thermophiles (organisms that thrive in extremely hot environments). This is a much-debated question (see Zeldovich et al., 2007 and Li et al., 2014) whose answer varies depending on the genomic region examined. Since whole-genome files are quite large, let's run a small test using FASTA sequence files for small regions of 2 thermophiles and 2 mesophiles (downloaded from EnsemblBacteria and stored as a zip file).
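A minimal sketch of how such a GC-content comparison could be set up. The helper names read_fasta and gc_content are hypothetical, and the two short records are made-up stand-ins for the downloaded FASTA files, so the numbers say nothing about real thermophiles or mesophiles.

```python
def read_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence.

    `lines` can be an open file or any iterable of text lines.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines are collected under the most recent header.
            records[name].append(line.upper())
    return dict((key, "".join(parts)) for key, parts in records.items())


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (0.0 if empty)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))


# Made-up records standing in for the contents of real FASTA files.
example_lines = [
    ">toy_thermophile", "GCGCGGCCAT", "GCGC",
    ">toy_mesophile", "ATATTAGCAT", "ATAT",
]
sequences = read_fasta(example_lines)
for name in sorted(sequences):
    print("%s: GC fraction %.3f" % (name, gc_content(sequences[name])))
```

To run this on a real file, pass an open file handle to read_fasta() in place of example_lines.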
Regular expressions are extremely powerful for finding patterns that vary in length. For example, to find every run of A's and T's that is at least 4 bases long, we can use a regular expression:
    import re

    dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"
    runs = re.findall("[AT]{4,100}", dna)   # runs of A or T, 4 to 100 bases long
    print(runs)

To check whether a pattern occurs in a string, we can use:
    if re.search(r"GC[ATGC]GC", dna):
        print("restriction site found!")

(More examples are available from Python for Biologists.)
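The re library is distributed with Python, so there is nothing extra to install. We will use two of its functions, both seen in the examples above: re.search(), which returns a match object for the first place a pattern matches (or None if it never matches), and re.findall(), which returns a list of all non-overlapping matches.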
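We often want more information than just whether a pattern occurred. To find the starting (and stopping) location of a match, we can use the match object that re.search() returns. Its most useful methods are group(), which returns the matched text; start(), the index where the match begins; and end(), the index just past where the match ends.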
From our example above, we could store the match object:
    m = re.search(r"GC[ATGC]GC", dna)
    print("The matching string is %s" % m.group())
    print("Match starts at %d" % m.start())
    print("Match ends at %d" % m.end())
These are more general (and more powerful) tools than the string methods above. In many cases, either can be used. For finding approximate matches or matches of varying lengths, using regular expressions is much easier. Today's lab has some simple uses of regular expressions. We will go into more depth when we dive into the genome assembly in the second half of the course.
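To make the comparison concrete, here is a small sketch that locates a fixed motif with find() and a variable-length pattern with re.finditer(). Note that finditer() is a standard re function not otherwise used in this lab, and the motif choices are illustrative. The dna string is the one from the earlier examples.

```python
import re

dna = "ACTGCATTATATCGTACGAAAGCTGCTTATACGCGCG"

# Fixed motif: str.find() returns the index of the first exact match, or -1.
fixed_start = dna.find("GCTGC")

# Variable motif: runs of A or T that are at least 4 bases long.
# finditer() yields one match object per non-overlapping match.
run_spans = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"[AT]{4,}", dna)]

print("GCTGC first occurs at index %d" % fixed_start)
for text, start, end in run_spans:
    print("%s spans %d to %d" % (text, start, end))
```

find() works well when the exact text is known in advance; the regular expression is the easier tool once the length of the match can vary.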
    #NEXUS
    BEGIN TAXA;
      TAXLABELS A B C;
    END;
    BEGIN TREES;
      TREE tree1 = ((A,B),C);
      TREE tree2 = (A,(B,C));
      TREE tree3 = ((A,C),A);
    END;
Write a program that takes a nexus file as input and prints all the trees it contains to the screen. Your program should print the name of each tree, followed by the equals sign and the tree itself. You should not print the preceding "TREE" or the ending semicolon (although both are very useful for locating trees in the file). For example, given the nexus file above, your program would print:
    Welcome to my tree-hunting program!
    Please enter the name of your nexus file: sample.nex
    Trees in your file:
    tree1 = ((A,B),C)
    tree2 = (A,(B,C))
    tree3 = ((A,C),A)
Hints:
Note: you can implement this program with either regular expressions or find(); use whichever makes more sense to you.
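As a hedged starting point for the regular-expression route, the sketch below pulls tree statements out of text that is already held in a string. The function name extract_trees and the pattern are assumptions made for illustration. The sketch assumes each tree statement sits on a single line with the keyword TREE in upper case. Reading the file and printing the welcome messages are left to you.

```python
import re

# Assumed layout: the keyword TREE, a name, '=', the tree, then ';'.
# Group 1 captures the name and group 2 captures the tree itself.
TREE_PATTERN = re.compile(r"^\s*TREE\s+(\S+)\s*=\s*([^;]+);", re.MULTILINE)


def extract_trees(text):
    """Return a list of 'name = tree' strings found in nexus-style text."""
    return ["%s = %s" % (name, tree.strip())
            for name, tree in TREE_PATTERN.findall(text)]


sample = """#NEXUS
BEGIN TREES;
  TREE tree1 = ((A,B),C);
  TREE tree2 = (A,(B,C));
END;
"""
for line in extract_trees(sample):
    print(line)
```

The same result can be reached with find() by locating "TREE" and ";" on each line and slicing out the text between them.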
For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:
Target Date: 22 February 2016
Title: Lab 4:
Name & Email:
Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.
This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.