Lab 6, RGGS 670: Algorithmic Approaches to Biological Data, Spring 2016

Graphing Mathematical Functions:

Last week, we used the basemap module of matplotlib. Today's lab focuses on a more common package, pyplot and some additional ways to map data using basemap The pyplot module of matplotlib provides lots of useful ways to plot data to the screen. Let's use it to answer the question, which grows faster:

y = log(x) or y = √ x ?

To test out this question, we will write a program that:

Uses the math and plotting libraries.
Sets up a list of numbers (x-values) for our functions.
Computes the y-values of our numbers for our functions.
Creates plots of the two functions.
Shows the plots in a separate graphics window.

Let's add in the Python code that for each of these steps:

Uses the math and plotting libraries.
```
import math	
import matplotlib.pyplot as plt
	
```
Since it's unwieldly to type "matplotlib.pyplot" before every function we'd like to use from that library, instead we'll use the common abbreviation of "plt". With this, we can plt.plot(), instead of matplotlib.pyplot.plot().
Sets up a list of numbers (x-values) for our functions.
```
x = range(1,101)
```
Remember: Python starts counting at 0 and goes up to, but not including the 101. So, this creates the list [1,2,...,100].
Computes the y-values of our numbers for our functions.
```
y1 = []
for i in x:
   y = math.log(i)
   y1.append(y)
y2 = []
for i in x:
    y = math.sqrt(i)
    y2.append(y)   
```
We need two separate lists since we have two separate functions to graph.
How could you rewrite the above using list comprehensions?
Creates plots of the two functions.
```
plt.plot(x,y1,label='y1 = log(x)')
plt.plot(x,y2,label='y2 = sqrt(x)')
plt.legend()
```
Creates the plot for safe keeping but does not display it until told to (see next lines).
Shows the plots in a separate graphics window.
```
plt.show()
```
This line pops up the new graphics window to display the plots.

From your plots, which do you think grows faster: log(x) or sqrt(x)?

The `numpy` Module

The numpy is a module for numerical analysis and is part of scipy. It is distributed with anaconda, so, is now on your machines! It's main focus is on linear algebra (manipulating of vectors and matrices) and also has differential equation solvers (i.e. it will integrate functions for you, numerically).

It is commonly used with the abbreviation, np:

import numpy as np
import matplotlib.pyplot as plt

Numpy's main object is a homogeneous multidimensional array. That is, sequences of similar objects in most any dimension. You can access the elements in the arrays by tuples of positive numbers. For example, if you have a 10 x 10 grid of numbers, it could be stored in a 2-dimensional array, and each element could be accessed by specifying the row and column using a tuple: (1,2) would give the element at row 1 and column 2. The number of dimensions is called the rank of the array and the dimensions are often called axes.

Try the following at the prompt:

import numpy as np
import matplotlib.pyplot as plt
a = np.array([2,3,4])		#Sets up a 1-dimensional array with 3 elements
print a
b = np.array(range(100))	#Not common but does work
print b
c = np.linspace(0,99,100)	#More common, linearly space 100 numbers between 0 and 99, inclusive
print c

Let's use these commands to create the simple plots from above:

x = np.linspace(1,100,100)
y1 = np.log(x)
y2 = np.sqrt(x)

We need to use numpy's log and square root functions since those can handle arrays as inputs (the ones in the math library are expecting just single numbers, not np.arrays).

We can plot these in the same way as before:

plt.plot(x,y1)
plt.plot(x,y2)
plt.show()

Side Note: As we just demonstrated, the matplotlib library accepts both lists of numbers and arrays of numbers from the numpy library. We will go into more detail about arrays in the next lab.

Challenges

Using the Python program you wrote above, try the following:

Modify your program to plot points from 1 to 1001. Which function is larger at x=1000?
pyplot has many ways to customize your plots. Many follow the same format as those we used in basemap. Using the pyplot documentation, change your plot to show the plots in different colors and with dashed and dotted line styles.

Plotting Data:

We can use the same techniques to plot data. As a warm-up, download the scatter_plot.py. Run the program, and then, with a partner, figure out what each of line of the program does.

Next, Let's focus on the question: "Has Lyme Disease Increased?" and examine data from the Center for Disease Control (CDC) to answer that question. Let's start with the tri-state area. Here are the years and occurrences:

years = [2003,2004,2005,2006,2007,2008,2009,2010,2011]
ny = [5399,5100,5565,4460,4165,5741,4134,2385,3118]
nj = [2887,2698,3363,2432,3134,3214,4598,3320,3398]
ct = [1403,1348,1810,1788,3058,2738,2751,1964,2004]

To plot New York data as a `scatter plot' (dots at each (x,y) point), we add the commands:

import matplotlib.pyplot as plt #Library of plotting functions
plt.scatter(years, ny)
plt.show()

Challenges:

Add to your program, commands that will plot also the New York and Connecticut data. (Hint: set up each as a scatter plot, and then use a show() to display all at once).
When displaying multiple data sets on the same plot, adding colors and labels help distinguish the different data sets. For example, to add a label for New York:
```
matplotlib.pyplot.plot(years, ny, label='NY')
```
Add in labels for New Jersey and Connecticut data. You can then display a legend by adding the command:
```
plt.legend()
```
Add axis labels and a title:
```
plt.title("Lyme Disease in NY, NJ, & CT")
plt.xlabel('Years')
plt.ylabel('Number of Cases') 
```

Plotting Data from Files:

Often there is too much data to type into your program. In these cases, it is easier to read in the information from a file. Below is a mixture of novel and previously used commands for accessing file from data and strings. Try to puzzle each one out on paper and then try in Python.

The data file statesSummary.csv is from the CDC. Before starting the program, open up the csv file and see what it looks like.

Use the pyplot and csv libraries as well as numpy (nice math library):
```
import matplotlib.pyplot as plt 
import numpy as np
import csv
```
Open the file:
```
infile = open('statesSummary.csv','r')
```

Read the first line separately to pull out the years (they're the column headers in the csv file) and store them in a list years

infile = open('statesSummary.csv','rU')
reader = csv.reader(infile)
yearLine = reader.next()
years = [int(w) for w in yearLine[1:]]

Next, take the first 5 lines of the file, split them into individual numbers, and store to be used in the plot. Note that the first column has the name of the state and should be stored separately from the data to make plotting easier.
```
for i in range(5):
     line = reader.next()
     stateName = line[0]
     stateValues = [int(w) for w in line[1:]]
     color = (0.0, i/5.0, 0.0)   
     plt.scatter(years, stateValues,
           c=color, label=stateName)
```

Lastly, set up and display the plot:

plt.title("Cases of Lyme Disease") 
plt.xlabel('Years')                
plt.ylabel('Number of Cases')      
plt.legend(loc = 2,
           fontsize = 'x-small')
plt.show()

Challenges:

Try setting the colors by using the percentage of red, green, and blue:

color = (1.0,0.0,i/5.0)	#Color tuple:  100% red, 0% green, and blue increasing with i

What happens?

Modify your program so that it will make a (very crowded) plot of all 50 states.

Shapes of Regions

Since the plot of all 50 states was quite crowded, let's plot the colors to a map. We will use basemap, as we did last week. In addition, we will need the outline of the states, which are stored in 3 files in the basemap examples directory: st99_d00.dbf, st99_d00.shp, and st99_d00.shx.

The basemap webpage has an example of coloring in states by population, fillstates.py. Once you have the 3 files with the shapes of the states, you can run this program to see the map (don't worry to much about all the details, we will go through in detail a simpler one first).

We will use a simpler version of it to map the Lyme Disease data, statesFilled.py.

Try running it. What happens?
Choose two more states and fill in colors for them.
How can we represent the Lyme disease data on a map?

Mapping Regions

Let's combine the two programs together so that we're filling in states with a color. We will make the state with the highest incidence of Lyme disease the darkest color, and the lowest the lightest color. To keep the visualization folks happy, we will use a gradient of a single color ('rainbow' gradients are misleading).

For each state, we will need the total number of incidences. We start out as before:

import matplotlib.pyplot as plt 
import numpy as np
import csv
infile = open('statesSummary.csv','r')
reader = csv.reader(infile)
yearLine = reader.next()
years = [int(w) for w in yearLine[1:]]

for each state, we'll save the name and total to a list:

stateNames = []
stateTotals = []
for row in reader:
     stateNames.append(row[0])
     stateTotals.append(sum([int(r) for r in row[1:]]))

Note: The use of two 'parallel' arrays, stateNames and stateTotals, is not the best programming practice. Instead, since the information is linked (i.e. the ith state total is for the ith state name), we should store them in a linked way. In the future, we will (using a dictionary which will be covered in next week's lecture), but for now, this works!

We will scale every state total to be a fraction of the highest total:

maxCases = float(max(stateTotals))
scaledTotals = [i/maxCases for i in stateTotals]

Now, let's add in the plotting of each state. The first part is as before:

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon

# create the map
map = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
        projection='lcc',lat_1=33,lat_2=45,lon_0=-95)

# load the shapefile, use the name 'states'
map.readshapefile('st99_d00', name='states', drawbounds=True)

ax = plt.gca() # get current axes instance

# collect the state names from the shapefile attributes so we can
# look up the shape obect for a state by it's name
names = []
for shape_dict in map.states_info:
    names.append(shape_dict['NAME'])

What changes is how we add colors to each state. The line c = ... sets the color to be 100% red, a percentage of green that's based on the scaled totals, and 100% blue. When the scaled total for a state is low, this is close to 100% red, 100% green, and 100% blue, which is white (on the computer, colors mix like light, instead of the traditional paint-- that is, as you add more, instead of getting darker (like paint), it gets lighter). When the scaled total for a state is high, the color is still 100% red and blue but the green decreases, so, the color appears more purple:

#For each state that we have Lyme Disease data:
for i in range(len(stateNames)):
     print "Plotting", stateNames[i]
     seg = map.states[names.index(stateNames[i])]
     c = (1.0,1.0-scaledTotals[i],1.0)
     poly = Polygon(seg, facecolor=c,edgecolor='black')
     ax.add_patch(poly)
     
plt.show()

(The whole file is in lymeMapped.py).

Challenges:

Make plots for the data only from the last year in the file: 2011.
Make a plot that shows the net increase in cases from 2010 to 2011 for each state.
Hint: Look at the CSV file and describe the steps you would need to compute this, then translate into Python.
Some states might have a decrease (which will give a negative number for the difference). There are several ways to handle this:
- Take the absolute value (abs()) and just map the change.
- Set any negative values to 0. To do this with a list comprehension: [0 if i < 0 else i for i in myList]
- Check for negative values when setting the color. If the change is negative, use a red scaled color to fill in the state, and if positive use a green scaled color.
(Optional): Normalize the data by population size. That is, plot for each state the per capita incidence rates (fillstates.py has population for each state).