Topics: Using matplotlib & numpy: plotting mathematical functions and arrays

Today's lab focuses on other features of the popular matplotlib module: in particular, drawing functions and plotting data.

Graphing Mathematical Functions:

Last week, we used the basemap module of matplotlib. Today's lab focuses on a more common package, pyplot and some additional ways to map data using basemap The pyplot module of matplotlib provides lots of useful ways to plot data to the screen. Let's use it to answer the question, which grows faster:
y = log(x) or y = √ x ?

To test out this question, we will write a program that:

  1. Uses the math and plotting libraries.
  2. Sets up a list of numbers (x-values) for our functions.
  3. Computes the y-values of our numbers for our functions.
  4. Creates plots of the two functions.
  5. Shows the plots in a separate graphics window.
Let's add in the Python code that for each of these steps:
  1. Uses the math and plotting libraries.
    import math	
    import matplotlib.pyplot as plt
    	
    Since it's unwieldly to type "matplotlib.pyplot" before every function we'd like to use from that library, instead we'll use the common abbreviation of "plt". With this, we can plt.plot(), instead of matplotlib.pyplot.plot().
  2. Sets up a list of numbers (x-values) for our functions.
    x = range(1,101)
    
    Remember: Python starts counting at 0 and goes up to, but not including the 101. So, this creates the list [1,2,...,100].
  3. Computes the y-values of our numbers for our functions.
    y1 = []
    for i in x:
       y = math.log(i)
       y1.append(y)
    y2 = []
    for i in x:
        y = math.sqrt(i)
        y2.append(y)   
    
    We need two separate lists since we have two separate functions to graph.

    How could you rewrite the above using list comprehensions?

  4. Creates plots of the two functions.
    plt.plot(x,y1,label='y1 = log(x)')
    plt.plot(x,y2,label='y2 = sqrt(x)')
    plt.legend()
    
    Creates the plot for safe keeping but does not display it until told to (see next lines).
  5. Shows the plots in a separate graphics window.
    plt.show()
    
    This line pops up the new graphics window to display the plots.

From your plots, which do you think grows faster: log(x) or sqrt(x)?

The numpy Module

The numpy is a module for numerical analysis and is part of scipy. It is distributed with anaconda, so, is now on your machines! It's main focus is on linear algebra (manipulating of vectors and matrices) and also has differential equation solvers (i.e. it will integrate functions for you, numerically).

It is commonly used with the abbreviation, np:

import numpy as np
import matplotlib.pyplot as plt

Numpy's main object is a homogeneous multidimensional array. That is, sequences of similar objects in most any dimension. You can access the elements in the arrays by tuples of positive numbers. For example, if you have a 10 x 10 grid of numbers, it could be stored in a 2-dimensional array, and each element could be accessed by specifying the row and column using a tuple: (1,2) would give the element at row 1 and column 2. The number of dimensions is called the rank of the array and the dimensions are often called axes.

Try the following at the prompt:

import numpy as np
import matplotlib.pyplot as plt
a = np.array([2,3,4])		#Sets up a 1-dimensional array with 3 elements
print a
b = np.array(range(100))	#Not common but does work
print b
c = np.linspace(0,99,100)	#More common, linearly space 100 numbers between 0 and 99, inclusive
print c

Let's use these commands to create the simple plots from above:

x = np.linspace(1,100,100)
y1 = np.log(x)
y2 = np.sqrt(x)
We need to use numpy's log and square root functions since those can handle arrays as inputs (the ones in the math library are expecting just single numbers, not np.arrays).

We can plot these in the same way as before:

plt.plot(x,y1)
plt.plot(x,y2)
plt.show()

Side Note: As we just demonstrated, the matplotlib library accepts both lists of numbers and arrays of numbers from the numpy library. We will go into more detail about arrays in the next lab.

Challenges

Using the Python program you wrote above, try the following:

Plotting Data:

We can use the same techniques to plot data. As a warm-up, download the scatter_plot.py. Run the program, and then, with a partner, figure out what each of line of the program does.

Next, Let's focus on the question: "Has Lyme Disease Increased?" and examine data from the Center for Disease Control (CDC) to answer that question. Let's start with the tri-state area. Here are the years and occurrences:

years = [2003,2004,2005,2006,2007,2008,2009,2010,2011]
ny = [5399,5100,5565,4460,4165,5741,4134,2385,3118]
nj = [2887,2698,3363,2432,3134,3214,4598,3320,3398]
ct = [1403,1348,1810,1788,3058,2738,2751,1964,2004]

To plot New York data as a `scatter plot' (dots at each (x,y) point), we add the commands:

import matplotlib.pyplot as plt #Library of plotting functions
plt.scatter(years, ny)
plt.show()

Challenges:

Plotting Data from Files:

Often there is too much data to type into your program. In these cases, it is easier to read in the information from a file. Below is a mixture of novel and previously used commands for accessing file from data and strings. Try to puzzle each one out on paper and then try in Python.

The data file statesSummary.csv is from the CDC. Before starting the program, open up the csv file and see what it looks like.

Challenges:

Shapes of Regions

Since the plot of all 50 states was quite crowded, let's plot the colors to a map. We will use basemap, as we did last week. In addition, we will need the outline of the states, which are stored in 3 files in the basemap examples directory: st99_d00.dbf, st99_d00.shp, and st99_d00.shx.

The basemap webpage has an example of coloring in states by population, fillstates.py. Once you have the 3 files with the shapes of the states, you can run this program to see the map (don't worry to much about all the details, we will go through in detail a simpler one first).

We will use a simpler version of it to map the Lyme Disease data, statesFilled.py.

Mapping Regions

Let's combine the two programs together so that we're filling in states with a color. We will make the state with the highest incidence of Lyme disease the darkest color, and the lowest the lightest color. To keep the visualization folks happy, we will use a gradient of a single color ('rainbow' gradients are misleading).

For each state, we will need the total number of incidences. We start out as before:

import matplotlib.pyplot as plt 
import numpy as np
import csv
infile = open('statesSummary.csv','r')
reader = csv.reader(infile)
yearLine = reader.next()
years = [int(w) for w in yearLine[1:]]
for each state, we'll save the name and total to a list:
stateNames = []
stateTotals = []
for row in reader:
     stateNames.append(row[0])
     stateTotals.append(sum([int(r) for r in row[1:]]))

Note: The use of two 'parallel' arrays, stateNames and stateTotals, is not the best programming practice. Instead, since the information is linked (i.e. the ith state total is for the ith state name), we should store them in a linked way. In the future, we will (using a dictionary which will be covered in next week's lecture), but for now, this works!

We will scale every state total to be a fraction of the highest total:

maxCases = float(max(stateTotals))
scaledTotals = [i/maxCases for i in stateTotals]
Now, let's add in the plotting of each state. The first part is as before:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon

# create the map
map = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
        projection='lcc',lat_1=33,lat_2=45,lon_0=-95)

# load the shapefile, use the name 'states'
map.readshapefile('st99_d00', name='states', drawbounds=True)

ax = plt.gca() # get current axes instance

# collect the state names from the shapefile attributes so we can
# look up the shape obect for a state by it's name
names = []
for shape_dict in map.states_info:
    names.append(shape_dict['NAME'])
What changes is how we add colors to each state. The line c = ... sets the color to be 100% red, a percentage of green that's based on the scaled totals, and 100% blue. When the scaled total for a state is low, this is close to 100% red, 100% green, and 100% blue, which is white (on the computer, colors mix like light, instead of the traditional paint-- that is, as you add more, instead of getting darker (like paint), it gets lighter). When the scaled total for a state is high, the color is still 100% red and blue but the green decreases, so, the color appears more purple:
#For each state that we have Lyme Disease data:
for i in range(len(stateNames)):
     print "Plotting", stateNames[i]
     seg = map.states[names.index(stateNames[i])]
     c = (1.0,1.0-scaledTotals[i],1.0)
     poly = Polygon(seg, facecolor=c,edgecolor='black')
     ax.add_patch(poly)
     
plt.show()   
(The whole file is in lymeMapped.py).

Challenges:

Lab Report

For each lab, you should submit a lab report by the target date to: kstjohn AT amnh DOT org. The reports should be about a page for the first labs and contain the following:

Target Date: 7 March 2016
Title: Lab 6:
Name & Email:

Purpose: Give summary of what was done in this lab.
Procedure: Describe step-by-step what you did (include programs or program outlines).
Results: If applicable, show all data collected. Including screen shots is fine (can capture via the Grab program).
Discussion: Give a short explanation and interpretation of your results here.

Using Rosalind

This course will use the on-line Rosalind system for submitting programs electronically. The password for the course has been sent to your email. Before leaving lab today, complete the first two challenges.