 Graphing & Plotting Recipes

MHC 250/Seminar 4: Shaping the Future of New York City Spring 2017

Graphing Mathematical Functions:

The pyplot module of matplotlib provides lots of useful ways to plot data to the screen. Let's use it to answer the question, which grows faster:
y = log(x) or y = √x?

To test out this question, we will write a program that:

1. Uses the math and plotting libraries.
2. Sets up a list of numbers (x-values) for our functions.
3. Computes the y-values of our numbers for our functions.
4. Creates plots of the two functions.
5. Shows the plots in a separate graphics window.
Let's add in the Python code for each of these steps:
1. Uses the math and plotting libraries.
import math
import matplotlib.pyplot as plt

Since it's unwieldy to type "matplotlib.pyplot" before every function we'd like to use from that library, we'll use the common abbreviation "plt" instead. With this, we can write plt.plot() instead of matplotlib.pyplot.plot().
2. Sets up a list of numbers (x-values) for our functions.
x = range(1,101)

Remember: range(1,101) starts at 1 and goes up to, but does not include, 101. So, this creates the numbers 1, 2, ..., 100. (If you leave out the starting value, Python starts counting at 0.)
3. Computes the y-values of our numbers for our functions.
y1 = []
for i in x:
    y = math.log(i)
    y1.append(y)
y2 = []
for i in x:
    y = math.sqrt(i)
    y2.append(y)

We need two separate lists since we have two separate functions to graph.
4. Creates plots of the two functions.
plt.plot(x,y1,label='y1 = log(x)')
plt.plot(x,y2,label='y2 = sqrt(x)')
plt.legend()

These commands create the plots and store them, but nothing is displayed until we ask for it (see the next step).
5. Shows the plots in a separate graphics window.
plt.show()

This line pops up the new graphics window to display the plots.
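
As a quick check on steps 2 and 3 above, the sketch below builds the same lists with list comprehensions (a more compact alternative to the for loops, not the form used in this lab) and prints the endpoints of the x-values. It is written for Python 3 and does no plotting, so it runs without matplotlib.

```python
import math

# Same x-values as step 2; list() makes an explicit list in Python 3.
x = list(range(1, 101))

# List comprehensions: a compact alternative to the append loops in step 3.
y1 = [math.log(i) for i in x]
y2 = [math.sqrt(i) for i in x]

# The range starts at 1 and stops before 101.
print(x[0], x[-1], len(x))
# Each x-value has one matching y-value in each list.
print(len(y1), len(y2))
```

If the lengths of x, y1, and y2 ever differ, plt.plot() will refuse to draw the curve, so this is a handy thing to print when a plot fails.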

From your plots, which do you think grows faster: log(x) or sqrt(x)?

Challenges

Using the Python program you wrote above, try the following:

• Modify your program to plot points from 1 to 1000 (that is, use range(1,1001)). Which function is larger at x=1000?
• pyplot has many ways to customize your plots. Using the pyplot documentation, change your plot to show the plots in different colors and with dashed and dotted line styles.

Plotting Data:

We can use the same techniques to plot data. As a warm-up, download scatter_plot.py. Run the program, and then, with a partner, figure out what each line of the program does.

Next, let's focus on the question "Has Lyme Disease Increased?" and examine data from the Centers for Disease Control and Prevention (CDC) to answer it. Let's start with the tri-state area. Here are the years and occurrences:

years = [2003,2004,2005,2006,2007,2008,2009,2010,2011]
ny = [5399,5100,5565,4460,4165,5741,4134,2385,3118]
nj = [2887,2698,3363,2432,3134,3214,4598,3320,3398]
ct = [1403,1348,1810,1788,3058,2738,2751,1964,2004]

To plot the New York data as a 'scatter plot' (dots at each (x,y) point), we add the commands:

import matplotlib.pyplot as plt #Library of plotting functions
plt.scatter(years, ny)
plt.show()
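
plt.scatter() needs one y-value for every x-value, so a list with a missing or extra entry makes the call fail. Below is a minimal sketch of a check to run before plotting, using the tri-state lists above. The dictionary name data is only for illustration and is not part of the lab's program.

```python
years = [2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011]
ny = [5399, 5100, 5565, 4460, 4165, 5741, 4134, 2385, 3118]
nj = [2887, 2698, 3363, 2432, 3134, 3214, 4598, 3320, 3398]
ct = [1403, 1348, 1810, 1788, 3058, 2738, 2751, 1964, 2004]

# Hypothetical helper structure: state label -> list of yearly counts.
data = {'NY': ny, 'NJ': nj, 'CT': ct}

for name, counts in data.items():
    # Each list should have exactly one count per year.
    assert len(counts) == len(years), name + " has the wrong number of entries"
    print(name, len(counts), "values for", len(years), "years")
```

A check like this catches typing mistakes in the lists before they turn into a confusing error message from the plotting library.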

Challenges:

• Add commands to your program that will also plot the New Jersey and Connecticut data. (Hint: set up each as a scatter plot, and then use a single show() to display them all at once.)
• When displaying multiple data sets on the same plot, adding colors and labels helps distinguish the different data sets. For example, to add a label for New York:
plt.plot(years, ny, label='NY')

Add in labels for New Jersey and Connecticut data. You can then display a legend by adding the command:
plt.legend()

Add axis labels and a title:
plt.title("Lyme Disease in NY, NJ, & CT")
plt.xlabel('Years')
plt.ylabel('Number of Cases')

• The historical population of New York City, by borough, has been organized into lists in nycRawTotals.py. Plot the data for each borough and the overall totals.

Plotting Data from Files:

Often there is too much data to type into your program. In these cases, it is easier to read in the information from a file. Below is a mixture of new and previously used commands for accessing data from files and strings. Try to puzzle each one out on paper and then try it in Python.

The data file statesSummary.csv is from the CDC. Before starting the program, open up the csv file and see what it looks like.

• Use the pyplot libraries as well as numpy (nice math library):
import matplotlib.pyplot as plt
import numpy as np

• Open the file:
infile = open('statesSummary.csv','r')

• Read the first line separately to pull out the years (they're the column headers in the csv file) and store them in a list called years:
yearLine = infile.readline()
yearWords = yearLine.split(",")
years = []
for w in yearWords[1:]:
    years.append(int(w))

• Next, read the next 5 lines of the file (the first 5 states), split each into individual numbers, and store them to be used in the plot. Note that the first column has the name of the state and should be stored separately from the data to make plotting easier.
for i in range(5):
    line = infile.readline()
    words = line.split(",")
    stateName = words[0]
    stateValues = []
    for w in words[1:]:
        stateValues.append(int(w))
    color = np.random.rand(3)
    plt.scatter(years, stateValues,
                c=color, label=stateName)

• Lastly, set up and display the plot:
plt.title("Cases of Lyme Disease")
plt.xlabel('Years')
plt.ylabel('Number of Cases')
plt.legend(loc = 2, fontsize = 'x-small')
plt.show()
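
To try the parsing steps above without the data file, the sketch below runs the same split-and-convert logic on a small in-memory stand-in. The rows in sample are invented placeholders, not CDC figures, and the layout of statesSummary.csv (a header row of years, then one row per state) is assumed from the description above.

```python
import io

# Invented placeholder rows; NOT real CDC data.
sample = io.StringIO(
    "State,2003,2004,2005\n"
    "StateA,10,20,30\n"
    "StateB,5,15,25\n"
)

# First line: skip the first column header, convert the rest to ints.
yearLine = sample.readline()
years = [int(w) for w in yearLine.split(",")[1:]]

# Remaining lines: the first column is the name, the rest are counts.
rows = {}
for line in sample:
    words = line.split(",")
    rows[words[0]] = [int(w) for w in words[1:]]

print(years)
print(rows)
```

Note that int() ignores the newline at the end of the last field, which is why the lines do not need to be stripped before converting.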

Challenges:

• Modify your program to print out the summary data for the first 10 states.
• Modify your program to print out only the data from 2005 onward (drop the early data from your plot).

Harder Challenges:

• Modify your program to make the size of each point proportional to the number of cases reported.
(Hint: you need an array of areas, and you can use the same array more than once in the scatter command.)
• Modify your program to also display the total cases every year. (Hint: you will need to create an additional array to store the totals.)

CSV Library

Above, we used Python's general file I/O. Another option is the csv package, which is designed specifically for handling CSV files and takes care of much of the line parsing that we did by hand above. Included below is a simple example (csvEx.py):

#Simple use of csv module.
#Assumes "in.csv" is in the same folder.
#Katherine St. John
#Spring 2016

import csv

#Using the dictionary reader to access by column names
f = open("in.csv")
reader = csv.DictReader(f)
m = [row['Homework'] for row in reader if int(row['Homework']) < 90]
f.close()
print(m[-1])

#Using the regular csv reader (ignoring first line with column names).
#Note the use of 'with' for files:

with open("in.csv") as f: