Classwork: correlation & regression (pandas & seaborn), and github

MHC 250/Seminar 4:
Shaping the Future of New York City
Spring 2017

Useful Packages: Pandas & Seaborn

The second part of the lab introduces two popular packages for managing and visualizing data. Before starting the next section, check whether they are installed by typing the following at the Python shell (in spyder, idle, or your favorite Python interface):

	import pandas as pd

If you get an error that the library is not found, open up a terminal and use conda to install it:

	conda install pandas

Pandas, the Python Data Analysis Library, is an elegant, open-source package for extracting, manipulating, and analyzing data, especially data stored in 2D arrays (like spreadsheets). It incorporates most of the Python constructs and libraries that we have seen thus far.

Next, check if seaborn is installed:

	import seaborn as sns

If you get an error that the library is not found, open up a terminal and use conda to install it:

	conda install seaborn

Seaborn is a Python visualization library based on matplotlib. It provides beautiful statistical graphics.
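
Both checks can also be done in one step. The following is a minimal sketch, not part of the original lab: the helper name missing_packages is made up for illustration, and it only reports which libraries Python cannot find (you would still install them with conda as described above).

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    # find_spec() returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["pandas", "seaborn"])
if missing:
    print("Not found, install with conda:", " ".join(missing))
else:
    print("pandas and seaborn are both installed.")
```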

Regression & Correlation

In class, we discussed the uses of regression and correlation. Let's now apply those to a data set that we used in Homework 1: the NY Fed's labor trends for recent college graduates, labor.csv.

Our goal is to see the correlation and add a linear regression to the analysis from the first homework.

Seaborn uses the data structures in pandas as its default. And given how easy it is to use, we will too. The basic structure is a DataFrame, which stores data in a rectangular grid.
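
To make the "rectangular grid" idea concrete, here is a small sketch. The column names and values are made up for illustration and are not taken from labor.csv.

```python
import pandas as pd

# Made-up values for illustration only.
grid = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Rate": [4.0, 5.0, 6.0],
})

print(grid.shape)      # (number of rows, number of columns)
print(grid["Rate"])    # one column, selected by name
print(grid.iloc[0])    # one row, selected by position
```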

Before continuing, work through the first two sections of Pandas Tutorial: DataFrames in Python

Let's use this to visualize the labor data. First, start your file with the standard import statements:

	import numpy as np
	import pandas as pd
	import matplotlib as mpl
	import matplotlib.pyplot as plt
	import seaborn as sns

Next, let's read in the NY Fed data (this assumes that the file is called labor.csv and located in the same directory as your Python program):

	labor = pd.read_csv('labor.csv', skiprows=13)

Remember how the first 13 lines had extra stuff in them? The read_csv() function has an option, skiprows, to skip rows that don't contain data. With a single line, we have now stored the data from labor.csv in the DataFrame labor (instead of the multiple lines it took with regular Python file I/O or the csv library).
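
If you want to see what skiprows does without opening the real file, the sketch below reads a tiny made-up CSV from a string. The three note lines and the data values are hypothetical; only the skiprows option itself comes from the example above.

```python
import io
import pandas as pd

# Hypothetical file contents: 3 lines of notes, then the header row and data.
text = """Notes line 1
Notes line 2
Notes line 3
Major,Unemployment Rate,Underemployment Rate
Example Major A,4.0,40.0
Example Major B,5.0,50.0
"""

# skiprows=3 discards the three note lines, so the fourth line becomes the header.
sample = pd.read_csv(io.StringIO(text), skiprows=3)

print(sample.columns.tolist())   # the column names read from the header row
print(len(sample))               # the number of data rows
```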

To see if this works, try to print the column of majors:

	print("The majors are:", labor["Major"])

How would you print out the unemployment rates?

To compute the correlation between two columns, we select the columns by position (labor.iloc[:,[2,3]] picks the third and fourth columns, since counting starts at 0) and then apply the pandas correlation function, corr():

	print( labor.iloc[:,[2,3]].corr() )

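If we wanted to compute the correlations between all columns, we can just apply the function to the whole DataFrame: labor.corr(). In recent versions of pandas, text columns such as the majors cause an error here; writing labor.corr(numeric_only=True) restricts the calculation to the numeric columns.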
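
To get a feel for what corr() reports, here is a sketch on a tiny made-up DataFrame (not the NY Fed data). It assumes pandas 1.5 or later, where the numeric_only option is available. One column rises exactly with another (correlation 1) and one falls exactly as it rises (correlation -1).

```python
import pandas as pd

# Made-up values for illustration only; these are not the NY Fed figures.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # rises exactly with x
    "z": [8.0, 6.0, 4.0, 2.0],   # falls exactly as x rises
})

# Select two columns by position (1 and 2 are x and y), then correlate them.
pair = toy.iloc[:, [1, 2]].corr()
print(pair)

# Correlate every numeric column; numeric_only=True skips the text column.
everything = toy.corr(numeric_only=True)
print(everything)
```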

In seaborn, making a regression plot is very straightforward:

	sns.regplot(x="Underemployment Rate", y="Median Wage Early Career", data=labor)

Note that we specified the columns by the names that were used in the original CSV file.
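
regplot() draws a scatter plot with a fitted straight line on top, but it does not print the line's slope or intercept. If you want those numbers, one option is a least-squares fit with numpy. The sketch below uses made-up points and is only meant to show the idea; it is not a claim about how seaborn computes its line internally.

```python
import numpy as np

# Made-up points for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # these points lie exactly on y = 2x + 1

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns the coefficients from highest power to lowest.
slope, intercept = np.polyfit(x, y, 1)
print("slope:", slope, "intercept:", intercept)
```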

Additional Challenges

Are underemployment and unemployment rates correlated? Use corr() and regplot() to view the relationship.
Work through the regression plot tutorial. What else is in the tips data set? What other patterns do you see in that data?

github

github is the standard way to share and collaborate on code. It functions much as Google docs does for documents. The first part of today's classwork is to get started on github:

If you do not already have an account, create an account on github.
Work through the github for beginners tutorial.
Work through the github Hello World tutorial.
Submit your github username for Problem #1 on Homework #3. If you have more than one account, submit the username you plan to use for the programming and project for this course.