# Classwork: correlation & regression (pandas & seaborn), and GitHub

## MHC 250/Seminar 4: Shaping the Future of New York City Spring 2017

### Useful Packages: Pandas & Seaborn

The second part of the lab introduces two popular packages for managing and visualizing data. Before starting the next section, check whether they are installed by typing the following at the Python shell (in Spyder, IDLE, or your favorite Python interface):

```
import pandas as pd
```

If you get an error that the library is not found, open a terminal and use conda to install it:

```
conda install pandas
```

Pandas (the Python Data Analysis Library) is an elegant, open-source package for extracting, manipulating, and analyzing data, especially data stored in 2D arrays (like spreadsheets). It incorporates most of the Python constructs and libraries that we have seen thus far.

Next, check if seaborn is installed:

```
import seaborn as sns
```

If you get an error that the library is not found, open a terminal and use conda to install it:

```
conda install seaborn
```

Seaborn is a Python visualization library based on matplotlib. It provides beautiful statistical graphics.
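
The two installation checks above can also be run in one step. The following is a minimal sketch, not part of the original lab: it uses the standard library's `importlib.util.find_spec` to test whether each package can be located, without raising an error when one is missing. The variable names `packages` and `status` are only for illustration.

```python
import importlib.util

# Packages this lab relies on
packages = ["pandas", "seaborn"]

# find_spec returns None when a top-level package cannot be located
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    if installed:
        print(name, "is installed")
    else:
        print(name, "is missing; try: conda install", name)
```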

### Regression & Correlation

In class, we discussed the uses of regression and correlation. Let's now apply those to a data set that we used in Homework 1: the NY Fed's labor trends for recent college graduates, labor.csv.

Our goal is to compute the correlation and add a linear regression to the analysis from the first homework.

Seaborn uses the data structures in pandas by default and, given how easy they are to use, we will too. The basic structure is a DataFrame, which stores data in a rectangular grid.

Let's use this to visualize the labor data. First, start your file with the standard import statements:

```
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
```

Next, let's read in the NY Fed data (this assumes that the file is called labor.csv and located in the same directory as your Python program):

```
labor = pd.read_csv('labor.csv', skiprows=13)
```

Remember how the first 13 lines had extra material in them? The read_csv() function has an option, skiprows, to skip rows that don't contain data. We have now stored the data from labor.csv in the DataFrame labor in a single line (instead of the multiple lines it took with regular Python file I/O or the csv library).
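
To see what skiprows does without needing labor.csv, here is a small sketch that reads a made-up CSV from a string. The text, column values, and the name `demo` are invented for illustration and are not the NY Fed data; only the idea (notes first, then a header row, then data) matches the real file.

```python
import io
import pandas as pd

# Made-up stand-in for labor.csv: 2 lines of notes, then the header and data
raw = """Source: example notes (not real data)
Another line of notes
Major,Unemployment Rate,Underemployment Rate
Major A,4.0,40.0
Major B,5.5,52.5
"""

# Skip the 2 note lines so that the third line is used as the header
demo = pd.read_csv(io.StringIO(raw), skiprows=2)

print(demo.columns.tolist())
print(demo.shape)
```

With the real file, the same call uses skiprows=13 because the notes take up 13 lines.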

To see if this works, try to print the column of majors:

```
print("The majors are:", labor["Major"])
```

• How would you print out the unemployment rates?

To compute the correlation between two columns, we select the columns (labor.iloc[:,[2,3]] picks the columns at positions 2 and 3, counting from 0) and then apply the pandas correlation function, corr():

```
print( labor.iloc[:,[2,3]].corr() )
```

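If we wanted to compute the correlations between all columns, we could apply the function to the whole DataFrame: labor.corr(). (In pandas 2.0 and later, text columns such as Major must be excluded first, for example with labor.corr(numeric_only=True).)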
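
The sketch below tries both kinds of correlation on a tiny made-up DataFrame, so the expected answers are easy to check by eye. The column names and values are invented for illustration. It uses `select_dtypes` to drop the text column before calling `corr()`, which is one way (an assumption here, not the lab's prescribed method) to handle non-numeric columns across pandas versions.

```python
import pandas as pd

# Made-up data: "Rate 2" rises exactly with "Rate 1"; "Wage" falls exactly
demo = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Rate 1": [1.0, 2.0, 3.0, 4.0],
    "Rate 2": [2.0, 4.0, 6.0, 8.0],
    "Wage": [40.0, 30.0, 20.0, 10.0],
})

# Correlation between the columns at positions 1 and 2 (counting from 0)
pair = demo.iloc[:, [1, 2]].corr()
print(pair)

# Correlations between all numeric columns (the text column is dropped first)
all_numeric = demo.select_dtypes(include="number").corr()
print(all_numeric)
```

A value near 1 means the two columns rise together, and a value near -1 means one falls as the other rises.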

In seaborn, making a regression plot is very straightforward:

```
sns.regplot(x="Underemployment Rate", y="Median Wage Early Career", data=labor)
```

Note that we specified the columns by the names that were used in the original CSV file.
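
The regression plot draws the fitted line but does not print its slope or intercept. If you want those numbers, one option is a straight-line least-squares fit with NumPy's `polyfit`, which is the kind of line regplot draws by default. The following is a sketch on made-up data; the column names `x` and `y` are placeholders, not columns from labor.csv.

```python
import numpy as np
import pandas as pd

# Made-up data that roughly follows a straight line
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.1, 4.9, 7.2, 8.8, 11.0],
})

# Fit y = slope * x + intercept by least squares (degree-1 polynomial)
slope, intercept = np.polyfit(demo["x"], demo["y"], deg=1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For the labor data, the same call would take the two column names used in the regplot call above in place of `x` and `y`.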