The second part of the lab introduces two popular packages for managing and visualizing data. Before starting the next section, check whether they are installed by typing the following at the Python shell (in spyder, idle, or your favorite Python interface):
import pandas as pd
If you get an error that the library is not found, open up a terminal and use conda to install it:
conda install pandas
Pandas, the Python Data Analysis Library, is an elegant, open-source package for extracting, manipulating, and analyzing data, especially data stored in 2D arrays (like spreadsheets). It incorporates most of the Python constructs and libraries that we have seen thus far.
Next, check if seaborn is installed:
import seaborn as sns
If you get an error that the library is not found, open up a terminal and use conda to install it:
conda install seaborn
Seaborn is a Python visualization library based on matplotlib. It provides beautiful statistical graphics.
In class, we discussed the uses of regression and correlation. Let's now apply those to a data set that we used in Homework 1: the NY Fed's labor trends for recent college graduates, labor.csv.
Our goal is to measure the correlation and add a linear regression to the analysis from the first homework.
Seaborn uses the data structures in pandas as its default, and given how easy they are to use, we will too. The basic structure is a DataFrame, which stores data in rectangular grids.
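To make the idea of a rectangular grid concrete, here is a minimal sketch of a DataFrame built by hand. The column names and values are made up for illustration and are not taken from labor.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from labor.csv
grid = pd.DataFrame({
    "Major": ["Art", "Biology", "Chemistry"],
    "Rate": [5.0, 4.0, 3.5],
})

print(grid.shape)      # (number of rows, number of columns)
print(grid["Major"])   # one column, returned as a pandas Series
print(grid.iloc[0])    # the first row, selected by position
```

Each key of the dictionary becomes a column, and every column has the same number of rows, which is what makes the grid rectangular.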
Let's use this to visualize the labor data. First, start your file with the standard import statements:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
Next, let's read in the NY Fed data (this assumes that the file is called labor.csv and located in the same directory as your Python program):
labor = pd.read_csv('labor.csv', skiprows=13)
Remember how the first 13 lines of the file held extra material rather than data? The read_csv() function has an option, skiprows, to skip rows that don't contain data. We have now stored the data from labor.csv in the DataFrame labor with a single line (instead of the multiple lines it took with regular Python file I/O or the csv library).
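If you want to see what skiprows does without touching labor.csv, the sketch below reads a small made-up file held in a string. The contents of raw and the name toy are assumptions for illustration only; io.StringIO simply lets read_csv() treat the string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file with descriptive text above the data
raw = (
    "Source: made-up example\n"
    "Notes: these two lines are not data\n"
    "Major,Rate\n"
    "Art,5.0\n"
    "Biology,4.0\n"
)

# skiprows=2 drops the two note lines; the next line becomes the header
toy = pd.read_csv(io.StringIO(raw), skiprows=2)
print(toy.columns.tolist())
print(len(toy))
```

The same idea applies to labor.csv, where the number of lines to skip is 13 rather than 2.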
To see if this works, try to print the column of majors:
print("The majors are:", labor["Major"])
To compute the correlation between two columns, we select the columns by position (labor.iloc[:,[2,3]] picks the third and fourth columns, since positions start at 0) and then apply the pandas correlation function, corr():
print( labor.iloc[:,[2,3]].corr() )
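If we wanted to compute the correlations between all columns, we can just apply the function to the whole DataFrame: labor.corr(). In recent versions of pandas (2.0 and later), non-numeric columns such as Major cause an error here, so write labor.corr(numeric_only=True) instead.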
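The selection-then-corr() pattern can be tried on a tiny made-up DataFrame where the answer is known in advance. In this sketch the names toy, pair, and corr_matrix and all of the values are hypothetical; y drops by exactly 2 each time x rises by 1, so the two columns are perfectly negatively correlated.

```python
import pandas as pd

# Hypothetical toy data: y falls by exactly 2 for each step in x
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 8.0, 6.0, 4.0],
})

# Columns at positions 1 and 2 (counting from 0) are x and y
pair = toy.iloc[:, [1, 2]]
corr_matrix = pair.corr()   # Pearson correlation by default
print(corr_matrix)

# A single entry of the matrix can be looked up by column names
print(corr_matrix.loc["x", "y"])
```

The result of corr() is itself a DataFrame: a square table with one row and one column per selected column, and 1.0 along the diagonal.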
In seaborn, making a regression plot is very straightforward:
sns.regplot(x="Underemployment Rate", y="Median Wage Early Career", data=labor)
Note that we specified the columns by the names that were used in the original CSV file.
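The regression plot draws the fitted line but does not print its slope or intercept. If you want those numbers, one option is a degree-1 least-squares fit with numpy, sketched below on made-up values. This is a separate calculation under that assumption, not something seaborn reports, and x and y here are hypothetical rather than columns of labor.

```python
import numpy as np

# Hypothetical toy values, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Degree-1 least-squares fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation between the two arrays
r = np.corrcoef(x, y)[0, 1]

print(slope, intercept, r)
```

To apply the same idea to the lab data, you could pass two numeric columns of labor in place of x and y.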