Program 7, CSci 39542: Data Science, Hunter College

Program 7: Housing Model
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022

Program Description

Program 7: Housing Model. Due noon, Thursday, 17 March.
Learning Objective: to strengthen statistical skills and understanding by computing linear regression and loss functions.
Available Libraries: pandas, numpy and core Python 3.6+.
Data Sources: NYC Department of City Planning (DCP) Housing Database and Neighborhood Tabulation Areas.
Sample Datasets: Housing_Database_by_NTA.csv, NYC_population_by_NTA.csv.

This program continues the analysis from Program 6 of the NYC Housing Database.

The NYC Department of City Planning (DCP) Housing Database contains all approved construction and demolition jobs since 2010. Summary information about it is provided via OpenData NYC. A summary, recorded as net housing units by Neighborhood Tabulation Area, is available at:

https://data.cityofnewyork.us/Housing-Development/Housing-Database-by-NTA/kyz5-72x5

The DCP also provides a summary of population in New York City from the 2000 and 2010 censuses, organized by neighborhood tabulation areas (NTAs):

https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Neighborhood-Tabulatio/swpk-hqdp

In Program 6, we looked at which features are most correlated with the increase in housing units. For this program, we will explore linear models for the whole dataset, as well as for the data separated by borough.

The assignment is broken into the following functions to allow for unit testing:

make_df(housing_file, pop_file): This function takes two inputs:
- housing_file: the name of a CSV file containing housing units from OpenData NYC.
- pop_file: the name of a CSV file containing population counts from OpenData NYC.
The data in the two files are read and merged into a single DataFrame using nta2010 and NTA Code as the keys. If the total is null or Year differs from 2010, that row is dropped. The columns the_geom, boro, and nta2010 are dropped, and the resulting DataFrame is returned. (Hint: if you are getting a "The DataFrame did not match expected output" error, check that your DataFrame columns are in the same order as those of Gradescope's Autograder.)
compute_lin_reg(x,y): This function takes two inputs:
- x: a Series containing numeric values.
- y: a Series containing numeric values.
The series are of the same length and contain numeric values only (all null and non-numeric values have been dropped). The function returns two numeric values: theta_0,theta_1 where
- theta_0 is the y-intercept of the best fitting line for x and y.
- theta_1 is the slope of the best fitting line for x and y.
These are computed as theta_1 = r*(std of y)/(std of x), where r is the correlation coefficient of x and y, and theta_0 = (ave of y) - theta_1*(ave of x) (see Lecture 12 for details).
compute_boro_lr(df,xcol,ycol,boro=["All"]): This function takes four inputs:
- df: a DataFrame.
- xcol: a name of a column of df.
- ycol: a name of a column of df.
- boro: a list containing either the names of boroughs or containing only the string 'All'.
If boro is ['All'], this function behaves identically to compute_lin_reg(df[xcol],df[ycol]). Otherwise, the DataFrame is restricted to rows with Borough in boro and the restricted DataFrame is used, with columns xcol and ycol, to compute a linear regression line, returning two numeric values: theta_0,theta_1 where
- theta_0 is the y-intercept of the best fitting line for x and y.
- theta_1 is the slope of the best fitting line for x and y.
These are computed as theta_1 = r*(std of y)/(std of x), where r is the correlation coefficient of x and y, and theta_0 = (ave of y) - theta_1*(ave of x) (see Lecture 12 for details).
MSE_loss(y_actual,y_estimate): This function takes two inputs:
- y_actual: a Series containing numeric values.
- y_estimate: a Series containing numeric values.
The series are of the same length and contain numeric values only (all null and non-numeric values have been dropped). The function returns the mean square error loss function between y_actual and y_estimate (i.e. the mean of the squares of the differences).
RMSE(y_actual,y_estimate): This function takes two inputs:
- y_actual: a Series containing numeric values.
- y_estimate: a Series containing numeric values.
The series are of the same length and contain numeric values only (all null and non-numeric values have been dropped). The function returns the square root of the mean square error loss function between y_actual and y_estimate (i.e. the square root of the mean of the squares of the differences).
compute_error(y_actual,y_estimate,loss_fnc=MSE_loss): This function takes three inputs:
- y_actual: a Series containing numeric values.
- y_estimate: a Series containing numeric values.
- loss_fnc: function that takes two numeric series as input parameters and returns a numeric value. It has a default value of MSE_loss.
The series are of the same length and contain numeric values only (all null and non-numeric values have been dropped). The result of computing the loss_fnc on the inputs y_actual and y_estimate is returned.
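
The slope and intercept formulas quoted above can be tried on a small made-up dataset. The following is only a sketch, not the reference solution: the helper name lin_reg_sketch and the toy values are assumptions for illustration, and it relies on pandas' Series.corr for r.

```python
import pandas as pd

def lin_reg_sketch(x, y):
    """Return (theta_0, theta_1) using slope = r * std(y) / std(x)."""
    r = x.corr(y)                            # Pearson correlation coefficient
    theta_1 = r * y.std() / x.std()          # slope
    theta_0 = y.mean() - theta_1 * x.mean()  # y-intercept
    return theta_0, theta_1

# Toy data lying exactly on the line y = 2x + 1 (made up for illustration).
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
theta_0, theta_1 = lin_reg_sketch(x, y)
print(theta_0, theta_1)  # approximately 1.0 and 2.0, up to rounding
```

Because both standard deviations use the same normalization, their ratio is the same whether the sample or the population version is used.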

For example, if the housing and population data files are downloaded (and your functions are imported from your answer file as p7), then a sample run of the program:

df = p7.make_df('Housing_Database_by_NTA.csv', 'New_York_City_Population_By_Neighborhood_Tabulation_Areas.csv')
print('The DataFrame:')
print(df.head())

And the first lines would be:

The DataFrame:
   OBJECTID       boro  ...                        NTA Name Population
1       195  Manhattan  ...  Stuyvesant Town-Cooper Village      21049
3       166      Bronx  ...                  West Concourse      39282
5        37      Bronx  ...                       Bronxdale      35538
7        14   Brooklyn  ...                         Midwood      52835
9        65  Manhattan  ...                       Yorkville      77942

[5 rows x 32 columns]
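
The merge-and-filter steps described for make_df can be illustrated on toy data. This is a hedged sketch only: the rows are made up, only the column names nta2010, NTA Code, Year, total, and Population come from the assignment text, and the toy tables omit the the_geom and boro columns that the real function must also drop. It makes no claim about the column order the autograder expects.

```python
import pandas as pd

# Made-up rows; only the column names come from the assignment text.
housing = pd.DataFrame({
    'nta2010': ['MN01', 'BX02', 'BK03'],
    'total': [120.0, None, 45.0],
})
pop = pd.DataFrame({
    'NTA Code': ['MN01', 'MN01', 'BX02', 'BK03'],
    'Year': [2000, 2010, 2010, 2010],
    'Population': [900, 1000, 2000, 3000],
})

# Join the two tables on their differently named key columns.
merged = housing.merge(pop, left_on='nta2010', right_on='NTA Code')

# Keep only rows with a non-null total that come from the 2010 census.
merged = merged[merged['total'].notnull() & (merged['Year'] == 2010)]

# Drop the housing-side key, which duplicates 'NTA Code' after the merge.
merged = merged.drop(columns=['nta2010'])
print(merged)
```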

We can use our next function to compute a regression line for the Population and total columns:

theta_0, theta_1 = p7.compute_lin_reg(df['Population'],df['total'])
print(f'The slope is {theta_1} and the y-intercept is {theta_0}.')

which prints:

The slope is 0.4536370834220062 and the y-intercept is -625.3358497794688.

We can check if our function that computes this directly is returning the same values as the sklearn package:

import pandas as pd
from sklearn import linear_model
reg = linear_model.LinearRegression()
X = pd.DataFrame(df['Population'])
y = pd.DataFrame(df['total'])
reg.fit(X,y)
print(f'For sklearn, the slope is {reg.coef_[0][0]} with y-intercept: {reg.intercept_[0]}.')

which prints:

For sklearn, the slope is 0.45363708342200604 with y-intercept: -625.3358497794616.
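
The two sets of values differ only in their last few digits, which is ordinary floating-point rounding. One way to compare such results, sketched here with the numbers copied from the two sample runs above, is a tolerance-based check rather than ==. The tolerance of 1e-9 is an arbitrary choice for illustration.

```python
import math

# Values copied from the two sample runs above: (slope, y-intercept).
ours = (0.4536370834220062, -625.3358497794688)
sklearn_vals = (0.45363708342200604, -625.3358497794616)

# Compare each pair within a relative tolerance instead of exact equality.
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(ours, sklearn_vals))
print(same)  # True
```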

We can plot the original data with the regression line:

import matplotlib.pyplot as plt
import numpy as np
xes = np.array([0,df['Population'].max()])
yes = theta_1*xes + theta_0
plt.scatter(df['Population'],df['total'])
plt.plot(xes,yes,color='r')
plt.title(f'Regression line with m = {theta_1:{4}.{2}} and y-intercept = {theta_0:{4}.{4}}')
plt.show()

which would display a scatter plot of the data with the regression line drawn in red.

Next, let's test out the function that computes the regression by borough. We note that for the default value of boro = ['All'], the function behaves identically to compute_lin_reg:

theta_0, theta_1 = p7.compute_boro_lr(df,'Population','total')
print(f'The slope is {theta_1} and the y-intercept is {theta_0}.')

which prints, as expected:

The slope is 0.4536370834220062 and the y-intercept is -625.3358497794688.

For other values of boro, it will restrict the data set to those boroughs:

si_0, si_1 = p7.compute_boro_lr(df,'Population','total',boro=['Staten Island'])
print(f'SI:  The slope is {si_1} and the y-intercept is {si_0}.')
q_0, q_1 = p7.compute_boro_lr(df,'Population','total',boro=['Queens'])
print(f'Queens: The slope is {q_1} and the y-intercept is {q_0}.')
b_0, b_1 = p7.compute_boro_lr(df,'Population','total',boro=['Bronx','Brooklyn'])
print(f'B&B: The slope is {b_1} and the y-intercept is {b_0}.')

which prints:

SI:  The slope is 0.3892292629712051 and the y-intercept is -2.5490753943668096.
Queens: The slope is 0.39404151495697937 and the y-intercept is -27.156488411430473.
B&B: The slope is 0.3849458413516747 and the y-intercept is 1458.8043454328654.

(Images corresponding to the various values of boro = ['All'] are above.)
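
The restriction to a list of boroughs can be sketched with pandas' isin. The helper restrict_to_boros and the toy rows below are hypothetical and only illustrate the filtering step. The column name Borough follows the function description above.

```python
import pandas as pd

# Made-up rows; 'Borough', 'Population', and 'total' follow the names used above.
toy = pd.DataFrame({
    'Borough': ['Queens', 'Bronx', 'Brooklyn', 'Queens'],
    'Population': [100, 200, 300, 400],
    'total': [10, 20, 30, 40],
})

def restrict_to_boros(df, boro=('All',)):
    """Hypothetical helper: return df unchanged for 'All', else only rows in boro."""
    if list(boro) == ['All']:
        return df
    return df[df['Borough'].isin(boro)]

n_all = len(restrict_to_boros(toy))
n_bb = len(restrict_to_boros(toy, ['Bronx', 'Brooklyn']))
print(n_all, n_bb)  # 4 2
```

The sketch uses a tuple as its default only to avoid a mutable default argument. The assignment's own signature, boro=["All"], is what the autograder will call.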

Lastly, we have functions that will compute the error using different loss functions. Let's start with identical columns, to make sure the functions return 0:

loss = p7.compute_error(df['total'],df['total'])
print(f'The loss is {loss} for total vs total.')

which prints:

The loss is 0.0 for total vs total.

Now, trying the function on our models:

y_est = theta_0 + theta_1*df['Population']
loss = p7.compute_error(df['total'],y_est)
print(f'The loss is {loss} for default loss function: MSE.')
loss = p7.compute_error(df['total'],y_est,loss_fnc= p7.RMSE)
print(f'The loss is {loss} for RMSE.')

which prints:

The loss is 28443412.98831634 for default loss function: MSE.
The loss is 5333.23663344468 for RMSE.
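
Note that the RMSE above is the square root of the MSE. The way the two loss functions relate, and how a loss function can be passed as an argument with a default, can be sketched on made-up values. The names mse_sketch, rmse_sketch, and error_sketch are placeholders for illustration, not the required function names.

```python
import numpy as np
import pandas as pd

def mse_sketch(y_actual, y_estimate):
    """Mean of the squared differences."""
    return ((y_actual - y_estimate) ** 2).mean()

def rmse_sketch(y_actual, y_estimate):
    """Square root of the mean squared error."""
    return np.sqrt(mse_sketch(y_actual, y_estimate))

def error_sketch(y_actual, y_estimate, loss_fnc=mse_sketch):
    """Apply whichever loss function is passed in (MSE by default)."""
    return loss_fnc(y_actual, y_estimate)

# Made-up values: the only nonzero difference is 3, so the squares are 0, 0, 9.
y_actual = pd.Series([1.0, 2.0, 3.0])
y_estimate = pd.Series([1.0, 2.0, 6.0])
mse_val = error_sketch(y_actual, y_estimate)
rmse_val = error_sketch(y_actual, y_estimate, loss_fnc=rmse_sketch)
print(mse_val, rmse_val)  # the MSE is 3.0 and the RMSE is its square root
```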
