Program 6, CSci 39542: Data Science, Hunter College

Program 6: Housing Units
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022

Program Description

Program 6: Housing Units. Due noon, Thursday, 17 March.
Learning Objective: to reinforce Pandas skills by aggregating and cleaning to use in map visualiation, and summary statistics methods in Pandas.
Available Libraries: pandas and core Python 3.6+.
Data Sources: NYC Department of City Planning (DCP) Housing Database and Neighorhood Tabulation Areas.
Sample Datasets: Housing_Database_by_NTA.csv, NYC_population_by_NTA.csv.

NYC Department of City Planning (DCP) Housing Database contains all approved construction and demolition jobs since 2010. Summary information about it are provided via OpenData NYC. A summary, recorded as net housing units, by Neighborhood Tabulation Areas:

https://data.cityofnewyork.us/Housing-Development/Housing-Database-by-NTA/kyz5-72x5

The DCP also provides a summary of population in New York City from the 2000 and 2010 censuses, organized by neighborhood tabulation areas (NTAs):

https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Neighborhood-Tabulatio/swpk-hqdp

Using the choropleth maps discussed in Lecture 8, we can visualize the growth in housing units:

If you would like to create the map above, the code is housingChoropleth.py using nta.geojson and Housing_Database_by_NTA.csv.

What factors are most correlated with the increase in housing units? This program explores which factors are most correlated with the net increase. The assignment is broken into the following functions to allow for unit testing:

make_housing_df(file_name): This function takes one input:
- file_name: the name of a CSV file containing housing units from OpenData NYC.
The function should open the file file_name as a DataFrame. If the total is null for a row, that row should be dropped. The column nta2010 should be renamed NTA Code. The resulting DataFrame is returned.
make_pop_df(file_name): This function takes one input:
- file_name: the name of a CSV file containing population counts from OpenData NYC.
The function should open the file file_name as a DataFrame. Only rows containing data for 2010 should be kept. The resulting DataFrame is returned.
combine_df(housing_df, pop_df, cols): This function takes three inputs:
- housing_df: a DataFrame containing the column NTA Code.
- pop_df: a DataFrame containing the column NTA Code.
- cols: a list containing a subset of columns of housing_df and pop_df.
Returns a DataFrame that merges the two inputted DataFrames on their common key, NTA Code. The returned DataFrame should include only the columns specified in cols.
compute_density(df, col = 'Density'): This function takes two inputs:
- df: a DataFrame containing the columns Population and Shape__Area.
- col: a string.
The function computes a new column, col, that is the quotient of the columns Population and Shape__Area. The resulting DataFrame with this additional column is returned..
most_corr(df, y = 'total', xes = ['Population','Shape__Area','Density','comp2010']):: This function takes three inputs:
- df: a DataFrame containing the columns listed in y and xes.
- xes: list of column names in df.
- y: the name of a column in df.
Returns the column name and Pearson's R correlation coefficient from xes that has the highest absolute correlation with y (i.e. the absolute value of Pearson's R).
convert_std_units(ser): This function takes one input:
- ser: a Series.
Takes a Series of numeric values and converts to standard units, that is, it computes the mean and standard deviation of ser, and for each s in ser, computes (s - mean)/(standard deviation) and returns the resulting series.

For example, if the housing and population data files are downloaded (and your functions are imported from a file p6), then a sample run of the program:

housing_df = p6.make_housing_df('Housing_Database_by_NTA.csv')
print('The housing DataFrame:')
print(housing_df.head())
pop_df = p6.make_pop_df('NYC_Population_By_NTA.csv')
print('The population DataFrame:')
print(pop_df.head())

And the first lines would be:

The housing DataFrame:
                                            the_geom  OBJECTID  ... totaladj  total
0  MULTIPOLYGON (((-73.975111993168 40.7353985164...       195  ...    11240  11240
1  MULTIPOLYGON (((-73.91191954925 40.84326617245...       166  ...    13998  13998
2  MULTIPOLYGON (((-73.861380652904 40.8713448048...        37  ...    14622  14622
3  MULTIPOLYGON (((-73.960149163484 40.6289234266...        14  ...    21773  21773
4  MULTIPOLYGON (((-73.938047836582 40.7808378162...        65  ...    48481  48481

[5 rows x 28 columns]
The population DataFrame:
    Borough  Year  ...                         NTA Name Population
195   Bronx  2010  ...               Claremont-Bathgate      31078
196   Bronx  2010  ...  Eastchester-Edenwald-Baychester      34517
197   Bronx  2010  ...       Bedford Park-Fordham North      54415
198   Bronx  2010  ...                          Belmont      27378
199   Bronx  2010  ...                        Bronxdale      35538

We can use our next function to combine the DataFrames:

cols = list(pop_df.columns) + ['boro','comp2010','comp2015','comp2019','total','Shape__Area']
df = p6.combine_df(housing_df,pop_df,cols)
print(df.head())
print(df.columns)

And the first lines would be:

     Borough  Year  FIPS County Code  ... comp2019  total   Shape__Area
0  Manhattan  2010                61  ...      NaN  11240  5.582242e+06
1      Bronx  2010                 5  ...    127.0  13998  1.937895e+07
2      Bronx  2010                 5  ...      0.0  14622  1.518562e+07
3   Brooklyn  2010                47  ...    457.0  21773  3.579958e+07
4  Manhattan  2010                61  ...    -67.0  48481  1.360204e+07

[5 rows x 12 columns]

We can compute a new column, representing population density:

df = p6.compute_density(df)
print(df[['NTA Code','Population','Density']])

And the relevant columns contain:

    NTA Code  Population   Density
0       MN50       21049  0.003771
1       BX63       39282  0.002027
2       BX07       35538  0.002340
3       BK43       52835  0.001476
4       MN32       77942  0.005730
..       ...         ...       ...
190     QN01       38894  0.000973
191     BX37       29250  0.000806
192     QN49       30773  0.000446
193     SI48       25238  0.000501
194     MN31       80771  0.003755

[195 rows x 3 columns]

We can find the columns with the largest absolute correlation coefficient R:

most_c,r = p6.most_corr(df)
print(f'For the default list of columns, the most correlated is {most_c} with r = {r}.')
comp_y = ['comp2010','comp2015','comp2019']
years_c,y_r = p6.most_corr(df, xes = comp_y)
print(f'For {comp_y}, most correlated is {years_c} with r = {r}.')

And the first lines would be:

For the default list of columns, the most correlated is Population with r = 0.4001044380137653.
For ['comp2010', 'comp2015', 'comp2019'], most correlated is comp2010 with r = 0.4001044380137653.

A way to see the relationships is to plot the data. Plotting the raw data with:

import seaborn as sns
import matplotlib.pyplot as plt
plt.scatter(x=df["Population"],y=df['total'])
plt.scatter(x=df["comp2010"],y=df['total'])
plt.scatter(x=df["Density"],y=df['total'])
plt.title('Increase in Housing Units (2010-2019) by NTA')
plt.ylabel('Change in Units by NTA')
plt.legend(labels=['Population','Completed in 2010','Density'])
plt.show()

yields

which is difficult to use for comparisons due to the difference in units between the different Series. We can transform the Series into standard units:

tot_scaled = p6.convert_std_units(df['total'])
plt.scatter(x=p6.convert_std_units(df["Population"]),y=tot_scaled)
plt.scatter(x=p6.convert_std_units(df["comp2010"]),y=tot_scaled)
plt.scatter(x=p6.convert_std_units(df["Density"]),y=tot_scaled)
#plt.scatter(x="comp2010",y='total',data=df)
#sns.lmplot(x="Density",y='total',data=df)
plt.title('Increase in Housing Units by NTA (Standard Units)')
plt.ylabel('Change in Units, 2010-2019')
plt.legend(labels=['Population','Completed in 2010','Density'])
plt.show()

gives a plot where the relationships are clearer:

Program 6: Housing Units CSci 39542: Introduction to Data Science Department of Computer Science Hunter College, City University of New York Spring 2022

Program Description

Program 6: Housing Units
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022