Program 3: Restaurant Rankings. Due noon, Thursday, 24 February.
The NYC Department of Health and Mental Hygiene regularly inspects restaurants and releases the results. These results are also available in CSV files at OpenData NYC. This programming assignment focuses on predicting letter grades for restaurants that have yet to be graded, as well as computing summary statistics by neighborhood.
The assignment is broken into the functions described below to allow for unit testing. The examples that follow the function descriptions assume your functions are in p3.py. They use the SCORE to predict a grade for each inspection, use the numeric grade to compute the averages for neighborhoods for both provided and predicted scores, and combine those averages with the NTA table to make it easier to find scores for neighborhoods.
Hints: see the notes on the DtypeWarning and numeric_only at the end of this page.
Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting and joining data.
Available Libraries: Pandas and core Python 3.6+.
Data Sources: Neighborhood Tabulation Areas, Restaurant Inspection Data @ OpenData NYC, NYC Department of Health Restaurant Grading.
Sample Datasets: Neighborhood Tabulation Areas: nynta.csv. Restaurant Inspections: restaurants1Aug21.csv, restaurants30July.csv.
make_insp_df(file_name):
This function takes one input:
file_name: the name of a CSV file containing Restaurant Inspection Data from OpenData NYC.
The function should open the file file_name as a DataFrame, keeping only the columns:
'CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'SCORE', 'GRADE', 'NTA'
If the SCORE is null for a row, that row should be dropped. The resulting DataFrame is returned.
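The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the required solution: the in-memory CSV text, its extra PHONE column, and the three restaurant rows are made up for illustration and stand in for a real inspection file.

```python
import io
import pandas as pd

# The nine columns the assignment asks to keep.
KEEP_COLS = ['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET',
             'ZIPCODE', 'SCORE', 'GRADE', 'NTA']

def make_insp_df(file_name):
    """Read inspection data, keep nine columns, drop rows with a null SCORE."""
    df = pd.read_csv(file_name, usecols=KEEP_COLS)
    return df.dropna(subset=['SCORE'])

# Hypothetical three-row file (not real inspection data); row 1 has no SCORE.
sample_csv = io.StringIO(
    "CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,SCORE,GRADE,NTA\n"
    "1,CAFE ONE,Manhattan,10,MAIN ST,10001,5550001,4,A,MN15\n"
    "2,CAFE TWO,Bronx,20,OAK AVE,10451,5550002,,,BX29\n"
    "3,CAFE THREE,Brooklyn,30,ELM ST,11201,5550003,31,,BK25\n"
)
demo_df = make_insp_df(sample_csv)
print(demo_df)
```

Because pd.read_csv accepts either a path or a file-like object, the same function can be called with 'restaurants1Aug21.csv'. Dropping rows leaves gaps in the index rather than renumbering it.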
predict_grade(num_violations):
This function takes one input:
num_violations: the number of violation points.
The function should then return the letter grade that corresponds to the number of violation points num_violations, following the scale from NYC Department of Health Restaurant Grading.
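A possible sketch of the grade lookup is below. The cutoffs are an assumption taken from the NYC Department of Health's published grading scale (0 to 13 points is an A, 14 to 27 a B, 28 or more a C); they are not spelled out in this text, although they agree with the sample output further down (for example, 16 points gives B and 31 points gives C).

```python
def predict_grade(num_violations):
    """Map violation points to a letter grade.

    Cutoffs assumed from the NYC Department of Health grading scale:
    0-13 -> 'A', 14-27 -> 'B', 28 or more -> 'C'.
    """
    if num_violations <= 13:
        return 'A'
    if num_violations <= 27:
        return 'B'
    return 'C'

# Scores taken from the sample output in this assignment.
for points in [4.0, 16.0, 31.0]:
    print(points, predict_grade(points))
```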
grade2num(grade):
This function takes one input:
grade: a letter grade or null value.
The function returns the grade on a 4.0 scale for grade = 'A', 'B', or 'C' (i.e. 4.0, 3.0, or 2.0, respectively). If grade is None or some other value, return None.
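One way to express this mapping is a dictionary lookup, sketched below. The name GRADE_POINTS is introduced here for illustration only; any approach that returns None for unlisted values would do.

```python
# Hypothetical helper table: letter grade -> value on the 4.0 scale.
GRADE_POINTS = {'A': 4.0, 'B': 3.0, 'C': 2.0}

def grade2num(grade):
    """Return 4.0, 3.0, or 2.0 for 'A', 'B', or 'C'; otherwise None."""
    # dict.get returns None for any key that is missing, which covers
    # None, NaN, and letters such as 'N' that are not on the list.
    return GRADE_POINTS.get(grade)

for g in ['A', 'B', 'C', 'N', None]:
    print(g, grade2num(g))
```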
make_nta_df(file_name):
This function takes one input:
file_name: the name of a CSV file containing neighborhood tabulation areas (nynta.csv).
The function should open the file file_name as a DataFrame and return a DataFrame containing only the columns NTACode and NTAName.
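A short sketch of this function follows, assuming the file has a header row that includes NTACode and NTAName. The in-memory text and its BoroName column are stand-ins invented for the example; the two neighborhood names are taken from the sample output later in this assignment.

```python
import io
import pandas as pd

def make_nta_df(file_name):
    """Read the NTA file, keeping only the code and name columns."""
    return pd.read_csv(file_name, usecols=['NTACode', 'NTAName'])

# Hypothetical stand-in for nynta.csv with one extra column.
sample_nta = io.StringIO(
    "BoroName,NTACode,NTAName\n"
    "Brooklyn,BK09,Brooklyn Heights-Cobble Hill\n"
    "Brooklyn,BK26,Gravesend\n"
)
demo_nta = make_nta_df(sample_nta)
print(demo_nta)
```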
compute_ave_grade(df,col):
This function takes two inputs:
df: a DataFrame containing Restaurant Inspection Data from OpenData NYC.
col: the name of a numeric-valued column in the DataFrame.
This function returns a DataFrame with two columns, the NTACode and the average of col for each NTA.
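The averaging step can be sketched with a groupby, as below. This sketch assumes the neighborhood code ends up as the index (named NTA) with col as the only column, which is how the sample output in this assignment is laid out; the small DataFrame is made up for the example.

```python
import pandas as pd

def compute_ave_grade(df, col):
    """Average col within each NTA; the result is indexed by NTA."""
    # The double brackets keep the result a DataFrame rather than a Series.
    return df.groupby('NTA')[[col]].mean()

# Hypothetical data: BK26 has no provided grade at all.
demo = pd.DataFrame({
    'NTA': ['BK09', 'BK09', 'BK26', 'BK28', 'BK28', 'BK28', 'BK28'],
    'NUM': [4.0, 4.0, None, 4.0, 2.0, 4.0, 3.0],
})
ave = compute_ave_grade(demo, 'NUM')
print(ave)
```

Missing values are skipped when averaging, so a neighborhood with no numeric grades at all gets NaN.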
neighborhood_grades(ave_df,nta_df):
This function takes two inputs:
ave_df: a DataFrame containing the column 'NTA'.
nta_df: a DataFrame with two columns, 'NTACode' and 'NTAName'.
This function returns a DataFrame with the neighborhood names (i.e. NTAName) and the columns from ave_df. The columns NTA and NTACode should be dropped before returning the DataFrame.
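The join described above might look like the following sketch. It assumes NTA may arrive either as a regular column or as the index of ave_df, and it uses an inner merge, so neighborhoods without a matching NTACode are left out; both choices are assumptions, not requirements stated here. The values are copied from the sample output in this assignment.

```python
import pandas as pd

def neighborhood_grades(ave_df, nta_df):
    """Attach NTAName to the averages, then drop the two code columns."""
    # If NTA is the index (as with a groupby result), turn it into a column.
    if 'NTA' not in ave_df.columns:
        ave_df = ave_df.reset_index()
    merged = ave_df.merge(nta_df, left_on='NTA', right_on='NTACode')
    return merged.drop(columns=['NTA', 'NTACode'])

ave_df = pd.DataFrame(
    {'NUM': [4.0, None], 'PRE NUM': [4.0, 2.0]},
    index=pd.Index(['BK09', 'BK26'], name='NTA'),
)
nta_df = pd.DataFrame({
    'NTACode': ['BK26', 'BK09'],
    'NTAName': ['Gravesend', 'Brooklyn Heights-Cobble Hill'],
})
named = neighborhood_grades(ave_df, nta_df)
print(named)
```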
For example, assuming your functions are in p3.py:
df = p3.make_insp_df('restaurants1Aug21.csv')
print(df)
will print:
        CAMIS                         DBA           BORO  ...  SCORE GRADE   NTA
0    41178124                     CAFE 57      Manhattan  ...    4.0     A  MN15
1    50111450              CASTLE CHICKEN          Bronx  ...   41.0     N  BX29
2    40699339     NICK GARDEN COFFEE SHOP          Bronx  ...   31.0   NaN  BX05
3    41181395                     DUNKIN'       Brooklyn  ...   10.0     A  BK25
4    50052976           ZON BAKERY & CAFE      Manhattan  ...   72.0   NaN  MN36
..        ...                         ...            ...  ...    ...   ...   ...
240  50052976           ZON BAKERY & CAFE      Manhattan  ...   72.0   NaN  MN36
241  41525768               THE WEST CAFE       Brooklyn  ...   10.0     A  BK73
242  50111132  BUONASERA RESTAURANT PIZZA       Brooklyn  ...   16.0     N  BK30
243  40399672         BAGELS & CREAM CAFE         Queens  ...   12.0     A  QN06
244  50104259           ROYAL COFFEE SHOP  Staten Island  ...   69.0     N  SI22
[243 rows x 9 columns]
Note that all the rows are included (243) but that only the 9 specified columns are retained in the DataFrame. Several rows have null entries for GRADE (e.g. rows 2, 4, and 240) while others have letter grades (such as 'N') that are not on the list of possible grades.
Using the SCORE to compute the likely grade for each inspection, as both a letter and its equivalent on a 4.0 grading scale:
df['NUM'] = df['GRADE'].apply(p3.grade2num)
df['PREDICTED'] = df['SCORE'].apply(p3.predict_grade)
df['PRE NUM'] = df['PREDICTED'].apply(p3.grade2num)
print(df[ ['DBA','SCORE','GRADE','NUM','PREDICTED','PRE NUM'] ])
prints the predicted grade and equivalent numeric grade on the 4.0 scale:
                            DBA  SCORE GRADE  NUM PREDICTED  PRE NUM
0                       CAFE 57    4.0     A  4.0         A      4.0
1                CASTLE CHICKEN   41.0     N  NaN         C      2.0
2       NICK GARDEN COFFEE SHOP   31.0   NaN  NaN         C      2.0
3                       DUNKIN'   10.0     A  4.0         A      4.0
4             ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
..                          ...    ...   ...  ...       ...      ...
240           ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
241               THE WEST CAFE   10.0     A  4.0         A      4.0
242  BUONASERA RESTAURANT PIZZA   16.0     N  NaN         B      3.0
243         BAGELS & CREAM CAFE   12.0     A  4.0         A      4.0
244           ROYAL COFFEE SHOP   69.0     N  NaN         C      2.0
[243 rows x 6 columns]
We can use the numeric grade to compute the averages for neighborhoods for both provided and predicted scores:
actual_scores = p3.compute_ave_grade(df,'NUM')
predicted_scores = p3.compute_ave_grade(df,'PRE NUM')
scores = actual_scores.join(predicted_scores, on='NTA')
print(scores.head())
The first couple of rows are:
      NUM   PRE NUM
NTA
BK09  4.0  4.000000
BK17  4.0  4.000000
BK25  4.0  4.000000
BK26  NaN  2.000000
BK28  4.0  3.250000
The averages are sometimes the same, but almost always decrease, when we include the grades predicted from the reported scores.
To make it easier to find scores for neighborhoods we combine with the NTA table:
nta_df = p3.make_nta_df('nynta.csv')
scores_with_nbhd_names = p3.neighborhood_grades(scores,nta_df)
print(scores_with_nbhd_names.head())
The first couple of rows are:
   NUM   PRE NUM                                         NTAName
0  4.0  4.000000                    Brooklyn Heights-Cobble Hill
1  4.0  4.000000  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
2  4.0  4.000000                                       Homecrest
3  NaN  2.000000                                       Gravesend
4  4.0  3.250000                                Bensonhurst West
When reading in a large CSV file, you may see a warning such as the following (this one was produced when reading in the parking ticket data):
sys:1: DtypeWarning: Columns (39) have mixed types. Specify dtype option on import or set low_memory=False.
Pandas tries to infer the data type (dtype) of the columns from the values. Since some columns are a mixture of numeric and character types this can be difficult. If the file is read in with pd.read_csv(file_name, low_memory=False), the entire column is read in and used to determine type.
numeric_only = True.
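The two options named in these hints can be tried on a small example, sketched below. The CSV text is invented for illustration, and using numeric_only=True with a groupby mean is an assumption about where that option is meant to be applied; the hint itself does not say.

```python
import io
import pandas as pd

# Hypothetical data: CODE mixes numbers and text, DBA is text.
sample_csv = io.StringIO(
    "NTA,DBA,SCORE,CODE\n"
    "BK09,CAFE ONE,4,10\n"
    "BK09,CAFE TWO,12,20\n"
    "BK26,CAFE THREE,31,X5\n"
)

# low_memory=False makes pandas read each column in full before
# choosing its type, instead of inferring types chunk by chunk.
df = pd.read_csv(sample_csv, low_memory=False)
code_values_are_text = all(isinstance(v, str) for v in df['CODE'])

# numeric_only=True averages only the numeric columns and leaves
# text columns such as DBA and CODE out of the result.
ave = df.groupby('NTA').mean(numeric_only=True)
print(ave)
```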