


Program 3: Restaurant Rankings
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022



Program Description

Program 3: Restaurant Rankings. Due noon, Thursday, 24 February.
Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting and joining data.
Available Libraries: Pandas and core Python 3.6+.
Data Sources:
Neighborhood Tabulation Areas, Restaurant Inspection Data @ OpenData NYC, NYC Department of Health Restaurant Grading.
Sample Datasets: Neighborhood Tabulation Areas: nynta.csv.
Restaurant Inspections: restaurants1Aug21.csv, restaurants30July.csv.

The NYC Department of Health and Mental Hygiene regularly inspects restaurants and releases the results.

These results are also available in CSV files at OpenData NYC. This programming assignment focuses on predicting letter grades for restaurants that have yet to be graded, as well as computing summary statistics by neighborhood. The assignment is broken into the following functions to allow for unit testing:

For example, assuming your functions are in p3.py:

df = p3.make_insp_df('restaurants1Aug21.csv')
print(df)
will print:
        CAMIS                         DBA           BORO  ... SCORE GRADE   NTA
0    41178124                     CAFE 57      Manhattan  ...   4.0     A  MN15
1    50111450              CASTLE CHICKEN          Bronx  ...  41.0     N  BX29
2    40699339     NICK GARDEN COFFEE SHOP          Bronx  ...  31.0   NaN  BX05
3    41181395                     DUNKIN'       Brooklyn  ...  10.0     A  BK25
4    50052976           ZON BAKERY & CAFE      Manhattan  ...  72.0   NaN  MN36
..        ...                         ...            ...  ...   ...   ...   ...
240  50052976           ZON BAKERY & CAFE      Manhattan  ...  72.0   NaN  MN36
241  41525768               THE WEST CAFE       Brooklyn  ...  10.0     A  BK73
242  50111132  BUONASERA RESTAURANT PIZZA       Brooklyn  ...  16.0     N  BK30
243  40399672         BAGELS & CREAM CAFE         Queens  ...  12.0     A  QN06
244  50104259           ROYAL COFFEE SHOP  Staten Island  ...  69.0     N  SI22

[243 rows x 9 columns]
Note that all 243 rows are included, but only the 9 specified columns are retained in the DataFrame. Several rows have null entries for GRADE (e.g. rows 2, 4, and 240), while others have letter grades (such as 'N') that are not on the list of possible grades.
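
As a rough illustration of keeping every row while retaining only chosen columns, here is a minimal sketch. The helper name select_columns, the EXTRA column, and the six-column list are hypothetical stand-ins (the assignment specifies 9 columns, only some of which appear in the printout above); the in-memory CSV replaces the real file.

```python
import io
import pandas as pd

# Tiny stand-in for an inspection file. EXTRA is a made-up column that
# represents the many columns of the real CSV that should be dropped.
csv_text = """CAMIS,DBA,BORO,EXTRA,SCORE,GRADE,NTA
41178124,CAFE 57,Manhattan,x,4,A,MN15
50111450,CASTLE CHICKEN,Bronx,y,41,N,BX29
40699339,NICK GARDEN COFFEE SHOP,Bronx,z,31,,BX05
"""

# Hypothetical subset: only the columns visible in the printout above.
keep_cols = ['CAMIS', 'DBA', 'BORO', 'SCORE', 'GRADE', 'NTA']

def select_columns(source, cols):
    """Read a CSV and keep every row, but only the columns listed in cols."""
    df = pd.read_csv(source)
    return df[cols]

demo = select_columns(io.StringIO(csv_text), keep_cols)
print(demo)
```

An empty GRADE field is read as NaN, which is why ungraded inspections show up as null entries rather than being dropped.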

Using the SCORE to compute the likely grade for each inspection, as both a letter and its equivalent on a 4.0 grading scale, yields:

df['NUM'] = df['GRADE'].apply(p3.grade2num)
df['PREDICTED'] = df['SCORE'].apply(p3.predict_grade)
df['PRE NUM'] = df['PREDICTED'].apply(p3.grade2num)
print(df[ ['DBA','SCORE','GRADE','NUM','PREDICTED','PRE NUM'] ])
prints the predicted grade and its equivalent numeric grade on the 4.0 scale:
                           DBA  SCORE GRADE  NUM PREDICTED  PRE NUM
0                       CAFE 57    4.0     A  4.0         A      4.0
1                CASTLE CHICKEN   41.0     N  NaN         C      2.0
2       NICK GARDEN COFFEE SHOP   31.0   NaN  NaN         C      2.0
3                       DUNKIN'   10.0     A  4.0         A      4.0
4             ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
..                          ...    ...   ...  ...       ...      ...
240           ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
241               THE WEST CAFE   10.0     A  4.0         A      4.0
242  BUONASERA RESTAURANT PIZZA   16.0     N  NaN         B      3.0
243         BAGELS & CREAM CAFE   12.0     A  4.0         A      4.0
244           ROYAL COFFEE SHOP   69.0     N  NaN         C      2.0

[243 rows x 6 columns]
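
One possible sketch of the two helpers used above. The cut-offs are an assumption: scores of 0 to 13 map to A, 14 to 27 to B, and 28 or more to C, which agrees with the sample output (16 gives B, 31 gives C) but should be checked against the Restaurant Grading source. Returning None for a missing score is also an assumption.

```python
import math

def grade2num(grade):
    """Map a letter grade to the 4.0 scale; anything else becomes NaN."""
    scale = {'A': 4.0, 'B': 3.0, 'C': 2.0}
    return scale.get(grade, math.nan)

def predict_grade(score):
    """Guess the letter grade from an inspection score (lower is better).

    Assumed cut-offs: below 14 is A, below 28 is B, otherwise C.
    """
    if score is None or math.isnan(score):
        return None  # assumption: no score means no prediction
    if score < 14:
        return 'A'
    if score < 28:
        return 'B'
    return 'C'

for score in (4.0, 16.0, 72.0):
    letter = predict_grade(score)
    print(score, letter, grade2num(letter))
```

Because grade2num falls back to NaN for any key it does not know, grades such as 'N' and missing grades are both treated as "no numeric grade", matching the NUM column in the sample output.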

We can use the numeric grade to compute the averages for neighborhoods for both provided and predicted scores:

actual_scores = p3.compute_ave_grade(df,'NUM')
predicted_scores = p3.compute_ave_grade(df,'PRE NUM')
scores = actual_scores.join(predicted_scores, on='NTA')
print(scores.head())
The first few rows are:
      NUM   PRE NUM
NTA
BK09  4.0  4.000000
BK17  4.0  4.000000
BK25  4.0  4.000000
BK26  NaN  2.000000
BK28  4.0  3.250000
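
The per-neighborhood averaging can be sketched as a groupby on NTA followed by a mean. This is one way to get output shaped like the table above, not necessarily the intended solution; the toy rows are invented for illustration.

```python
import pandas as pd

def compute_ave_grade(df, col):
    """Average col within each NTA; mean() skips NaN entries."""
    # Double brackets keep the result a one-column DataFrame indexed by NTA.
    return df.groupby('NTA')[[col]].mean()

# Invented toy data: BK26 has no usable actual grade at all.
toy = pd.DataFrame({
    'NTA': ['BK28', 'BK28', 'BK26', 'BK25'],
    'NUM': [4.0, None, None, 4.0],
    'PRE NUM': [4.0, 2.0, 2.0, 4.0],
})

actual = compute_ave_grade(toy, 'NUM')
predicted = compute_ave_grade(toy, 'PRE NUM')
scores = actual.join(predicted)  # both frames share the NTA index
print(scores)
```

A neighborhood whose actual grades are all missing gets NaN for NUM, while its predicted average is still defined, as with BK26 in the sample output.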

To make it easier to find scores for neighborhoods, we combine the scores with the NTA table:

nta_df = p3.make_nta_df('nynta.csv')
scores_with_nbhd_names = p3.neighborhood_grades(scores,nta_df)
print(scores_with_nbhd_names.head())
The first few rows are:
    NUM   PRE NUM                                         NTAName
0   4.0  4.000000                    Brooklyn Heights-Cobble Hill
1   4.0  4.000000  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
2   4.0  4.000000                                       Homecrest
3   NaN  2.000000                                       Gravesend
4   4.0  3.250000                                Bensonhurst West
The neighborhood averages either stay the same or, more often, decrease when we include the grades predicted from the reported scores.
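
Attaching neighborhood names to the averages can be sketched as a merge between the scores and the NTA table. The column names NTACode and NTAName for the NTA table are assumptions here (only NTAName appears in the output above), and the function body is one possible approach rather than the required one.

```python
import pandas as pd

# Averages indexed by NTA code, shaped like the result of the join above.
scores = pd.DataFrame(
    {'NUM': [4.0, None], 'PRE NUM': [4.0, 2.0]},
    index=pd.Index(['BK25', 'BK26'], name='NTA'),
)

# Stand-in for the NTA table; the NTACode column name is assumed.
nta_df = pd.DataFrame({
    'NTACode': ['BK25', 'BK26', 'BK28'],
    'NTAName': ['Homecrest', 'Gravesend', 'Bensonhurst West'],
})

def neighborhood_grades(scores, nta_df):
    """Replace NTA codes with neighborhood names via a left merge."""
    merged = scores.reset_index().merge(
        nta_df, left_on='NTA', right_on='NTACode', how='left')
    # Drop both code columns so only the averages and the name remain.
    return merged.drop(columns=['NTA', 'NTACode'])

named = neighborhood_grades(scores, nta_df)
print(named)
```

A left merge keeps every neighborhood that has a score, even if its code were missing from the NTA table, and the result carries a fresh 0-based index like the printout above.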

Hints:

  • Most aggregation functions have an option to ignore non-numeric data in the calculation. For example, to average only the numeric columns of a grouped DataFrame, pass the keyword argument numeric_only=True to mean() after calling groupby().
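
A small sketch of the hint above, using an invented three-row frame: with numeric_only=True, the text column GRADE is left out of the averages instead of causing an error.

```python
import pandas as pd

mixed = pd.DataFrame({
    'NTA': ['MN15', 'MN15', 'BX29'],
    'GRADE': ['A', 'B', 'N'],   # text column, cannot be averaged
    'NUM': [4.0, 3.0, None],    # numeric column with a missing value
})

# Only numeric columns are averaged; GRADE is silently excluded.
means = mixed.groupby('NTA').mean(numeric_only=True)
print(means)
```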