Program 3, CSci 39542: Data Science, Hunter College

Program 3: Restaurant Rankings
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022

Program Description

Program 3: Restaurant Rankings. Due noon, Thursday, 24 February.
Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting and joining data.
Available Libraries: Pandas and core Python 3.6+.
Data Sources: Neighborhood Tabulation Areas, Restaurant Inspection Data @ OpenData NYC, NYC Department of Health Restaurant Grading.
Sample Datasets: Neighborhood Tabulation Areas: nynta.csv.
Restaurant Inspections: restaurants1Aug21.csv, restaurants30July.csv.

The NYC Department of Health and Mental Hygiene regularly inspects restaurants and releases the results.

These results are also available in CSV files at OpenData NYC. This programming assignment focuses on predicting letter grades for restaurants that have yet to be graded, as well as computing summary statistics by neighborhood. The assignment is broken into the following functions to allow for unit testing:

make_insp_df(file_name): This function takes one input:
- file_name: the name of a CSV file containing Restaurant Inspection Data from OpenData NYC.
The function should open the file file_name as a DataFrame, keeping only the columns:
```
'CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'SCORE', 'GRADE', 'NTA'
```
If the SCORE is null for a row, that row should be dropped. The resulting DataFrame is returned.
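The select-and-drop pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the CSV text is a made-up in-memory sample (including a hypothetical extra column, PHONE), and the name make_insp_df_sketch is not part of the assignment.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a real inspection CSV file.
SAMPLE_CSV = """CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,SCORE,GRADE,NTA
1,CAFE ONE,Manhattan,10,MAIN ST,10001,5550001,4,A,MN15
2,CAFE TWO,Bronx,20,OAK AVE,10451,5550002,,,BX29
3,CAFE THREE,Queens,30,ELM RD,11101,5550003,31,,QN06
"""

# The nine columns named in the assignment.
KEEP = ['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET',
        'ZIPCODE', 'SCORE', 'GRADE', 'NTA']

def make_insp_df_sketch(file_name):
    """Read the CSV, keep only the KEEP columns, drop rows with no SCORE."""
    df = pd.read_csv(file_name)
    df = df[KEEP]                        # select columns by name
    return df.dropna(subset=['SCORE'])   # only SCORE decides which rows go

# pd.read_csv also accepts a file-like object, which keeps the sketch self-contained.
sample_df = make_insp_df_sketch(io.StringIO(SAMPLE_CSV))
print(sample_df)
```

In this sample the second row has no SCORE and is dropped, while the third row is kept even though its GRADE is null. dropna does not renumber the index, so the remaining rows keep their original labels.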
predict_grade(num_violations): This function takes one input:
- num_violations: the number of violation points.
The function should then return the letter grade that corresponds to the number of violation points num_violations:
- "A" grade: 0 to 13 points
- "B" grade: 14 to 27 points
- "C" grade: 28 or more points
(from NYC Department of Health Restaurant Grading).
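One way to encode the thresholds listed above is a short chain of comparisons. The sketch below assumes the input is a non-negative number; how to treat fractional values between bands (such as 13.5) is not specified in the assignment, and here they fall into the higher band.

```python
def predict_grade_sketch(num_violations):
    """Map violation points to a letter grade using the listed bands."""
    if num_violations <= 13:
        return 'A'      # 0 to 13 points
    if num_violations <= 27:
        return 'B'      # 14 to 27 points
    return 'C'          # 28 or more points

# Check the boundary values of each band.
for points in (0, 13, 14, 27, 28, 72.0):
    print(points, predict_grade_sketch(points))
```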
grade2num(grade): This function takes one input:
- grade: a letter grade or null value.
and returns the grade on a 4.0 scale for grade = 'A', 'B', or 'C' (i.e. 4.0, 3.0, or 2.0, respectively). If grade is None or some other value, return None.
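A dictionary lookup is one compact way to express this mapping. In the sketch below, GRADE_POINTS and grade2num_sketch are illustrative names, and the type check is an assumption made so that null values read from a CSV (which pandas represents as the float NaN) are handled the same way as None.

```python
GRADE_POINTS = {'A': 4.0, 'B': 3.0, 'C': 2.0}

def grade2num_sketch(grade):
    """Return the 4.0-scale value for 'A', 'B', or 'C'; otherwise None."""
    if isinstance(grade, str):
        # dict.get returns None for letters that are not in the table, e.g. 'N'
        return GRADE_POINTS.get(grade)
    return None   # None, NaN, or any other non-string value

print(grade2num_sketch('A'), grade2num_sketch('N'), grade2num_sketch(None))
```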
make_nta_df(file_name): This function takes one input:
- file_name: the name of a CSV file containing neighborhood tabulation areas (nynta.csv).
The function should open the file file_name as a DataFrame and return a DataFrame containing only the columns NTACode and NTAName.
compute_ave_grade(df,col): This function takes two inputs:
- df: a DataFrame containing Restaurant Inspection Data from OpenData NYC.
- col: the name of a numeric-valued column in the DataFrame.
This function returns a DataFrame with two columns, the NTACode and the average of col for each NTA.
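The per-neighborhood average can be sketched with a groupby. The toy data below is made up, and the sketch assumes the result should be indexed by NTA with a single column of averages; whether the NTA code should instead be an ordinary column is a design choice the description leaves open.

```python
import pandas as pd

# Made-up inspection rows: two neighborhoods with grades, one with none.
toy = pd.DataFrame({
    'NTA': ['BK09', 'BK09', 'BK26', 'BK28', 'BK28'],
    'NUM': [4.0, 4.0, None, 4.0, 3.0],
})

def compute_ave_grade_sketch(df, col):
    """Average col within each NTA; null values are skipped by mean()."""
    return df.groupby('NTA')[[col]].mean()

ave = compute_ave_grade_sketch(toy, 'NUM')
print(ave)
```

A neighborhood whose values are all null (BK26 here) ends up with a null average rather than being dropped.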
neighborhood_grades(ave_df,nta_df): This function takes two inputs:
- ave_df: a DataFrame containing the column 'NTA'
- nta_df: a DataFrame with two columns, 'NTACode' and 'NTAName'.
This function returns a DataFrame with the neighborhood names (i.e. NTAName) and the columns from ave_df. The columns NTA and NTACode should be dropped before returning the DataFrame.
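Combining the two tables is a standard merge on the neighborhood code. The sketch below makes two assumptions that the description does not settle: NTA may arrive as the index rather than a column (in which case it is moved into a column first), and a left join is used so that every row of ave_df is kept. The row with code XX99 is hypothetical.

```python
import pandas as pd

ave_df = pd.DataFrame({'NUM': [4.0, 3.5]},
                      index=pd.Index(['BK09', 'BK28'], name='NTA'))
nta_df = pd.DataFrame({
    'NTACode': ['BK09', 'BK28', 'XX99'],   # XX99 is a made-up code
    'NTAName': ['Brooklyn Heights-Cobble Hill', 'Bensonhurst West',
                'Hypothetical Area'],
})

def neighborhood_grades_sketch(ave_df, nta_df):
    """Attach NTAName to each row of ave_df, then drop the code columns."""
    if 'NTA' not in ave_df.columns:
        ave_df = ave_df.reset_index()      # turn an 'NTA' index into a column
    merged = ave_df.merge(nta_df, left_on='NTA', right_on='NTACode', how='left')
    return merged.drop(columns=['NTA', 'NTACode'])

named = neighborhood_grades_sketch(ave_df, nta_df)
print(named)
```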

For example, assuming your functions are in the file p3.py:

df = p3.make_insp_df('restaurants1Aug21.csv')
print(df)

will print:

        CAMIS                         DBA           BORO  ... SCORE GRADE   NTA
0    41178124                     CAFE 57      Manhattan  ...   4.0     A  MN15
1    50111450              CASTLE CHICKEN          Bronx  ...  41.0     N  BX29
2    40699339     NICK GARDEN COFFEE SHOP          Bronx  ...  31.0   NaN  BX05
3    41181395                     DUNKIN'       Brooklyn  ...  10.0     A  BK25
4    50052976           ZON BAKERY & CAFE      Manhattan  ...  72.0   NaN  MN36
..        ...                         ...            ...  ...   ...   ...   ...
240  50052976           ZON BAKERY & CAFE      Manhattan  ...  72.0   NaN  MN36
241  41525768               THE WEST CAFE       Brooklyn  ...  10.0     A  BK73
242  50111132  BUONASERA RESTAURANT PIZZA       Brooklyn  ...  16.0     N  BK30
243  40399672         BAGELS & CREAM CAFE         Queens  ...  12.0     A  QN06
244  50104259           ROYAL COFFEE SHOP  Staten Island  ...  69.0     N  SI22

[243 rows x 9 columns]

Note that rows with a null SCORE have been dropped, leaving 243 rows (the original index labels are kept, so they run to 244), and that only the 9 specified columns are retained in the DataFrame. Several rows have null entries for GRADE (e.g. rows 2, 4, and 240), while others have letter grades (such as 'N') that are not on the list of possible grades.

Using the SCORE to compute the likely grade for each inspection, as both a letter and its equivalent on a 4.0 grading scale, yields:

df['NUM'] = df['GRADE'].apply(p3.grade2num)
df['PREDICTED'] = df['SCORE'].apply(p3.predict_grade)
df['PRE NUM'] = df['PREDICTED'].apply(p3.grade2num)
print(df[ ['DBA','SCORE','GRADE','NUM','PREDICTED','PRE NUM'] ])

prints the predicted grade and its equivalent numeric grade on the 4.0 scale:

                           DBA  SCORE GRADE  NUM PREDICTED  PRE NUM
0                       CAFE 57    4.0     A  4.0         A      4.0
1                CASTLE CHICKEN   41.0     N  NaN         C      2.0
2       NICK GARDEN COFFEE SHOP   31.0   NaN  NaN         C      2.0
3                       DUNKIN'   10.0     A  4.0         A      4.0
4             ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
..                          ...    ...   ...  ...       ...      ...
240           ZON BAKERY & CAFE   72.0   NaN  NaN         C      2.0
241               THE WEST CAFE   10.0     A  4.0         A      4.0
242  BUONASERA RESTAURANT PIZZA   16.0     N  NaN         B      3.0
243         BAGELS & CREAM CAFE   12.0     A  4.0         A      4.0
244           ROYAL COFFEE SHOP   69.0     N  NaN         C      2.0

[243 rows x 6 columns]

We can use the numeric grade to compute the averages for neighborhoods for both provided and predicted scores:

actual_scores = p3.compute_ave_grade(df,'NUM')
predicted_scores = p3.compute_ave_grade(df,'PRE NUM')
scores = actual_scores.join(predicted_scores, on='NTA')
print(scores.head())

The first few rows are:

      NUM   PRE NUM
NTA
BK09  4.0  4.000000
BK17  4.0  4.000000
BK25  4.0  4.000000
BK26  NaN  2.000000
BK28  4.0  3.250000

To make it easier to find scores for neighborhoods we combine with the NTA table:

nta_df = p3.make_nta_df('nynta.csv')
scores_with_nbhd_names = p3.neighborhood_grades(scores,nta_df)
print(scores_with_nbhd_names.head())

The first few rows are:

    NUM   PRE NUM                                         NTAName
0   4.0  4.000000                    Brooklyn Heights-Cobble Hill
1   4.0  4.000000  Sheepshead Bay-Gerritsen Beach-Manhattan Beach
2   4.0  4.000000                                       Homecrest
3   NaN  2.000000                                       Gravesend
4   4.0  3.250000                                Bensonhurst West

The neighborhood averages stay the same or, in almost all cases, decrease when we include the grades predicted from the reported scores.

Hints:

You should submit a file with only the standard comments at the top, the functions described above, and any helper functions you have written. The grading scripts will then import the file for testing. If your file includes code outside of functions, either comment the code out before submitting or use a main function that is conditionally executed (see Think CS: Section 6.8 for details).
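The conditionally executed main function mentioned above usually takes the following shape. This is a generic sketch; the body of main is a placeholder for whatever local checks you want to run.

```python
def main():
    """Ad hoc checks that run only when the file is executed directly."""
    print("running local checks")
    return "ran"

# When the file is imported (as the grading scripts do), __name__ is the
# module name rather than "__main__", so main() is not called.
if __name__ == "__main__":
    main()
```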
Restaurant inspection data can be found at: NYC OpenData.
Some datasets for testing:

restaurants1Aug21.csv
restaurants30July.csv

Neighborhood Tabulation Areas designate neighborhoods in New York City. The complete NTA file is:

nynta.csv

You may get a warning such as: sys:1: DtypeWarning: Columns (39) have mixed types.Specify dtype option on import or set low_memory=False. when reading in the restaurant inspection data. Pandas tries to infer the data type (dtype) of each column from its values. Since some columns are a mixture of numeric and character types, this can be difficult. If the file is read in with pd.read_csv(file_name, low_memory=False), the entire column is read in and used to determine the type.
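Both remedies named in the warning can be sketched on a tiny in-memory CSV. The data here is made up and too small to trigger the warning itself; the point is only to show where each keyword argument goes.

```python
import io
import pandas as pd

csv_text = "CAMIS,SCORE\n1,4\n2,31\n"   # made-up two-row sample

# Option 1: read the whole file before inferring column types.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the type of a troublesome column explicitly.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={'CAMIS': str})

print(df_whole.dtypes)
print(df_typed.dtypes)
```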

Most aggregation functions have the option to ignore non-numeric data in the calculation. For example, to average only the numeric columns after a groupby, pass the keyword argument numeric_only=True to mean().
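As a small illustration of that option, the made-up table below has one text column (DBA) and one numeric column (SCORE); with numeric_only=True only SCORE is averaged.

```python
import pandas as pd

toy = pd.DataFrame({
    'NTA':   ['BK09', 'BK09', 'BK28'],
    'DBA':   ['A CAFE', 'B CAFE', 'C CAFE'],   # text column, cannot be averaged
    'SCORE': [4.0, 10.0, 16.0],
})

# numeric_only=True leaves DBA out of the result instead of failing on it.
means = toy.groupby('NTA').mean(numeric_only=True)
print(means)
```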
