Program 8: Ticket Predictor. Due 10am, Wednesday, 22 March.
Can you predict which vehicles will get an excessive number of tickets? In Lectures #7 and 8 and Chapter 19, we focused on building classifiers. For this program, we will use the parking data from OpenData NYC. While most vehicles have one or two tickets, some vehicles have a huge number, averaging more than one a day. We will build models to predict which vehicles are most likely to have an excessive number of tickets. The data is first cleaned to use standardized spellings of color names as well as one of three vehicle classes. The first functions add indicators for the specified categorical features. Subsequent functions train a classifier on the data and return the classifier, as well as its accuracy (score) on the test data.
The first set of functions focuses on cleaning the data. The first two are functions that can be applied to a column of a DataFrame or Series; the second two add additional columns to be used for the analysis:
Learning Objective: to employ aggregation and data cleaning techniques and to fit and evaluate classification models.
Available Libraries: pandas, numpy, sklearn, and core Python 3.6+. The available modules of scikit-learn are:
sklearn.model_selection, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes, sklearn.svm, and sklearn.metrics.
Data Sources: Parking Violations Data from OpenDataNYC.
Sample Datasets: Parking_Violations_Issued_Precinct_19_2021.csv
After applying these functions, the resulting DataFrame can then be used to build a classifier predicting how likely a particular car is to be one that has more than a ticket a day.

clean_reg(reg):
This function takes one input parameter:
reg: a string containing the registration status of the vehicle.
If reg is coded as passenger ('PAS') or commercial ('COM'), return those values. Otherwise, return 'OTHER'.
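The assignment leaves the implementation to you; a minimal sketch of clean_reg, following the spec above, might look like:

```python
def clean_reg(reg):
    """Keep 'PAS' and 'COM' registrations; map everything else to 'OTHER'."""
    if reg in ('PAS', 'COM'):
        return reg
    return 'OTHER'
```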
clean_color(col):
This function takes one input parameter:
col: a string containing the color of the vehicle.
Return the following for the values of col:
'GRAY': for GY, GRAY, GREY, SILVE, SIL, SL,
'WHITE': for WH, WHITE,
'BLACK': for BK, BLACK, BL,
'BLUE': for BLUE,
'RED': for RED, RD,
'GREEN': for GR, GREEN,
'BROWN': for BROWN, TAN,
and 'OTHER' for all remaining values.
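One way to express this mapping is a dictionary lookup with a default; a sketch consistent with the spec above:

```python
# Map raw color codes to standardized names; anything unlisted becomes 'OTHER'.
COLOR_MAP = {
    'GY': 'GRAY', 'GRAY': 'GRAY', 'GREY': 'GRAY',
    'SILVE': 'GRAY', 'SIL': 'GRAY', 'SL': 'GRAY',
    'WH': 'WHITE', 'WHITE': 'WHITE',
    'BK': 'BLACK', 'BLACK': 'BLACK', 'BL': 'BLACK',
    'BLUE': 'BLUE',
    'RED': 'RED', 'RD': 'RED',
    'GR': 'GREEN', 'GREEN': 'GREEN',
    'BROWN': 'BROWN', 'TAN': 'BROWN',
}

def clean_color(col):
    """Return the standardized color name for col, or 'OTHER' if unrecognized."""
    return COLOR_MAP.get(col, 'OTHER')
```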
add_indicators(df, cols=['Registration', 'Color', 'State']):
This function has two inputs and returns a DataFrame:
df: a DataFrame that includes the columns specified in cols.
cols: a list of names of columns in the DataFrame. It has a default value of ['Registration', 'Color', 'State'].
Returns the DataFrame with additional indicator columns generated by get_dummies for the specified columns. The drop_first flag is set to True to drop extraneous columns.
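Since the spec names get_dummies and drop_first directly, this one can be a thin wrapper around pandas; a sketch:

```python
import pandas as pd

def add_indicators(df, cols=['Registration', 'Color', 'State']):
    """Replace each column in cols with indicator (dummy) columns,
    dropping the first level of each to avoid redundancy."""
    return pd.get_dummies(df, columns=cols, drop_first=True)
```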
add_excessive_flag(df, threshold=5):
This function has two inputs and returns a DataFrame:
df: a DataFrame that includes a column, Tickets, with the number of tickets for each vehicle.
threshold: a numeric value. The default value is 5.
Returns the DataFrame with a new column, Excessive Tickets, which is 0 if the number of Tickets is less than threshold and 1 otherwise.
The second set of functions focus on training and testing models to classify which vehicles get excessive numbers of tickets:
split_data(df, x_cols, y_col, test_size = 0.25, random_state = 2023):
This function takes 5 input parameters:
df: a DataFrame containing the columns listed in x_cols and y_col.
x_cols: a list of the names of the columns of the independent variables.
y_col: the name of the column of the dependent variable.
test_size: accepts a float between 0 and 1 and represents the proportion of the data set to use for testing. This parameter has a default value of 0.25.
random_state: used as a seed to the randomization. This parameter has a default value of 2023.
Returns x_train, x_test, y_train, and y_test, where the columns in x_cols are the "x" columns and the column y_col is the "y" column.
fit_model(x_train, y_train, model_type='logreg'):
This function takes three input parameters:
x_train: the independent variable(s) for the analysis.
y_train: the dependent variable for the analysis.
model_type: the type of model to use. Possible values are 'logreg', 'svm', 'nbayes', and 'rforest'. See below for the specified parameters for each model. The default value for this parameter is 'logreg'.
Returns a model fit to the x_train and y_train data, using sklearn. Additional notes for each model:
logreg: Logistic Regression: use the logistic regression classifier to set up the model, sklearn.linear_model.LogisticRegression, with solver solver='saga', regularization penalty='l2', and max iterations max_iter=5000. (Note that it's the letter L in 'l2', not a 1.)
nbayes: Naive Bayes: use the Gaussian Naive Bayes classifier to set up the model, sklearn.naive_bayes.GaussianNB.
svm: Support Vector Machine: use the SVM classifier to set up the model, sklearn.svm.SVC, with the radial basis function kernel (kernel='rbf').
rforest: Random Forest: use the random forest classifier to set up the model, sklearn.ensemble.RandomForestClassifier, with 100 estimators and the random state set to 0 (i.e. n_estimators=100, random_state=0).
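The four cases above can be organized as a small dispatch table; a sketch of fit_model along these lines:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def fit_model(x_train, y_train, model_type='logreg'):
    """Instantiate the requested classifier with the specified
    parameters and fit it to the training data."""
    classifiers = {
        'logreg': LogisticRegression(solver='saga', penalty='l2', max_iter=5000),
        'nbayes': GaussianNB(),
        'svm': SVC(kernel='rbf'),
        'rforest': RandomForestClassifier(n_estimators=100, random_state=0),
    }
    mod = classifiers[model_type]
    mod.fit(x_train, y_train)
    return mod
```

Since compare_models below takes a list of pickled models, you may need to wrap the return value in pickle.dumps(mod), depending on how your grading scripts expect the model to be passed around.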
score_model(mod_pkl, xes, yes):
This function takes three input parameters:
mod_pkl: an object serialization (pickle) of a trained model. The possible model approaches are logistic regression, support vector machine, naive Bayes, and random forest.
xes: the independent variable(s) for the analysis, with the same dimensions as those on which the model was trained.
yes: the dependent variable(s) for the analysis, with the same dimensions as those on which the model was trained.
Returns the accuracy (score) of the model on the test data.
compare_models(x_test, y_test, models):
This function has three inputs:
x_test: a numpy array that includes rows of equal size flattened arrays.
y_test: a numpy array that takes values 0 or 1, corresponding to the rows of x_test.
models: a list of pickled models constructed from fit_model.
The function unpickles each of the models and calls the score() function of each model on x_test and y_test. The function returns the index of the model with the highest accuracy score and its accuracy score (i.e. 0 if it is the first model in the list, 1 if it is the second model in the list, etc.). In case of ties for the best score, return the first one that has that value.

Let's use our functions to clean the data and build some classifiers. As a first step, we will group by license plate number and aggregate the state, vehicle type and color by choosing the first item stored for each:
df = pd.read_csv('Parking_Q1_2021_Lexington.csv')
#Focus on features about vehicles:
df = df[['Plate ID','Plate Type','Registration State','Issue Date','Vehicle Color']]
#Drop rows that are missing info:
df = df.dropna()
print(f'Your file contains {len(df)} parking violations.')
df['Plate Type'] = df['Plate Type'].apply(clean_reg)
df['Vehicle Color'] = df['Vehicle Color'].apply(clean_color)
#Count tickets for each vehicle:
newDF = df.groupby('Plate ID').agg(NumTickets =
pd.NamedAgg(column = 'Plate ID', aggfunc = 'count'),
Registration = pd.NamedAgg(column = 'Plate Type', aggfunc = 'first'),
State = pd.NamedAgg(column = 'Registration State', aggfunc = 'first'),
Color = pd.NamedAgg(column = 'Vehicle Color', aggfunc = 'first')
)
print(newDF)
which will print:
Your file contains 20589 parking violations.
NumTickets Registration State Color
Plate ID
00356R2 1 PAS TX WHITE
004LSM 1 PAS TN OTHER
00574R7 1 PAS TX WHITE
0064NQD 1 PAS DP BLACK
0107NQD 1 PAS DP GRAY
... ... ... ... ...
ZRB1864 1 PAS PA WHITE
ZSA6859 1 PAS PA GRAY
ZSE1922 1 PAS PA WHITE
ZWF62E 1 PAS NJ OTHER
ZWZ35J 1 PAS NJ OTHER
We can then, using the cleaned data, add indicator columns for the categorical data:
newDF = add_indicators(newDF)
print(newDF)
will add the indicator variables:
NumTickets Registration_OTHER ... State_WI State_WV
Plate ID ...
00356R2 1 0 ... 0 0
004LSM 1 0 ... 0 0
00574R7 1 0 ... 0 0
0064NQD 1 0 ... 0 0
0107NQD 1 0 ... 0 0
... ... ... ... ... ...
ZRB1864 1 0 ... 0 0
ZSA6859 1 0 ... 0 0
ZSE1922 1 0 ... 0 0
ZWF62E 1 0 ... 0 0
ZWZ35J 1 0 ... 0 0
While this works well for State, the original data contains many different registration types, as well as abbreviations and misspellings for colors. Printing the unique values of the columns before clean_reg and clean_color are applied:
print(f"Registration: {df['Plate Type'].unique()}")
print(f"State: {df['Registration State'].unique()}")
print(f"VehicleColor: {df['Vehicle Color'].unique()}")
prints the expected values for states, but a mess of values for registrations and colors:
Registration: ['PAS' 'COM' 'USC' 'MOT' 'LMB' '999' 'CMB' 'RGL' 'SRF' 'MED' 'APP' 'ORG'
'ITP' 'OMR' 'TRA' 'BOB' 'SPO' 'LMA' 'VAS' 'OML' 'TOW' 'DLR' 'AMB' 'TRC'
'STG' 'AGR' 'NLM' 'ORC' 'IRP' 'TRL' 'MCL' 'OMT' 'SCL' 'SPC' 'CHC' 'HIS'
'SRN' 'RGC' 'PHS' 'PSD' 'MCD' 'NYA' 'JCA' 'SOS' 'CSP' 'OMS' 'CBS' 'OMV'
'HAM']
State: ['DP' 'NJ' 'PA' 'TX' 'OK' 'NY' 'OH' '99' 'DC' 'AR' 'IL' 'MN' 'NC' 'NV'
'FL' 'GV' 'CA' 'NH' 'MD' 'CT' 'MO' 'RI' 'MS' 'MA' 'MI' 'TN' 'WV' 'AL'
'OR' 'KS' 'VA' 'KY' 'AZ' 'WA' 'NM' 'CO' 'SC' 'WI' 'ME' 'DE' 'HI' 'IN'
'WY' 'MT' 'NE' 'VT' 'GA' 'LA' 'SD' 'ON' 'IA' 'ID' 'ND' 'SK' 'UT' 'AK'
'QB' 'AB' 'BC' 'MX' 'PR' 'NS' 'MB' 'FO']
VehicleColor: ['BLACK' 'SILVE' 'GREY' 'WHITE' 'RED' 'OTHER' 'BLUE' 'GY' 'BLK' 'BK'
'PURPL' 'TAN' 'GREEN' 'YELLO' 'ORANG' 'BL' 'SILV' 'GRAY' 'BROWN' nan
'GRY' 'WH' 'SIL' 'GOLD' 'WT' 'WHT' 'GR' 'RD' 'YW' 'BR' 'LTG' 'WH/' 'OR'
'WHB' 'TN' 'BRN' 'MR' 'DK/' 'BLW' 'GL' 'PR' 'BU' 'DKB' 'W' 'GRT' 'ORG'
'RD/' 'LT/' 'NO' 'LTT' 'GRN' 'BN' 'TB' 'BRO' 'B' 'RDW' 'SL' 'BURG' 'BLU'
'NOC' 'BK/' 'DKG' 'WHG' 'PINK' 'G' 'LAVEN' 'BL/' 'YEL' 'OG' 'GRW' 'WHI'
'WHTE' 'BUR' 'GY/' 'DKR' 'RDT' 'GN' 'BUN' 'SV' 'BKG' 'YELL' 'WHIT' 'GR/'
'LTTN' 'SLV' 'BRWN' 'GYB' 'WHTIE' 'WI' 'BUS' 'LTB' 'TN/' 'GD' 'MAROO'
'BW' 'BLG' 'ORA' 'GRA' 'DKP' 'NAVY' 'GREG' 'GRB' 'BRW' 'BBRN' 'R' 'GRRY'
'BLA' 'BG' 'MAR' 'BURGA' 'BRWON' 'YLW' 'ORNG' 'HREY' 'DERD' 'YL' 'PLE'
'BWN' 'BI']
The first two registration types are by far the most common:
count = len(newDF)
pasCount = len(newDF[newDF['Registration'] == 'PAS'])
comCount = len(newDF[newDF['Registration'] == 'COM'])
print(f'{count} different vehicles, {100*(pasCount+comCount)/count} percent are passenger or commercial plates.')
And for the Precinct 19 dataset, which contains almost a half million tickets:
159928 different vehicles, 93.95477965084288 percent are passenger or commercial plates.
And similarly, 15 of the raw values for vehicle color account for most of the entries:
print(f"The top 15 values account for {100*df['Vehicle Color'].value_counts()[:15].sum()/len(df)} percent.")
print(f"Those values are: {df['Vehicle Color'].value_counts()[:15]}.")
which prints:
The top 15 values account for 95.37291781301586 percent.
Those values are:
WH 27814
GY 24704
WHITE 20817
BK 20778
BLACK 14486
GREY 9629
BL 9249
SILVE 5704
BLUE 5300
RD 4395
RED 3303
OTHER 2678
GR 1674
BROWN 1059
TN 938
Let's also add a column that flags whether a vehicle has an excessive number of tickets; here we show the result for 20 randomly selected rows:
df = add_excessive_flag(df, threshold = 4)
print(df)
which will print the DataFrame with a new column that is 1 if the number of tickets is at least the threshold of 4:
Plate ID Tickets Registration State Color Excessive Tickets
10390 L30JWD 1 PAS NJ WHITE 0
3144 80321MM 2 COM NY OTHER 0
74 11153MA 1 COM NY WHITE 0
11912 XHYY18 5 PAS NJ WHITE 1
7705 JBW5667 2 PAS NY OTHER 0
11451 W3F679 1 PAS NY GRAY 0
11764 XGEY82 2 PAS NJ WHITE 0
5480 FGA9371 1 PAS NY BLACK 0
9958 KHBJ82 1 PAS FL OTHER 0
9976 KHF3318 1 PAS NY OTHER 0
7519 HZZ6700 1 PAS NY GRAY 0
1473 40822ML 1 COM NY WHITE 0
2098 58962JT 1 COM NY WHITE 0
9379 KDR4765 3 PAS NY BLACK 0
9254 K44ECD 1 PAS NJ OTHER 0
2066 57819MM 1 COM NY WHITE 0
11957 XJJH31 1 PAS NJ WHITE 0
11437 W110304 1 PAS NJ WHITE 0
4969 DDF6091 1 PAS NY GRAY 0
6148 GUR4196 1 PAS NY BLACK 0