Program 8: Ticket Predictor. Due 10am, Wednesday, 22 March.
Can you predict which vehicles will get an excessive number of tickets? In Lectures #7 and 8 and Chapter 19, we focused on building classifiers. For this program, we will use the parking data from OpenData NYC. While most vehicles have one or two tickets, some vehicles have a huge number, averaging more than one a day. We will build models to predict which vehicles are most likely to have an excessive number of tickets. The data is first cleaned to use standardized spellings of color names as well as one of three vehicle classes. The first functions add indicators for the specified categorical features. Subsequent functions train a classifier on the data and return the classifier, as well as its accuracy (score) on the test data.
The first set of functions focuses on cleaning the data. The first two are functions that can be applied to a column of a DataFrame or Series; the second two add additional columns to be used for the analysis:
Learning Objective: to employ aggregation and data cleaning techniques and to fit and evaluate classification models.
Available Libraries: pandas, numpy, sklearn, and core Python 3.6+. The available modules of scikit-learn are:
sklearn.model_selection, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes, sklearn.svm, and sklearn.metrics.
Data Sources: Parking Violations Data from OpenDataNYC.
Sample Datasets: Parking_Violations_Issued_Precinct_19_2021.csv
After applying these functions, the resulting DataFrame can then be used to build a classifier predicting how likely a particular car is to be one that has more than a ticket a day.

clean_reg(reg):
This function takes one input parameter:
reg: a string containing the registration status of the vehicle.
If reg is coded as passenger ('PAS') or commercial ('COM'), return those values. Otherwise, return 'OTHER'.
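The assignment leaves the implementation to you; a minimal sketch of clean_reg, following the spec above, might look like:

```python
def clean_reg(reg):
    """Keep 'PAS' and 'COM' registrations; map everything else to 'OTHER'."""
    if reg in ('PAS', 'COM'):
        return reg
    return 'OTHER'
```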
clean_color(col):
This function takes one input parameter:
col: a string containing the color of the vehicle.
Return the following for the values of col:
'GRAY': for GY, GRAY, GREY, SILVE, SIL, SL,
'WHITE': for WH, WHITE,
'BLACK': for BK, BLACK, BL,
'BLUE': for BLUE,
'RED': for RED, RD,
'GREEN': for GR, GREEN,
'BROWN': for BROWN, TAN,
and 'OTHER' for all remaining values.
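One way to express this mapping is a dictionary lookup with a default; a sketch consistent with the spec above:

```python
# Map raw color codes to standardized names; anything unlisted becomes 'OTHER'.
COLOR_MAP = {
    'GY': 'GRAY', 'GRAY': 'GRAY', 'GREY': 'GRAY',
    'SILVE': 'GRAY', 'SIL': 'GRAY', 'SL': 'GRAY',
    'WH': 'WHITE', 'WHITE': 'WHITE',
    'BK': 'BLACK', 'BLACK': 'BLACK', 'BL': 'BLACK',
    'BLUE': 'BLUE',
    'RED': 'RED', 'RD': 'RED',
    'GR': 'GREEN', 'GREEN': 'GREEN',
    'BROWN': 'BROWN', 'TAN': 'BROWN',
}

def clean_color(col):
    """Return the standardized color name for col, or 'OTHER' if unrecognized."""
    return COLOR_MAP.get(col, 'OTHER')
```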
add_indicators(df, cols=['Registration', 'Color', 'State']):
This function has two inputs and returns a DataFrame:
df: a DataFrame that includes the columns specified in cols.
cols: a list of names of columns in the DataFrame. It has a default value of ['Registration', 'Color', 'State'].
Returns the DataFrame with additional indicator columns generated by get_dummies for the specified columns. The drop_first flag is set to True to drop extraneous columns.
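Since the spec names get_dummies and drop_first directly, this one can be a thin wrapper around pandas; a sketch:

```python
import pandas as pd

def add_indicators(df, cols=['Registration', 'Color', 'State']):
    """Replace each column in cols with indicator (dummy) columns,
    dropping the first level of each to avoid redundancy."""
    return pd.get_dummies(df, columns=cols, drop_first=True)
```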
add_excessive_flag(df, threshold=5):
This function has two inputs and returns a DataFrame:
df: a DataFrame that includes a column, Tickets, with the number of tickets for each vehicle.
threshold: a numeric value. The default value is 5.
Returns the DataFrame with a new column, Excessive Tickets, which is 0 if the number of Tickets is less than threshold and 1 otherwise.
The second set of functions focus on training and testing models to classify which vehicles get excessive numbers of tickets:
split_data(df, x_cols, y_col, test_size = 0.25, random_state = 2023):
This function takes 5 input parameters:
df: a DataFrame containing the columns listed in x_cols and y_col.
x_cols: a list of the names of the columns of the independent variables.
y_col: the name of the column of the dependent variable.
test_size: accepts a float between 0 and 1 and represents the proportion of the data set to use for testing. This parameter has a default value of 0.25.
random_state: used as a seed to the randomization. This parameter has a default value of 2023.
Returns x_train, x_test, y_train, and y_test, where the columns in x_cols are the "x" columns and the column y_col is the "y" column.
fit_model(x_train, y_train, model_type='logreg'):
This function takes three input parameters:
x_train: the independent variable(s) for the analysis.
y_train: the dependent variable for the analysis.
model_type: the type of model to use. Possible values are 'logreg', 'svm', 'nbayes', and 'rforest'. See below for the specified parameters for each model. The default value for this parameter is 'logreg'.
Returns a model fit to the x_train and y_train data, using sklearn. Additional notes for each model:
logreg: Logistic Regression: use the logistic regression classifier to set up the model, sklearn.linear_model.LogisticRegression, with solver solver='saga', regularization penalty='l2', and max iterations max_iter=5000. (Note that it's the letter L in 'l2', not a 1.)
nbayes: Naive Bayes: use the Gaussian Naive Bayes classifier to set up the model, sklearn.naive_bayes.GaussianNB.
svm: Support Vector Machine: use the SVM classifier to set up the model, sklearn.svm.SVC, with the radial basis function kernel (kernel='rbf').
rforest: Random Forest: use the random forest classifier to set up the model, sklearn.ensemble.RandomForestClassifier, with 100 estimators and the random state set to 0 (i.e. n_estimators=100, random_state=0).
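The four cases above can be organized as a small dispatch table; a sketch of fit_model along these lines:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def fit_model(x_train, y_train, model_type='logreg'):
    """Instantiate the requested classifier with the specified
    parameters and fit it to the training data."""
    classifiers = {
        'logreg': LogisticRegression(solver='saga', penalty='l2', max_iter=5000),
        'nbayes': GaussianNB(),
        'svm': SVC(kernel='rbf'),
        'rforest': RandomForestClassifier(n_estimators=100, random_state=0),
    }
    mod = classifiers[model_type]
    mod.fit(x_train, y_train)
    return mod
```

Since compare_models below takes a list of pickled models, you may need to wrap the return value in pickle.dumps(mod), depending on how your grading scripts expect the model to be passed around.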
score_model(mod_pkl, xes, yes):
This function takes three input parameters:
mod_pkl: an object serialization (pickle) of a trained model. The possible model approaches are logistic regression, support vector machine, naive Bayes, and random forest.
xes: the independent variable(s) for the analysis, with the same dimensions as those on which the model was trained.
yes: the dependent variable(s) for the analysis, with the same dimensions as those on which the model was trained.
Returns the accuracy (score) of the model on the test data.
compare_models(x_test, y_test, models):
This function has three inputs:
x_test: a numpy array that includes rows of equal size flattened arrays.
y_test: a numpy array that takes values 0 or 1, corresponding to the rows of x_test.
models: a list of pickled models constructed from fit_model.
The function unpickles each of the models and calls the score() function of each model on x_test and y_test. The function returns the index of the model with the highest accuracy score and its accuracy score (i.e. 0 if it is the first model in the list, 1 if it is the second model in the list, etc.). In case of ties for the best score, return the first one that has that value.

Let's use our functions to clean the data and build some classifiers. As a first step, we will group by license plate number and aggregate the state, vehicle type and color by choosing the first item stored for each:
df = pd.read_csv('Parking_Q1_2021_Lexington.csv')
#Focus on features about vehicles:
df = df[['Plate ID','Plate Type','Registration State','Issue Date','Vehicle Color']]
#Drop rows that are missing info:
df = df.dropna()
print(f'Your file contains {len(df)} parking violations.')
df['Plate Type'] = df['Plate Type'].apply(clean_reg)
df['Vehicle Color'] = df['Vehicle Color'].apply(clean_color)
#Count tickets for each vehicle:
newDF = df.groupby('Plate ID').agg(NumTickets =
pd.NamedAgg(column = 'Plate ID', aggfunc = 'count'),
Registration = pd.NamedAgg(column = 'Plate Type', aggfunc = 'first'),
State = pd.NamedAgg(column = 'Registration State', aggfunc = 'first'),
Color = pd.NamedAgg(column = 'Vehicle Color', aggfunc = 'first')
)
print(newDF)
which will print:
Your file contains 20589 parking violations.
NumTickets Registration State Color
Plate ID
00356R2 1 PAS TX WHITE
004LSM 1 PAS TN OTHER
00574R7 1 PAS TX WHITE
0064NQD 1 PAS DP BLACK
0107NQD 1 PAS DP GRAY
... ... ... ... ...
ZRB1864 1 PAS PA WHITE
ZSA6859 1 PAS PA GRAY
ZSE1922 1 PAS PA WHITE
ZWF62E 1 PAS NJ OTHER
ZWZ35J 1 PAS NJ OTHER
We can then, using the cleaned data, add indicator columns for the categorical data:
newDF = add_indicators(newDF)
print(newDF)
will add the indicator variables:
NumTickets Registration_OTHER ... State_WI State_WV
Plate ID ...
00356R2 1 0 ... 0 0
004LSM 1 0 ... 0 0
00574R7 1 0 ... 0 0
0064NQD 1 0 ... 0 0
0107NQD 1 0 ... 0 0
... ... ... ... ... ...
ZRB1864 1 0 ... 0 0
ZSA6859 1 0 ... 0 0
ZSE1922 1 0 ... 0 0
ZWF62E 1 0 ... 0 0
ZWZ35J 1 0 ... 0 0
While this works well for State, the original data contains many different registration types, as well as abbreviations and misspellings for colors. Printing the unique values of the columns before clean_reg and clean_color are applied:
print(f"Registration: {df['Plate Type'].unique()}")
print(f"State: {df['Registration State'].unique()}")
print(f"VehicleColor: {df['Vehicle Color'].unique()}")
prints the expected values for states, but a mess of values for registrations and colors:
Registration: ['PAS' 'COM' 'USC' 'MOT' 'LMB' '999' 'CMB' 'RGL' 'SRF' 'MED' 'APP' 'ORG'
'ITP' 'OMR' 'TRA' 'BOB' 'SPO' 'LMA' 'VAS' 'OML' 'TOW' 'DLR' 'AMB' 'TRC'
'STG' 'AGR' 'NLM' 'ORC' 'IRP' 'TRL' 'MCL' 'OMT' 'SCL' 'SPC' 'CHC' 'HIS'
'SRN' 'RGC' 'PHS' 'PSD' 'MCD' 'NYA' 'JCA' 'SOS' 'CSP' 'OMS' 'CBS' 'OMV'
'HAM']
State: ['DP' 'NJ' 'PA' 'TX' 'OK' 'NY' 'OH' '99' 'DC' 'AR' 'IL' 'MN' 'NC' 'NV'
'FL' 'GV' 'CA' 'NH' 'MD' 'CT' 'MO' 'RI' 'MS' 'MA' 'MI' 'TN' 'WV' 'AL'
'OR' 'KS' 'VA' 'KY' 'AZ' 'WA' 'NM' 'CO' 'SC' 'WI' 'ME' 'DE' 'HI' 'IN'
'WY' 'MT' 'NE' 'VT' 'GA' 'LA' 'SD' 'ON' 'IA' 'ID' 'ND' 'SK' 'UT' 'AK'
'QB' 'AB' 'BC' 'MX' 'PR' 'NS' 'MB' 'FO']
VehicleColor: ['BLACK' 'SILVE' 'GREY' 'WHITE' 'RED' 'OTHER' 'BLUE' 'GY' 'BLK' 'BK'
'PURPL' 'TAN' 'GREEN' 'YELLO' 'ORANG' 'BL' 'SILV' 'GRAY' 'BROWN' nan
'GRY' 'WH' 'SIL' 'GOLD' 'WT' 'WHT' 'GR' 'RD' 'YW' 'BR' 'LTG' 'WH/' 'OR'
'WHB' 'TN' 'BRN' 'MR' 'DK/' 'BLW' 'GL' 'PR' 'BU' 'DKB' 'W' 'GRT' 'ORG'
'RD/' 'LT/' 'NO' 'LTT' 'GRN' 'BN' 'TB' 'BRO' 'B' 'RDW' 'SL' 'BURG' 'BLU'
'NOC' 'BK/' 'DKG' 'WHG' 'PINK' 'G' 'LAVEN' 'BL/' 'YEL' 'OG' 'GRW' 'WHI'
'WHTE' 'BUR' 'GY/' 'DKR' 'RDT' 'GN' 'BUN' 'SV' 'BKG' 'YELL' 'WHIT' 'GR/'
'LTTN' 'SLV' 'BRWN' 'GYB' 'WHTIE' 'WI' 'BUS' 'LTB' 'TN/' 'GD' 'MAROO'
'BW' 'BLG' 'ORA' 'GRA' 'DKP' 'NAVY' 'GREG' 'GRB' 'BRW' 'BBRN' 'R' 'GRRY'
'BLA' 'BG' 'MAR' 'BURGA' 'BRWON' 'YLW' 'ORNG' 'HREY' 'DERD' 'YL' 'PLE'
'BWN' 'BI']
The first two registration types are by far the most common:
count = len(newDF)
pasCount = len(newDF[newDF['Registration'] == 'PAS'])
comCount = len(newDF[newDF['Registration'] == 'COM'])
print(f'{count} different vehicles, {100*(pasCount+comCount)/count} percent are passenger or commercial plates.')
And for the Precinct 19 dataset, which contains almost a half million tickets:
159928 different vehicles, 93.95477965084288 percent are passenger or commercial plates.
And similarly, 15 of the raw values for vehicle color account for most of the entries:
print(f"The top 15 values account for {100*df['Vehicle Color'].value_counts()[:15].sum()/len(df)} percent.")
print(f"Those values are: {df['Vehicle Color'].value_counts()[:15]}.")
which prints:
The top 15 values account for 95.37291781301586 percent.
Those values are:
WH 27814
GY 24704
WHITE 20817
BK 20778
BLACK 14486
GREY 9629
BL 9249
SILVE 5704
BLUE 5300
RD 4395
RED 3303
OTHER 2678
GR 1674
BROWN 1059
TN 938
Let's also add a column that flags whether a vehicle has an excessive number of tickets; here we show the result for 20 randomly selected rows:
df = add_excessive_flag(df, threshold = 4)
print(df)
which will print the DataFrame with a new column that is 1 if the number of tickets is at least the threshold of 4:
Plate ID Tickets Registration State Color Excessive Tickets
10390 L30JWD 1 PAS NJ WHITE 0
3144 80321MM 2 COM NY OTHER 0
74 11153MA 1 COM NY WHITE 0
11912 XHYY18 5 PAS NJ WHITE 1
7705 JBW5667 2 PAS NY OTHER 0
11451 W3F679 1 PAS NY GRAY 0
11764 XGEY82 2 PAS NJ WHITE 0
5480 FGA9371 1 PAS NY BLACK 0
9958 KHBJ82 1 PAS FL OTHER 0
9976 KHF3318 1 PAS NY OTHER 0
7519 HZZ6700 1 PAS NY GRAY 0
1473 40822ML 1 COM NY WHITE 0
2098 58962JT 1 COM NY WHITE 0
9379 KDR4765 3 PAS NY BLACK 0
9254 K44ECD 1 PAS NJ OTHER 0
2066 57819MM 1 COM NY WHITE 0
11957 XJJH31 1 PAS NJ WHITE 0
11437 W110304 1 PAS NJ WHITE 0
4969 DDF6091 1 PAS NY GRAY 0
6148 GUR4196 1 PAS NY BLACK 0