Program 8: Ticket Predictor.   Due 10am, Wednesday, 22 March.
 Can you predict which vehicles will get excessive number of tickets?  In Lectures #7 and 8 and Chapter 19, we focused on the building of the classifers.  For this program, we will use the parking data from OpenData NYC.  While most vehicles have one or two tickets, some vehicles have a huge number, averaging to more than one day.  We will build models to predict which vehicles are most likely to have an excessive number of tickets.  The data is first cleaned to use standardized spellings of color names as well as one of three vehicle classes.  The first function adds indicators for the specified categorical featuers.  Subsequent functions trains a classifier on the data and returns the accuracy (score) of your classifier on the test data, as well as the classifier.
 The first set of functions focus on cleaning the data.  To clean the data, the first two functions that can be applied to a column of a DataFrame or Series.  The second two add additional columns to be used for the analysis:
Learning Objective: to employ aggregation and data cleaning techniques and to fit and evaluate classification models.
Available Libraries: pandas, numpy, sklearn, and core Python 3.6+. The available packages used for scikit-learn: 
sklearn.model_selection, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes, sklearn.svm, and sklearn.metrics.
Data Sources:  Parking Violations Data from OpenDataNYC.
Sample Datasets: Parking_Violations_Issued_Precinct_19_2021.csv
  
After applying these functions, the resulting DataFrame can then be used to build a classifer on how likely a particular car is to be one that has more than a ticket a day.clean_reg(reg): This function takes one input parameter:
    
      
Ifreg: a string containing the registration status of the vehicle.reg is coded as passenger 'PAS' or commercial
  'COM', return those values.  Otherwise, return 'OTHER'.
  clean_color(col): This function takes one input parameter: 
    
      
Return the following for the values ofcol: a string containing the color of the vehicle.col:
  
    
'GRAY': for GY,
      GRAY,
      GREY,SILVE,
        SIL, SL,
    'WHITE': for WH,
      WHITE,
    'BLACK': for BK,
      BLACK, BL,
    'BLUE': for BLUE,
    'RED': for RED,
        RD,
    'GREEN': for GR,
        GREEN,
    'BROWN': for BROWN,
        TAN,
    'OTHER'.
  
  add_indicators(df,cols=['Registration', 'Color', 'State'])::
    This function has two inputs and returns a DataFrame:
    
        
Returns the DataFrame with an additional indicator columns generated bydf: a DataFrame that
                including the columns specified in cols.
        col: a list of names of columns in the DataFrameIt has a default value of ['Registration', 'Color', 'State'].
    get_dummies for specified columns.  The drop_first flag is set to True to drop extraneous columns.
  
  add_excessive_flag(df, threshold=5)::
    This function has two inputs and returns a DataFrame:
    
        
Returns the DataFrame with a new column,df: a DataFrame that
                including the columns specified in cols.
        threshold: a numeric value.  The default value is 5.
    Excessive Tickets which is 0 if there's less threshold number of Tickets and 1 otherwise.
  
    
The second set of functions focus on training and testing models to classify which vehicles get excessive numbers of tickets:
split_data(df, x_cols, y_col, test_size = 0.25, random_state = 2023):
    This function takes 5 input parameters:
    df: a DataFrame containing with a columns units.x_cols: a list of the names of the column of the independent variable.y_col: the name of the column of the dependent variable.test_size: accepts a float between 0 and 1 and represents the proportion of the data set to use for training.  This parameter has a default value of 0.25.random_state:  Used as a seed to the randomization.  This parameter has a default value of 1870.x_train,
    x_test,
    y_train, and
    y_test. where units is the "x" column and the input parameter, y_col_name is the "y" column.fit_model(x_train, y_train, model_type='logreg'):
  This function takes four input parameters:
  x_train: the independent variable(s) for the analysis.y_train: the dependent variable for the analysis.model_type: the type of model to use.  Possible values are 'logreg', 'svm', 'nbayes', and 'rforest'.  See below for the specified parameters for each model.  The default value for this parameter is 'logreg'.x_train and
  y_train data, using sklearn.  Additional notes for each model:
  logreg: Logistic Regression:  For logistic regression, use the SVM classifier to set up the model, sklearn.linear_model.LogisticRegression with solver solver = 'saga', regularization penalty='l2', and max iterations max_iter=5000 (Note that it's the letter L in 'l2', not a 1.).
    
    nbayes: Naive Bayes:  use the Gaussian Naive Bayes classifier to set up the model, sklearn.naive_bayes.GaussianNB.
      
    svm: Support Vector Machine:  use the SVM classifier to set up the model, sklearn.svm.SVC with the radial basis function kernel RBF kernel='rbf'.
      
    rforest: Random Forest: use the random forest classifier to set up the model, sklearn.ensemble.RandomForestClassifier with 100 estimators and the random state set to 0  (i.e. n_estimators=100, random_state=0).
  score_model(mod_pkl,xes,yes):
  This function takes three input parameters:
  mod_pkl: a object serialization of a trained model.  The possible model approaches are logistic regression, support vector machine, naive Bayes, and random forest.
      xes: the independent variable(s) for the analysis with the same dimensions as which the model was trained.
      yes: the dependent variable(s) for the analysis with the same dimensions as which the model was trained.
  compare_models(x_test, y_test, models):
  This function has three inputs:
  x_test: a numpy array that includes rows of equal size flattened arrays,
    y_test a numpy array that takes values 0 or 1 corresponding to the rows of x_test.
    models: a list of pickled models constructed from fit_model.  The 
  models, calls score() function of each model on x_test and y_test.  The function returns the index of the model with highest accuracy score and its accuracy score (i.e. 0 if it is the first model in the list, 1 if it is the second model in the list, etc).  In case of ties for the best score, return the first one that has that value.Let's use our functions to clean the data and build some classifiers. As a first step, we will group by licence plate number and aggregate the state, vehicle type and color by choosing the first item stored for each:
df = pd.read_csv('Parking_Q1_2021_Lexington.csv')
#Focus on features about vehicles:
df = df[['Plate ID','Plate Type','Registration State','Issue Date','Vehicle Color']]
#Drop rows that are missing info:
df = df.dropna()
print(f'Your file contains {len(df)} parking violations.')
df['Plate Type'] = df['Plate Type'].apply(clean_reg)
df['Vehicle Color'] = df['Vehicle Color'].apply(clean_color)
#Count tickets for each vehicle:
newDF =  df.groupby('Plate ID').agg(NumTickets =
    pd.NamedAgg(column = 'Plate ID', aggfunc = 'count'),
    Registration = pd.NamedAgg(column = 'Plate Type', aggfunc = 'first'),
    State = pd.NamedAgg(column = 'Registration State', aggfunc = 'first'),
    Color = pd.NamedAgg(column = 'Vehicle Color', aggfunc = 'first')
)
print(newDF)Your file contains 20589 parking violations.
          NumTickets Registration State  Color
Plate ID
00356R2            1          PAS    TX  WHITE
004LSM             1          PAS    TN  OTHER
00574R7            1          PAS    TX  WHITE
0064NQD            1          PAS    DP  BLACK
0107NQD            1          PAS    DP   GRAY
...              ...          ...   ...    ...
ZRB1864            1          PAS    PA  WHITE
ZSA6859            1          PAS    PA   GRAY
ZSE1922            1          PAS    PA  WHITE
ZWF62E             1          PAS    NJ  OTHER
ZWZ35J             1          PAS    NJ  OTHERWe can then, using the cleaned data, add indicator columns for the categorical data:
newDF = add_indicators(newDF)
print(newDF)          NumTickets  Registration_OTHER  ...  State_WI  State_WV
Plate ID                                  ...
00356R2            1                   0  ...         0         0
004LSM             1                   0  ...         0         0
00574R7            1                   0  ...         0         0
0064NQD            1                   0  ...         0         0
0107NQD            1                   0  ...         0         0
...              ...                 ...  ...       ...       ...
ZRB1864            1                   0  ...         0         0
ZSA6859            1                   0  ...         0         0
ZSE1922            1                   0  ...         0         0
ZWF62E             1                   0  ...         0         0
ZWZ35J             1                   0  ...         0         0While this works well for State,
Registration and Vehicle Color:
print(f'Registration: {newDF['Registration'].unique()})
print(f'State: {newDF['State'].unique()})
print(f'VehicleColor: {newDF['VehicleColor'].unique()})Registration: ['PAS' 'COM' 'USC' 'MOT' 'LMB' '999' 'CMB' 'RGL' 'SRF' 'MED' 'APP' 'ORG'
'ITP' 'OMR' 'TRA' 'BOB' 'SPO' 'LMA' 'VAS' 'OML' 'TOW' 'DLR' 'AMB' 'TRC'
'STG' 'AGR' 'NLM' 'ORC' 'IRP' 'TRL' 'MCL' 'OMT' 'SCL' 'SPC' 'CHC' 'HIS'
'SRN' 'RGC' 'PHS' 'PSD' 'MCD' 'NYA' 'JCA' 'SOS' 'CSP' 'OMS' 'CBS' 'OMV'
'HAM']
State: ['DP' 'NJ' 'PA' 'TX' 'OK' 'NY' 'OH' '99' 'DC' 'AR' 'IL' 'MN' 'NC' 'NV'
'FL' 'GV' 'CA' 'NH' 'MD' 'CT' 'MO' 'RI' 'MS' 'MA' 'MI' 'TN' 'WV' 'AL'
'OR' 'KS' 'VA' 'KY' 'AZ' 'WA' 'NM' 'CO' 'SC' 'WI' 'ME' 'DE' 'HI' 'IN'
'WY' 'MT' 'NE' 'VT' 'GA' 'LA' 'SD' 'ON' 'IA' 'ID' 'ND' 'SK' 'UT' 'AK'
'QB' 'AB' 'BC' 'MX' 'PR' 'NS' 'MB' 'FO']
VehicleColor: ['BLACK' 'SILVE' 'GREY' 'WHITE' 'RED' 'OTHER' 'BLUE' 'GY' 'BLK' 'BK'
'PURPL' 'TAN' 'GREEN' 'YELLO' 'ORANG' 'BL' 'SILV' 'GRAY' 'BROWN' nan
'GRY' 'WH' 'SIL' 'GOLD' 'WT' 'WHT' 'GR' 'RD' 'YW' 'BR' 'LTG' 'WH/' 'OR'
'WHB' 'TN' 'BRN' 'MR' 'DK/' 'BLW' 'GL' 'PR' 'BU' 'DKB' 'W' 'GRT' 'ORG'
'RD/' 'LT/' 'NO' 'LTT' 'GRN' 'BN' 'TB' 'BRO' 'B' 'RDW' 'SL' 'BURG' 'BLU'
'NOC' 'BK/' 'DKG' 'WHG' 'PINK' 'G' 'LAVEN' 'BL/' 'YEL' 'OG' 'GRW' 'WHI'
'WHTE' 'BUR' 'GY/' 'DKR' 'RDT' 'GN' 'BUN' 'SV' 'BKG' 'YELL' 'WHIT' 'GR/'
'LTTN' 'SLV' 'BRWN' 'GYB' 'WHTIE' 'WI' 'BUS' 'LTB' 'TN/' 'GD' 'MAROO'
'BW' 'BLG' 'ORA' 'GRA' 'DKP' 'NAVY' 'GREG' 'GRB' 'BRW' 'BBRN' 'R' 'GRRY'
'BLA' 'BG' 'MAR' 'BURGA' 'BRWON' 'YLW' 'ORNG' 'HREY' 'DERD' 'YL' 'PLE'
'BWN' 'BI']count = len(newDF)
pasCount = len(newDF[newDF['Registration'] == 'PAS'])
comCount = len(newDF[newDF['Registration'] == 'COM'])
print(f'{count} different vehicles, {100*(pasCount+comCount)/count} percent are passenger or commercial plates.')And for the Precinct District 19 dataset that contains almost a half million tickets:
159928 different vehicles, 93.95477965084288 percent are passenger or commercial plates.print(newDF['VehicleColor'].unique())
print(f"The top 15 values account for {100*newDF['VehicleColor'].value_counts()[:15].sum()/len(newDF)} percent.")
print(f"Those values are: {newDF['VehicleColor'].value_counts()[:15]}.")The top 15 values account for 95.37291781301586 percent.
Those values are:
WH       27814
GY       24704
WHITE    20817
BK       20778
BLACK    14486
GREY      9629
BL        9249
SILVE     5704
BLUE      5300
RD        4395
RED       3303
OTHER     2678
GR        1674
BROWN     1059
TN         938Let's also add in a column that flags if a vehicle has an excessive number of tickets on 20 randomly selected rows:
df = add_excessive_flag(df, threshold = 4)
print(df)which will print the DataFrame with a new column that is 1 if the number of tickets is larger than the threshold of 4:
      Plate ID  Tickets Registration State  Color  Excessive Tickets
10390   L30JWD        1          PAS    NJ  WHITE                  0
3144   80321MM        2          COM    NY  OTHER                  0
74     11153MA        1          COM    NY  WHITE                  0
11912   XHYY18        5          PAS    NJ  WHITE                  1
7705   JBW5667        2          PAS    NY  OTHER                  0
11451   W3F679        1          PAS    NY   GRAY                  0
11764   XGEY82        2          PAS    NJ  WHITE                  0
5480   FGA9371        1          PAS    NY  BLACK                  0
9958    KHBJ82        1          PAS    FL  OTHER                  0
9976   KHF3318        1          PAS    NY  OTHER                  0
7519   HZZ6700        1          PAS    NY   GRAY                  0
1473   40822ML        1          COM    NY  WHITE                  0
2098   58962JT        1          COM    NY  WHITE                  0
9379   KDR4765        3          PAS    NY  BLACK                  0
9254    K44ECD        1          PAS    NJ  OTHER                  0
2066   57819MM        1          COM    NY  WHITE                  0
11957   XJJH31        1          PAS    NJ  WHITE                  0
11437  W110304        1          PAS    NJ  WHITE                  0
4969   DDF6091        1          PAS    NY   GRAY                  0
6148   GUR4196        1          PAS    NY  BLACK                  0