CSci 39542 Syllabus    Resources    Coursework



Program 8: Ticket Predictor
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2023


Classwork    Quizzes    Homework    Project   

Program Description


Program 8: Ticket Predictor.Due 10am, Wednesday, 22 March.

Learning Objective: to employ aggregation and data cleaning techniques and to fit and evaluate classification models.

Available Libraries: pandas, numpy, sklearn, and core Python 3.6+. The available packages used for scikit-learn: sklearn.model_selection, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes, sklearn.svm, and sklearn.metrics.
Data Sources:
Parking Violations Data from OpenDataNYC.
Sample Datasets: Parking_Violations_Issued_Precinct_19_2021.csv

Can you predict which vehicles will get excessive number of tickets? In Lectures #7 and 8 and Chapter 19, we focused on the building of the classifers. For this program, we will use the parking data from OpenData NYC. While most vehicles have one or two tickets, some vehicles have a huge number, averaging to more than one day. We will build models to predict which vehicles are most likely to have an excessive number of tickets. The data is first cleaned to use standardized spellings of color names as well as one of three vehicle classes. The first function adds indicators for the specified categorical featuers. Subsequent functions trains a classifier on the data and returns the accuracy (score) of your classifier on the test data, as well as the classifier.

The first set of functions focus on cleaning the data. To clean the data, the first two functions that can be applied to a column of a DataFrame or Series. The second two add additional columns to be used for the analysis:

After applying these functions, the resulting DataFrame can then be used to build a classifer on how likely a particular car is to be one that has more than a ticket a day.

The second set of functions focus on training and testing models to classify which vehicles get excessive numbers of tickets:

Let's use our functions to clean the data and build some classifiers. As a first step, we will group by licence plate number and aggregate the state, vehicle type and color by choosing the first item stored for each:

df = pd.read_csv('Parking_Q1_2021_Lexington.csv')
#Focus on features about vehicles:
df = df[['Plate ID','Plate Type','Registration State','Issue Date','Vehicle Color']]
#Drop rows that are missing info:
df = df.dropna()
print(f'Your file contains {len(df)} parking violations.')
df['Plate Type'] = df['Plate Type'].apply(clean_reg)
df['Vehicle Color'] = df['Vehicle Color'].apply(clean_color)
#Count tickets for each vehicle:
newDF =  df.groupby('Plate ID').agg(NumTickets =
    pd.NamedAgg(column = 'Plate ID', aggfunc = 'count'),
    Registration = pd.NamedAgg(column = 'Plate Type', aggfunc = 'first'),
    State = pd.NamedAgg(column = 'Registration State', aggfunc = 'first'),
    Color = pd.NamedAgg(column = 'Vehicle Color', aggfunc = 'first')
)
print(newDF)
which will print:
Your file contains 20589 parking violations.
          NumTickets Registration State  Color
Plate ID
00356R2            1          PAS    TX  WHITE
004LSM             1          PAS    TN  OTHER
00574R7            1          PAS    TX  WHITE
0064NQD            1          PAS    DP  BLACK
0107NQD            1          PAS    DP   GRAY
...              ...          ...   ...    ...
ZRB1864            1          PAS    PA  WHITE
ZSA6859            1          PAS    PA   GRAY
ZSE1922            1          PAS    PA  WHITE
ZWF62E             1          PAS    NJ  OTHER
ZWZ35J             1          PAS    NJ  OTHER

We can then, using the cleaned data, add indicator columns for the categorical data:

newDF = add_indicators(newDF)
print(newDF)
will add the indicator variables:
          NumTickets  Registration_OTHER  ...  State_WI  State_WV
Plate ID                                  ...
00356R2            1                   0  ...         0         0
004LSM             1                   0  ...         0         0
00574R7            1                   0  ...         0         0
0064NQD            1                   0  ...         0         0
0107NQD            1                   0  ...         0         0
...              ...                 ...  ...       ...       ...
ZRB1864            1                   0  ...         0         0
ZSA6859            1                   0  ...         0         0
ZSE1922            1                   0  ...         0         0
ZWF62E             1                   0  ...         0         0
ZWZ35J             1                   0  ...         0         0

While this works well for State, Registration and Vehicle Color:

print(f'Registration: {newDF['Registration'].unique()})
print(f'State: {newDF['State'].unique()})
print(f'VehicleColor: {newDF['VehicleColor'].unique()})
prints expected values for states but many different types of registrations and abbreviations and mispellings for colors:
Registration: ['PAS' 'COM' 'USC' 'MOT' 'LMB' '999' 'CMB' 'RGL' 'SRF' 'MED' 'APP' 'ORG'
'ITP' 'OMR' 'TRA' 'BOB' 'SPO' 'LMA' 'VAS' 'OML' 'TOW' 'DLR' 'AMB' 'TRC'
'STG' 'AGR' 'NLM' 'ORC' 'IRP' 'TRL' 'MCL' 'OMT' 'SCL' 'SPC' 'CHC' 'HIS'
'SRN' 'RGC' 'PHS' 'PSD' 'MCD' 'NYA' 'JCA' 'SOS' 'CSP' 'OMS' 'CBS' 'OMV'
'HAM']
State: ['DP' 'NJ' 'PA' 'TX' 'OK' 'NY' 'OH' '99' 'DC' 'AR' 'IL' 'MN' 'NC' 'NV'
'FL' 'GV' 'CA' 'NH' 'MD' 'CT' 'MO' 'RI' 'MS' 'MA' 'MI' 'TN' 'WV' 'AL'
'OR' 'KS' 'VA' 'KY' 'AZ' 'WA' 'NM' 'CO' 'SC' 'WI' 'ME' 'DE' 'HI' 'IN'
'WY' 'MT' 'NE' 'VT' 'GA' 'LA' 'SD' 'ON' 'IA' 'ID' 'ND' 'SK' 'UT' 'AK'
'QB' 'AB' 'BC' 'MX' 'PR' 'NS' 'MB' 'FO']
VehicleColor: ['BLACK' 'SILVE' 'GREY' 'WHITE' 'RED' 'OTHER' 'BLUE' 'GY' 'BLK' 'BK'
'PURPL' 'TAN' 'GREEN' 'YELLO' 'ORANG' 'BL' 'SILV' 'GRAY' 'BROWN' nan
'GRY' 'WH' 'SIL' 'GOLD' 'WT' 'WHT' 'GR' 'RD' 'YW' 'BR' 'LTG' 'WH/' 'OR'
'WHB' 'TN' 'BRN' 'MR' 'DK/' 'BLW' 'GL' 'PR' 'BU' 'DKB' 'W' 'GRT' 'ORG'
'RD/' 'LT/' 'NO' 'LTT' 'GRN' 'BN' 'TB' 'BRO' 'B' 'RDW' 'SL' 'BURG' 'BLU'
'NOC' 'BK/' 'DKG' 'WHG' 'PINK' 'G' 'LAVEN' 'BL/' 'YEL' 'OG' 'GRW' 'WHI'
'WHTE' 'BUR' 'GY/' 'DKR' 'RDT' 'GN' 'BUN' 'SV' 'BKG' 'YELL' 'WHIT' 'GR/'
'LTTN' 'SLV' 'BRWN' 'GYB' 'WHTIE' 'WI' 'BUS' 'LTB' 'TN/' 'GD' 'MAROO'
'BW' 'BLG' 'ORA' 'GRA' 'DKP' 'NAVY' 'GREG' 'GRB' 'BRW' 'BBRN' 'R' 'GRRY'
'BLA' 'BG' 'MAR' 'BURGA' 'BRWON' 'YLW' 'ORNG' 'HREY' 'DERD' 'YL' 'PLE'
'BWN' 'BI']
The first two registration types account are the most common:
count = len(newDF)
pasCount = len(newDF[newDF['Registration'] == 'PAS'])
comCount = len(newDF[newDF['Registration'] == 'COM'])
print(f'{count} different vehicles, {100*(pasCount+comCount)/count} percent are passenger or commercial plates.')

And for the Precinct District 19 dataset that contains almost a half million tickets:

159928 different vehicles, 93.95477965084288 percent are passenger or commercial plates.
And similarly, 15 of the entries for vehicle color account for most of the entries:
print(newDF['VehicleColor'].unique())
print(f"The top 15 values account for {100*newDF['VehicleColor'].value_counts()[:15].sum()/len(newDF)} percent.")
print(f"Those values are: {newDF['VehicleColor'].value_counts()[:15]}.")
which prints:
The top 15 values account for 95.37291781301586 percent.
Those values are:
WH       27814
GY       24704
WHITE    20817
BK       20778
BLACK    14486
GREY      9629
BL        9249
SILVE     5704
BLUE      5300
RD        4395
RED       3303
OTHER     2678
GR        1674
BROWN     1059
TN         938

Let's also add in a column that flags if a vehicle has an excessive number of tickets on 20 randomly selected rows:

df = add_excessive_flag(df, threshold = 4)
print(df)

which will print the DataFrame with a new column that is 1 if the number of tickets is larger than the threshold of 4:


      Plate ID  Tickets Registration State  Color  Excessive Tickets
10390   L30JWD        1          PAS    NJ  WHITE                  0
3144   80321MM        2          COM    NY  OTHER                  0
74     11153MA        1          COM    NY  WHITE                  0
11912   XHYY18        5          PAS    NJ  WHITE                  1
7705   JBW5667        2          PAS    NY  OTHER                  0
11451   W3F679        1          PAS    NY   GRAY                  0
11764   XGEY82        2          PAS    NJ  WHITE                  0
5480   FGA9371        1          PAS    NY  BLACK                  0
9958    KHBJ82        1          PAS    FL  OTHER                  0
9976   KHF3318        1          PAS    NY  OTHER                  0
7519   HZZ6700        1          PAS    NY   GRAY                  0
1473   40822ML        1          COM    NY  WHITE                  0
2098   58962JT        1          COM    NY  WHITE                  0
9379   KDR4765        3          PAS    NY  BLACK                  0
9254    K44ECD        1          PAS    NJ  OTHER                  0
2066   57819MM        1          COM    NY  WHITE                  0
11957   XJJH31        1          PAS    NJ  WHITE                  0
11437  W110304        1          PAS    NJ  WHITE                  0
4969   DDF6091        1          PAS    NY   GRAY                  0
6148   GUR4196        1          PAS    NY  BLACK                  0