A simple but very powerful technique is "binning data": that is, grouping data into categories and counting the number of occurrences in each category. The category counts can often show patterns that individual data points do not. For example, binning population by zip code can show patterns in density that are difficult to see with individual data points.
For today's classwork, we will look at the parking tickets issued by New York City in fiscal year 2016 (see Homework 2 for details). Since the data set is quite large, we will focus on the 20th precinct, in which Macaulay Honors College is located. There were over 196,000 tickets in FY 2016 for the 20th precinct alone, so the file for today's classwork contains only the first 1000 lines: tickets.csv.
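As a minimal sketch of the idea, the snippet below bins a short list of zip codes by counting how often each one occurs. The zip codes are made-up sample values, not taken from the ticket data, and Counter from Python's collections module is just one convenient way to do the counting.

```python
from collections import Counter

# Hypothetical sample: one zip code per record (made-up values for illustration).
zipcodes = ["10023", "10024", "10023", "10025", "10023", "10024"]

# Binning: count the number of occurrences of each category (here, each zip code).
bins = Counter(zipcodes)

# Print the categories in sorted order along with their counts.
for zipcode, count in sorted(bins.items()):
    print(zipcode, count)
```

Looking at the counts per zip code, rather than at the six individual records, makes it easier to see which category dominates.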
Here is a sample line of tickets.csv:
1335632335,L040HZ,FL,PAS,06/09/2015,46,SUBN,NISSA,X,35430,14510,15710,0,0020,20,74,921167,E074,0000,1213P,1207P,NY,O,4,WEST 83 ST,,0,408,C,,BBBBBBB,ALL,ALL,RED,0,0,-,0,,,,,
All lines are formatted similarly: they start with the summons number, then the license plate, registration state, plate type, and date, and continue with information about the location and type of violation, and sometimes additional information such as who issued the ticket and the color of the car. The first line of the file gives the names of the entries in the order they occur in the rows.
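One way to check which names the first line gives is to ask the csv module for them. The sketch below uses a small in-memory stand-in for the file so that it runs on its own. Only the first five columns are included, and the column names other than "Plate ID" are assumptions for illustration, so compare them against the actual first line of tickets.csv.

```python
import csv
import io

# Stand-in for the first lines of tickets.csv. Only five columns are shown, and
# the column names other than "Plate ID" are assumed for illustration.
sample = io.StringIO(
    "Summons Number,Plate ID,Registration State,Plate Type,Issue Date\n"
    "1335632335,L040HZ,FL,PAS,06/09/2015\n"
)

reader = csv.DictReader(sample)

# fieldnames holds the entries of the header line, in column order.
columns = reader.fieldnames
print(columns)

# Each row is a dictionary keyed by those column names.
first_row = next(reader)
print(first_row["Plate ID"])
```

To try this on the real data, replace the in-memory sample with the file opened by open("tickets.csv").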
Here are some questions we can ask about the data:
How can we tell which car got the most tickets? First, we need to figure out a unique way to identify different cars. Luckily, cars almost always have license plates, with each state issuing its own plate numbers. (For this simple exercise, we'll assume that each license plate number is unique on its own. This is not an unreasonable assumption, since every state has a different scheme for assigning numbers, but to be more accurate we should keep track of both the license plate number and the issuing state.)
Open up the CSV file and look at the columns. Which column contains the license plate number? For each line, we can pull out the license plate number and use it as a "key" for a dictionary. Here's the basic idea of our program to count tickets per car:
Let's use the csv module, as we did at the end of last week's classwork (if you didn't reach that part of the classwork, read that last section before continuing). We will add in the Python code for each of these steps:
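To see why the issuing state matters, here is a small sketch comparing the two ways of identifying cars. The plate "ABC123" and its states are made up for illustration. The point is only that a (plate, state) pair can tell apart cars that a plate number alone would merge.

```python
# Hypothetical (plate, state) pairs; the plate "ABC123" is made up and
# appears here with two different issuing states.
cars = [("ABC123", "NY"), ("ABC123", "NJ"), ("L040HZ", "FL"), ("ABC123", "NY")]

# Identifying cars by plate alone merges the NY and NJ cars into one.
by_plate = {plate for plate, state in cars}

# Identifying cars by (plate, state) keeps them separate.
by_plate_and_state = set(cars)

print(len(by_plate))            # number of distinct plates
print(len(by_plate_and_state))  # number of distinct (plate, state) pairs
```

Since tuples can also be used as dictionary keys, the same pair could serve as the key when counting tickets.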
import csv
#Setting up a dictionary to store tickets:
tickets = {}
#Using the dictionary reader to access by column names
f = open("tickets.csv")
reader = csv.DictReader(f)
for row in reader:
    plate = row["Plate ID"]
    tickets[plate] = tickets.get(plate, 0) + 1
    print("Ticket", tickets[plate], "for", plate)
f.close()
(The print() is there so we can check that it's working.)
#Sorting the plates by number of tickets, largest first
worst = sorted(tickets, key = tickets.__getitem__, reverse=True)
Once we have that list, we can just print out the top 10 values:
for i in range(10):
    print("Plate", worst[i], "has", tickets[worst[i]], "tickets.")
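The sorting step is the least obvious part of the program, so here is a small sketch of it on its own. The plates and counts are made up for illustration. It uses tickets.get as the key function, which looks up the count for each plate just as tickets.__getitem__ does, and it uses a slice in place of range(10) so that it also works when there are fewer than 10 plates.

```python
# Made-up counts for illustration (not taken from tickets.csv).
tickets = {"AAA111": 2, "BBB222": 5, "CCC333": 1, "DDD444": 3}

# sorted() iterates over the dictionary's keys; key= tells it to compare
# the plates by their ticket counts, and reverse=True puts the largest first.
worst = sorted(tickets, key=tickets.get, reverse=True)
print(worst)

# Slicing avoids an IndexError when there are fewer plates than requested.
for plate in worst[:10]:
    print("Plate", plate, "has", tickets[plate], "tickets.")
```

Note that sorted() returns a list of plates only. The counts stay in the dictionary, which is why the loop looks each one up again.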
Now that you have a program to use as a basic template, answer the following questions: