Binning Data, Seminar 4, MHC, CUNY, Spring 2017

Binning Data

A simple but very powerful technique is "binning data": grouping data into categories and counting the number of occurrences in each category. The per-category counts can often show patterns that individual data points do not. For example, binning population by zip code can show patterns in density that are difficult to see with individual data points.
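As a minimal sketch of the idea, the snippet below bins a short list of zip codes into counts. The zip codes are made-up sample values for illustration, not taken from the ticket data.

```python
# Hypothetical sample data: one zip code per person surveyed.
zipcodes = ["10023", "10024", "10023", "10025", "10023", "10024"]

# Bin the data: map each zip code to the number of times it occurs.
bins = {}
for z in zipcodes:
    if z in bins:
        bins[z] += 1   # seen before: add one to its count
    else:
        bins[z] = 1    # first occurrence: start the count at 1

print(bins)
```

The individual entries say little on their own, but the counts show at a glance which zip code occurs most often.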

For today's classwork, we will look at the parking tickets issued by New York City in fiscal year 2016 (see Homework 2 for details). Since the data set is quite large, we will focus on the 20th precinct, in which Macaulay Honors College is located. Since there were over 196,000 tickets in the 20th precinct in FY 2016, the file for today's classwork contains only the first 1000 lines: tickets.csv.

Here is a sample line of tickets.csv:

1335632335,L040HZ,FL,PAS,06/09/2015,46,SUBN,NISSA,X,35430,14510,15710,0,0020,20,74,921167,E074,0000,1213P,1207P,NY,O,4,WEST 83 ST,,0,408,C,,BBBBBBB,ALL,ALL,RED,0,0,-,0,,,,,

All lines are formatted similarly: they start with the summons number, then the license plate, registration state, plate type, and date, and continue with information about the location and type of violation, and sometimes additional information such as who issued the ticket and the color of the car. The first line of the file gives the column names in the order they occur in the rows.
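To see how such a line breaks into fields, here is a small sketch that feeds the sample line above to csv.reader and pulls out the first five entries. The variable names are our own choice for illustration; they are not the column names from the file.

```python
import csv
import io

# The sample line from tickets.csv shown above.
sample = ("1335632335,L040HZ,FL,PAS,06/09/2015,46,SUBN,NISSA,X,35430,14510,"
          "15710,0,0020,20,74,921167,E074,0000,1213P,1207P,NY,O,4,WEST 83 ST,"
          ",0,408,C,,BBBBBBB,ALL,ALL,RED,0,0,-,0,,,,,")

# csv.reader splits each line on commas; next() gives us the first (only) row.
fields = next(csv.reader(io.StringIO(sample)))

# The first five entries, in the order described in the text.
summons, plate, state, plate_type, date = fields[:5]
print("Plate", plate, "from", state, "was ticketed on", date)
```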

Here are some questions we can ask about the data:

which car got the most tickets?
what color of car is most likely to get a ticket?
what type of license gets the most tickets?
which location yields the most tickets?

For each of these questions, we can traverse the file and count the occurrences of each value as we go. A great way to do this is with dictionaries.

Counting Tickets per Car

How can we tell which car got the most tickets? First, we need to figure out a unique way to identify different cars. Luckily, cars almost always have license plates, and each state assigns unique plate numbers to its cars. (For this simple exercise, we'll assume that each license plate number is unique on its own. This is not an unreasonable assumption since every state has a different scheme for assigning numbers, but to be more accurate we should keep track of both the license plate number and the issuing state.)
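If you want the more accurate version later, one possible approach (a sketch, not part of today's program) is to use a tuple of plate number and state as the dictionary key. The plates and states below are invented for illustration.

```python
# Hypothetical (plate, state) pairs, one per ticket.
ticketed = [("ABC123", "NY"), ("ABC123", "NJ"), ("ABC123", "NY")]

tickets_by_car = {}
for plate, state in ticketed:
    car = (plate, state)   # a tuple can be used as a dictionary key
    # get() gives the count so far, or 0 if this car has not been seen.
    tickets_by_car[car] = tickets_by_car.get(car, 0) + 1

print(tickets_by_car)
```

With the plate number alone, all three tickets would be counted for one car; with the tuple key, the New York and New Jersey cars are counted separately.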

Open up the CSV file and look at the columns. Which column contains the license plate number? For each line, we can pull out the license plate number and use it as a "key" for a dictionary. Here's the basic idea of our program to count tickets per car:

Open CSV file.
Initialize the dictionary, tickets.
Read first line of column names.
For each line in the file:
    Let plate be the license plate on the line.
    Update the count of tickets for plate.
Sort the dictionary by value.
Print out the top 10 license plates, in terms of ticket counts.

Let's use the csv module, as we did at the end of last week's classwork (if you didn't reach that part of the classwork, read that last section before continuing). We will add in the Python code for each of these steps:

We are using the csv library, so we need to import it:
```
import csv
```

We will use a dictionary to store the ticket information:

#Setting up a dictionary to store tickets:
tickets = {}

Our data is in tickets.csv, which we can open with a dictionary reader:

#Using the dictionary reader to access by column names
f = open("tickets.csv")
reader = csv.DictReader(f)

Since we are using a dictionary reader, we can refer to columns from the CSV file by name:

for row in reader:
    plate = row["Plate ID"]
    tickets[plate] = tickets.get(plate, 0) + 1
    print("Ticket", tickets[plate], "for", plate)
f.close()

(The print() is there so we can check that it's working.)
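The line with tickets.get(plate, 0) does the counting. As a small sketch of how it works, the snippet below counts the same made-up plates twice: once with get(), and once with a longer version that checks whether the plate is already in the dictionary.

```python
# Hypothetical plates, one per ticket.
plates = ["ABC123", "XYZ789", "ABC123"]

# Short version: get() returns the current count, or 0 if the plate
# has no entry in the dictionary yet.
tickets = {}
for plate in plates:
    tickets[plate] = tickets.get(plate, 0) + 1

# Longer version that does the same thing with an explicit check.
longhand = {}
for plate in plates:
    if plate not in longhand:
        longhand[plate] = 0
    longhand[plate] += 1

print(tickets == longhand)
```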

We want the license plates that got the highest number of tickets. There are several ways to do this (see the dictionary examples on the class webpage for a nice way to do this with list comprehensions). Here, we first make a list of the worst offenders (sorting in reverse since we want the largest values first):
```
worst = sorted(tickets, key = tickets.__getitem__, reverse=True)
```
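To see what this sort produces, here is a sketch on a tiny dictionary with made-up plates and counts. Sorting a dictionary gives a list of its keys; the key argument tells sorted() to order them by their ticket counts instead of alphabetically.

```python
# Hypothetical counts for three plates.
tickets = {"ABC123": 2, "XYZ789": 5, "LMN456": 1}

# tickets.__getitem__(plate) is the same as tickets[plate], so each
# plate is ranked by its count; reverse=True puts the largest first.
worst = sorted(tickets, key=tickets.__getitem__, reverse=True)

print(worst)   # ['XYZ789', 'ABC123', 'LMN456']
```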
Once we have that list, we can just print out the top 10 values:
```
# Slicing with [:10] also works if there are fewer than 10 plates.
for plate in worst[:10]:
    print("Plate", plate, "has", tickets[plate], "tickets.")
```
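As an alternative to sorting by hand, Python's collections module has a Counter class that does the counting and the ranking for us. The sketch below is a different technique from the one above, shown for comparison. It reads from a small in-memory stand-in for tickets.csv (made-up plates, with only the "Plate ID" column) so that it runs on its own.

```python
import csv
import io
from collections import Counter

# Stand-in for tickets.csv: hypothetical data with one column.
data = io.StringIO("Plate ID\nABC123\nXYZ789\nABC123\nLMN456\nABC123\nXYZ789\n")
reader = csv.DictReader(data)

# Counter tallies how many times each plate appears.
tickets = Counter(row["Plate ID"] for row in reader)

# most_common(n) returns the n largest counts as (plate, count) pairs.
top = tickets.most_common(2)
for plate, count in top:
    print("Plate", plate, "has", count, "tickets.")
```

To use it on the real file, replace the io.StringIO(...) line with open("tickets.csv") and ask for most_common(10).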

Try your program to make sure it works, and then move on to the challenges below.

MHC 250/Seminar 4:
Shaping the Future of New York City
Spring 2017
