Binning Data

MHC 250/Seminar 4:
Shaping the Future of New York City
Spring 2017

A simple but very powerful technique is "binning data": that is, grouping data into categories and counting the number of occurrences in each category. The per-category counts can often show patterns that individual data points do not. For example, binning population by zip code can show patterns in density that are difficult to see with individual data points.

For today's classwork, we will look at the parking tickets issued by New York City in fiscal year 2016 (see Homework 2 for details). Since the data set is quite large, we will focus on the 20th precinct, in which Macaulay Honors College is located. Since there were over 196,000 tickets in the 20th precinct for FY 2016, the file for today's classwork contains only the first 1000 lines: tickets.csv.

Here is a sample line of tickets.csv:

1335632335,L040HZ,FL,PAS,06/09/2015,46,SUBN,NISSA,X,35430,14510,15710,0,0020,20,74,921167,E074,0000,1213P,1207P,NY,O,4,WEST 83 ST,,0,408,C,,BBBBBBB,ALL,ALL,RED,0,0,-,0,,,,,

All lines are formatted similarly: they start with the summons number, then the license plate, registration state, plate type, and date, and continue with information about the location and type of violation, and sometimes additional information such as who issued the ticket and the color of the car. The first line of the file gives the names of the entries in the order they occur in the rows.
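As a quick way to see which column is which, the sketch below reads the first line with the csv module and prints each column name next to its position. It uses a two-line stand-in for tickets.csv that keeps only the first five fields of the sample line above; the column names in the stand-in header are assumptions, so check them against the first line of your own copy of the file.

```python
import csv
import io

# Stand-in for tickets.csv: only the first five fields of the sample line.
# The header names here are assumed; the real file's first line is authoritative.
sample = io.StringIO(
    "Summons Number,Plate ID,Registration State,Plate Type,Issue Date\n"
    "1335632335,L040HZ,FL,PAS,06/09/2015\n"
)

reader = csv.reader(sample)
header = next(reader)            # the first line holds the column names

# Print each column's position and name.
for index, name in enumerate(header):
    print(index, name)

first_row = next(reader)         # the next line is the first ticket
print(first_row)
```

To try this on the real data, replace the stand-in with a file opened via `open("tickets.csv", newline="")`.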

Here are some questions we can ask about the data:

For each of these questions, we can traverse the file and count the occurrences as we go. A great way to do this is with dictionaries.
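Here is a minimal sketch of counting occurrences with a dictionary. The list of registration states is made up for illustration and is not taken from tickets.csv.

```python
# Made-up sample values, for illustration only.
states = ["NY", "NJ", "NY", "FL", "NY", "NJ"]

counts = {}   # maps each category to the number of times it has been seen
for state in states:
    # get() returns 0 the first time a category appears
    counts[state] = counts.get(state, 0) + 1

print(counts)   # {'NY': 3, 'NJ': 2, 'FL': 1}
```

Each key is a category (a "bin"), and its value is how many items fell into that bin.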

Counting Tickets per Car

How can we tell which car got the most tickets? First, we need to figure out a unique way to identify different cars. Luckily, cars almost always have license plates, and each state issues unique plate numbers. (For this simple exercise, we'll assume that each license plate number is unique on its own. This is not an unreasonable assumption, since every state has a different scheme for assigning numbers, but to be more accurate we should keep track of both the license plate number and the issuing state.)
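If you do want the more accurate version, one possible approach (a sketch, not required for the classwork) is to use a tuple of plate number and state as the dictionary key. The plate and state pairs below are made up for illustration.

```python
# Made-up (plate, state) pairs; the same plate number appears in two states.
plates_seen = [("ABC123", "NY"), ("ABC123", "NJ"), ("ABC123", "NY")]

tickets = {}
for plate, state in plates_seen:
    key = (plate, state)                     # tuples can be used as dictionary keys
    tickets[key] = tickets.get(key, 0) + 1

print(tickets)   # {('ABC123', 'NY'): 2, ('ABC123', 'NJ'): 1}
```

With the plate number alone as the key, all three tickets would have been counted as belonging to one car.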

Open up the CSV file and look at the columns. Which column contains the license plate number? For each line, we can pull out the license plate number and use it as a "key" for a dictionary. Here's the basic idea of our program to count tickets per car:

  1. Open CSV file.
  2. Initialize the dictionary, tickets.
  3. Read first line of column names.
  4. For each line in the file:
  5.         Let plate be the license plate on the line.
  6.         Update the count of tickets for plate.
  7. Sort the dictionary by value.
  8. Print out the top 10 license plates, in terms of ticket counts.

Let's use the csv module, as we did at the end of last week's classwork (if you didn't reach that part of the classwork, read that last section before continuing). We will add in the Python code for each of these steps:
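One way the steps could be filled in is sketched below; it is not necessarily the version written in class. The column name "Plate ID" is an assumption, so check it against the first line of tickets.csv. The helper names count_tickets and top_n are made up for this sketch, and the data is a tiny made-up stand-in for the real file so that the sketch runs on its own.

```python
import csv
import io

def count_tickets(csv_file, column="Plate ID"):
    """Return a dictionary mapping each value in `column` to its ticket count."""
    tickets = {}                            # Step 2: initialize the dictionary
    reader = csv.DictReader(csv_file)       # Step 3: DictReader reads the column names
    for row in reader:                      # Step 4: for each line in the file
        plate = row[column]                 # Step 5: the license plate on the line
        tickets[plate] = tickets.get(plate, 0) + 1   # Step 6: update the count
    return tickets

def top_n(counts, n=10):
    """Return the n (key, count) pairs with the largest counts."""
    # Step 7: sort the dictionary's items by value, largest first
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Step 1: a made-up stand-in for the opened CSV file.
sample = io.StringIO(
    "Summons Number,Plate ID,Registration State\n"
    "1,ABC123,NY\n"
    "2,XYZ789,NJ\n"
    "3,ABC123,NY\n"
)

tickets = count_tickets(sample)

# Step 8: print the top 10 plates (here there are only two).
for plate, count in top_n(tickets):
    print(plate, count)
```

To run it on the real data, replace the stand-in with the opened file, for example `with open("tickets.csv", newline="") as f: tickets = count_tickets(f)`.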

Try your program to make sure it works, and then move on to the challenges below.

Binning Other Data

Now that you have a program to use as a basic template, answer the following questions:

(Remember to check the csv file for the name used for the columns and use that with your dictionary reader.)
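As a starting point, here is a sketch of binning by an arbitrary column. It swaps the hand-written dictionary update for collections.Counter from the standard library, which does the same counting. The function name bin_column, the column name "Vehicle Color", and the sample rows are assumptions for illustration; use the column names from the first line of tickets.csv.

```python
import csv
import io
from collections import Counter

def bin_column(csv_file, column):
    """Count how many rows have each value in the named column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)

# Made-up stand-in for tickets.csv with an assumed column name.
sample = io.StringIO(
    "Plate ID,Vehicle Color\n"
    "AAA111,RED\n"
    "BBB222,BLACK\n"
    "CCC333,RED\n"
)

color_counts = bin_column(sample, "Vehicle Color")

# most_common(10) returns up to ten (value, count) pairs, largest count first.
print(color_counts.most_common(10))   # [('RED', 2), ('BLACK', 1)]
```

Changing the column name passed to bin_column is enough to bin the same file in a different way.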

Additional Challenges

Modify your Python programs to answer the following: