Structured Data: Hunter College OpenData Week Showcase

Using Python

We will be using the IDLE programming environment for Python, since it is very simple and comes with all distributions of Python (if you would prefer to use another programming environment, Spyder is loaded on the Hunter machines).

To launch IDLE:

Click the icon on the left for the terminal. It will launch a terminal window in which you can type commands.
In the terminal window, type:
```
idle3
```
(followed by an enter/return).
A new window will launch for IDLE.
To see that it works, type at the IDLE prompt:
```
print("Hello, World!")
```

Instead of using the shell window (where we can try things immediately), let's use a text window, where we can save our program for later.

First, open up a text window: on the menu bar, choose "File" and from that menu, choose "New File".

In that window, type:

#Name:  ...your name here...
#Date: August 25, 2017
#This program prints: Hello, World!

print("Hello, World!")

Save the program (using the "Save" under the "File" menu). When you save it, name it something that you will be remember for the future and end it in .py. For example, ps1.py. If you would like to have the programs you write for later, you can save your programs to a USB drive, DropBox, or mail them to yourself.
Run your program (using the "Run Module" from the "Run" menu).
If it prints "Hello, World!" to the screen, you're program was successful!

Binning Data: Parking Tickets

Via the NYC Open Data project, you can access data from almost every city agency. Today, we will look at the parking tickets issued by New York City. We will use a small version (1000 lines), but you are welcome to also use any neighborhood in the city. To download data for a given neighborhood (and restricted to just fiscal year 2018, since the data sets can be quite large):

Instead of neighborhood name or zipcode, parking tickets are classified by the police precinct that issued the ticket. First, figure out the precinct (here's a useful tool).
Fiscal year 2018 (FY2018) parking tickets for your precinct: download the data set as a CSV, filtering by "Violation Precinct". For example, Hunter College is located in the 19th Precinct, so, you would enter 19 on the filter.

A simple, but very powerful, technique is "binning data"-- that is grouping data into the number of occurrences for each categories. The category values can often show patterns that individual data points do not. For example, binning population by zipcode can show patterns in density that's difficult to see with individual data points.

CSV Data Files

To make reading files easier, we will use the Pandas library that lets you read in structured data files very efficiently. Pandas, Python Data Analysis Library, is an elegant, open-source package for extracting, manipulating, and analyzing data, especially those stored in 2D arrays (like spreadsheets).

(Pandas is installed on all the Hunter machines. If you are using your own machine, see these directions for installing packages for Python.)

In Pandas, the basic structure is a DataFrame which stored data in rectangular grids. The Comma-Separated-Values (CSV) files, we saw in the previous section, store tabular information in readable text files. The files downloaded above have information separated by commas (using tabs as delimiters is also common). Here is a sample line:

1335632335,L040HZ,FL,PAS,06/09/2015,46,SUBN,NISSA,X,35430,14510,15710,0,0020,20,74,921167,E074,0000,1213P,1207P,NY,O,4,WEST 83 ST,,0,408,C,,BBBBBBB,ALL,ALL,RED,0,0,-,0,,,,,

All lines are formatted similarly: they start with the summons number, then the license plate, registration state, plate Type, date, and continue with the information about the location and type of violation, and sometimes additional information such as the who issued the ticket and the color of the car. The first line of the file gives the entries in the order they occur in the rows.

The sample entry above gives details for a ticket issues on June 9, 2015 to a passenger car with Florida plates, L040HZ. The red Nissan SUV received the ticket on West 83rd Street. Each entry also begins with a unique identifier that can be used to look up the parking ticket.

Since there were over 196,000 tickets for the FY 2016 for the 20th precinct, the file for today's classwork is the first 1000 lines: tickets.csv.

Here are some questions we can ask about the data:

Which car got the most tickets?
What color of car is most likely to get a ticket?
What type of license gets the most tickets?
Which location yields the most tickets?

For each of these questions, we can traverse the file and count the occurrences as we go.

Counting Tickets per Car

How can tell which car got the most tickets? First, we need to figure out a unique way to identify different cars. Luckily, cars almost always have license plates-- with each state having a unique number. (For this simple exercise, we'll assume that each license plate is unique on its own-- not an unreasonable assumption since every state has a different schema for assigning numbers, but to be more accurate we should keep track of license plate number and issuing state.)

Open up the CSV file and look at the columns. Which column contains the license plate number? Here's all the column names (first line of tickets.csv):

Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation

It's the second column: Plate ID. Scanning the CSV file, it looks like most cars got one or two tickets. How can we get the worst offenders (i.e. those cars that got the most parking tickets)?

Let's use Pandas to read in the CSV file, following the same pattern as last lab:

#CSci 127 Teaching Staff
#October 2017
#Count which cars got the most parking tickets

#Import pandas for reading and analyzing CSV data:
import pandas as pd

csvFile = "tickets.csv"			#Name of the CSV file
tickets = pd.read_csv(csvFile)		#Read in the file to a dataframe
print(tickets) 				#Print out the dataframe

Run your program (make sure that you have the file, tickets.csv in the same folder as your program). It should print out all the information about all the tickets issued. Let's focus in on just licence plates. Change the last line of your program to be:

print(tickets["Plate ID"])	#Print out licence plates

When you run the program again, you should just see the row number and licence plates.

We want to refine this further to print how many tickets each car got. Pandas has a function just for counting occurrences, called value_counts(). Let's modify our last line again to use it:

print(tickets["Plate ID"].value_counts())	#Print out plates & number of tickets each got

Rerunning the program, there are a lot of cars that got only a single ticket. If you scroll back up the Python shell, you will see the cars with the most tickets are listed first. Let's just print out the 10 cars that got the most tickets. We can do this by slicing to [:10]:

print(tickets["Plate ID"].value_counts()[:10])	#Print 10 worst & number of tickets

Even with only 1000 lines of ticket information, there is a car (with plate TOPHAT5) that got more than 5 tickets.

Let's make our program a bit more general, to allow the user to enter their own file name:

#CSci 127 Teaching Staff
#October 2017
#Count which cars got the most parking tickets

#Import pandas for reading and analyzing CSV data:
import pandas as pd

csvFile = input('Enter CSV file name: ')         #Name of the CSV file
tickets = pd.read_csv(csvFile)     #Read in the file to a dataframe
print("The 10 worst offenders are:")
print(tickets["Plate ID"].value_counts()[:10]) #Print out the dataframe

And run it on all tickets for the 20th precinct for January 2016 (14,000 tickets): Parking_Violations_Jan_2016.csv.

You should see output:

Enter CSV file name: Parking_Violations_Jan_2016.csv
The 10 worst offenders are:
63044JM    52
63277JM    46
63540JM    42
93503JT    36
42816JM    35
97223JE    35
62150JM    35
31420MG    32
23246MA    31
AP113R     30
Name: Plate ID, dtype: int64

For just the month of January 2016, there were 9 cars that got more than a ticket a day.

Challenges: Binning Other Data

Now that you have a program to use as a basic template, answer the following questions:

What color of car is most likely to get a ticket?
What type of license gets the most tickets?
Which location yields the most tickets?

Hunter College OpenData Showcase:
Structured Data & Pandas

Reading Structured Data: Parking Tickets

Using Python

Binning Data: Parking Tickets

CSV Data Files

Counting Tickets per Car

Challenges: Binning Other Data

What's Next?

Hunter College OpenData Showcase: Structured Data & Pandas

Reading Structured Data: Parking Tickets

Using Python

Binning Data: Parking Tickets

CSV Data Files

Counting Tickets per Car

Challenges: Binning Other Data

What's Next?

Hunter College OpenData Showcase:
Structured Data & Pandas