Homework #3

CMP 464-C401/MAT 456-01:
Topics Course: Data Science
Spring 2016

Topics: Binning Data & Measuring Dispersion; NYC Collision Data
Deadline: Thursday, 25 February 2016, 10:30am

New York City Collision Data

This assignment uses collision data collected and made publicly available by New York City Open Data. The data can be found at:

https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95.

For this assignment, you will need to download two different data sets:

  1. Collisions on your birthday: Using the "Filter" option, choose your birthday in 2015 and "Export" (in CSV format) all collisions for that day.
  2. Collisions for a zip code: Pick a zip code for somewhere in New York City (if you cannot think of one, you can use "10468" which is the zip code for the area including Lehman College). Filter with this zip code and export this file also as a CSV file.

We will use these data sets in later homework assignments. Since scraping the data takes time, save these data sets to reuse in future programs.

CSV Data Files

CSV files store tabular information in readable text files. The files downloaded above have information separated by commas (using tabs as delimiters is also common). Here is a sample line:
02/01/2016,0:09,BRONX,10465,40.8341548,-73.8174815,"(40.8341548, -73.8174815)",BARKLEY AVENUE,DEAN AVENUE,,0,0,0,0,0,0,0,0,Driver Inattention/Distraction,Driver Inattention/Distraction,,,,3381301,PASSENGER VEHICLE,PASSENGER VEHICLE,,,

All lines are formatted similarly: they start with the date, then time, the borough, zip code, latitude and longitude, and also include cross streets, types of vehicles involved, number of injuries/fatalities, and possible cause. The first line of the file gives the column names in the order they occur in the rows.

The sample entry above gives details for a crash that occurred just past midnight at the corner of Barkley and Dean Avenues. There were no injuries and two passenger vehicles were involved. The probable cause was driver inattention on the part of both drivers. Each entry also includes a unique key that can be used to look up the report of the incident.

The textbook has a nice explanation (p 107 & sample code, line 161) of using the CSV module. You should use that as a basis for the programs below that take CSV files as input.
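As a rough illustration of what the CSV module does with a line like the sample above, here is a minimal sketch (not the textbook's code). It assumes Python 3 and reads the sample line from a string instead of a file; the variable names and the column positions (date in column 0, time in column 1, borough in column 2, zip code in column 3, location in column 6) are taken from the sample line and should be checked against the header row of your own file.

```python
import csv
import io

# The sample line quoted above; a real program would open the exported
# CSV file instead (and skip its header row with next(reader)).
SAMPLE = ('02/01/2016,0:09,BRONX,10465,40.8341548,-73.8174815,'
          '"(40.8341548, -73.8174815)",BARKLEY AVENUE,DEAN AVENUE,,'
          '0,0,0,0,0,0,0,0,Driver Inattention/Distraction,'
          'Driver Inattention/Distraction,,,,3381301,'
          'PASSENGER VEHICLE,PASSENGER VEHICLE,,,\n')

reader = csv.reader(io.StringIO(SAMPLE))
rows = list(reader)

# Each row is a list of strings, one per column.
row = rows[0]
date, time, borough, zip_code = row[0], row[1], row[2], row[3]
location = row[6]   # the quoted field keeps its inner comma

print(date, time, borough, zip_code)
print(location)
```

Note that splitting the line on commas by hand would break the quoted location field into two pieces; the CSV module keeps it as one entry.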

Assignment

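The work to be submitted depends on whether you are enrolled in the computer science course (CMP 464) or the mathematics course (MAT 456). Problems #1-2 and #5-6 are the same for both courses; problems #3-4 have a separate version for each course.
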
#1-2 Using the birthday data set (see above), display a histogram of the number of collisions that occur each hour. That is, your x-axis will have the hours from 0 to 23 and the y-axis will be the number of collisions. Make sure to include in the title of your plot the date plotted.

#1: Submit your Python program as a .py file.
#2: Submit a screen shot of the graphics window containing the plot.


Hint: In this file, times of collisions are stored as "H:MM" or "HH:MM" (the sample line above has "0:09"). To get the hour to use as a key for your dictionary, you can split the time string at the colon. If times are stored in the list time, time[i].split(":")[0] will give the characters before the colon (i.e. those that store the hour part of the time) of the ith entry.
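
The binning step for the hourly histogram might look like the sketch below. It is only an illustration: the list times holds made-up values standing in for the time column of your file, and the names counts, hours, and heights are hypothetical. The sketch splits each time at the colon so that single-digit hours such as "0:09" are handled as well as two-digit ones.

```python
# Hypothetical time strings, standing in for the time column of the file.
times = ["0:09", "0:45", "8:30", "13:05", "13:59", "23:10"]

# Start every hour at zero so that hours with no collisions still get a bar.
counts = {hour: 0 for hour in range(24)}
for t in times:
    hour = int(t.split(":")[0])   # "0:09" -> 0, "13:05" -> 13
    counts[hour] += 1

# x-values (hours 0..23) and y-values (collisions in each hour) for plotting.
hours = sorted(counts)
heights = [counts[h] for h in hours]
print(heights)
```

The two lists can then be handed to whatever bar-plotting function you are using (for example, matplotlib's plt.bar(hours, heights), if that is your plotting library).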
#3-4 (CMP 464) Using the birthday data set, display the fraction of collisions that occur in the Bronx each hour. That is, for hour 0 (midnight to just before 1am), your y-value should be the number of collisions that occurred in the Bronx at hour 0 divided by the number of collisions across the whole city at hour 0.

#3: Submit your Python program as a .py file.
#4: Submit a screen shot of the graphics window containing the plot.

Hint: To extract parts of a field, for example to find out the hour of a time written as "H:MM" or "HH:MM", you can first find the location of the ":" in the string and then use it. For example, if timeString holds the string, then c = timeString.find(":") finds the location, and hour = int(timeString[:c]) will give the hour as an integer.
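
One possible way to organize the fraction computation is sketched below, using the find(":") approach from the hint. The records list is made-up data standing in for the time and borough columns, and the names total, bronx, and fractions are hypothetical. How to treat an hour with no collisions at all is not specified in the assignment; this sketch assumes a value of 0.0 for such hours.

```python
# Hypothetical (time, borough) pairs, standing in for two columns of the file.
records = [("0:09", "BRONX"), ("0:45", "QUEENS"), ("8:30", "BRONX"),
           ("8:40", "BRONX"), ("13:05", "BROOKLYN"), ("13:59", "")]

total = [0] * 24   # collisions across the whole city, by hour
bronx = [0] * 24   # collisions in the Bronx, by hour
for time_string, borough in records:
    c = time_string.find(":")        # position of the colon
    hour = int(time_string[:c])      # everything before the colon
    total[hour] += 1
    if borough == "BRONX":
        bronx[hour] += 1

# Hours with no collisions would divide by zero; this sketch reports 0.0 there.
fractions = [bronx[h] / float(total[h]) if total[h] > 0 else 0.0
             for h in range(24)]
print(fractions[0], fractions[8], fractions[13])
```

In this sketch, a row whose borough field is empty still counts toward the citywide total for its hour.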
#3-4 (MAT 456) Using the binned data from #1, compute the mean and variance of the hour at which collisions occur.

#3: Submit your Python program as a .py file.
#4: Submit the output of your Python program as a text file or screen shot of the shell output.
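
For binned data, the mean and variance can be computed directly from the counts without going back to the individual collisions. The sketch below shows one way to do this with made-up counts. It makes two assumptions that the assignment does not state: every collision in a bin is treated as occurring exactly at that bin's hour, and the variance is the population variance (dividing by n, not n - 1). The names counts, n, mean, and variance are hypothetical.

```python
# Hypothetical bin counts: counts[h] is the number of collisions in hour h.
counts = [0] * 24
counts[0], counts[8], counts[13], counts[23] = 2, 1, 2, 1

n = sum(counts)   # total number of collisions

# Mean hour: each hour weighted by how many collisions fall in its bin.
mean = sum(h * counts[h] for h in range(24)) / float(n)

# Population variance: count-weighted average squared deviation from the mean.
variance = sum(counts[h] * (h - mean) ** 2 for h in range(24)) / float(n)

print(mean, variance)
```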
#5-6 Using the zip code data set (see above), display a histogram of the number of collisions that occur each month. That is, your x-axis will have the numbers from 1 to 12 representing the months of the year and the y-axis will be the number of collisions. Make sure to include in the title of your plot the zip code plotted.

#5: Submit your Python program as a .py file.
#6: Submit a screen shot of the graphics window containing the plot.

Hint: Since all dates occur in the same format: "MM/DD/YYYY", you can extract the month from dateString by monthNum = int(dateString[:2]).
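
The monthly counts could be collected as in the sketch below, which uses collections.Counter from the standard library as an alternative to a hand-built dictionary. The dates list is made-up data standing in for the date column of your file, and the names month_counts and heights are hypothetical.

```python
from collections import Counter

# Hypothetical date strings in "MM/DD/YYYY" format.
dates = ["02/01/2016", "02/14/2015", "11/30/2015", "07/04/2015"]

# Count how many dates fall in each month number.
month_counts = Counter(int(date_string[:2]) for date_string in dates)

# A Counter returns 0 for months that never appear, so every month gets a value.
heights = [month_counts[m] for m in range(1, 13)]
print(heights)
```

Here heights[0] is the count for January and heights[11] the count for December.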

Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.