HW #12, Data Science at Lehman College, CUNY, Spring 2016

Data

For this assignment, you will need to download the following data sets:

Reported Potholes in NYC: Through the NYC Open Data project, a summary of 311 calls is available:
https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this assignment, we will focus on calls that reported potholes from January until April. You can use the filtering options on the website (filter for Descriptor = "Pothole"), use Excel, or write a quick Python script to trim the data set to just the potholes reported from January 1, 2016 onward.
Unemployment Data: Bokeh has a demonstration dataset of unemployment data:
https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/unemployment1948.csv

We will use these data sets for later homework assignments. Since scraping the data takes time, save them to use again in future programs.
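
If you take the Python route for the pothole data, the trimming step might look like the sketch below. It assumes the CSV export has columns named "Descriptor" and "Created Date" and that dates look like 01/15/2016 09:30:00 AM; check the header row and a few records of your own download, since both are assumptions here. The function names and the file names in the final comment are hypothetical.

```python
import csv
from datetime import datetime

# Assumed timestamp layout of the "Created Date" column in the CSV export.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def filter_potholes(rows, start=datetime(2016, 1, 1)):
    """Keep rows whose Descriptor is "Pothole" and that were created on or after start."""
    kept = []
    for row in rows:
        if row["Descriptor"].strip() != "Pothole":
            continue
        created = datetime.strptime(row["Created Date"], DATE_FORMAT)
        if created >= start:
            kept.append(row)
    return kept

def trim_file(in_path, out_path):
    """Read the full 311 export and write only the pothole rows to out_path."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        rows = filter_potholes(reader)
        fieldnames = reader.fieldnames
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Example (hypothetical file names):
# trim_file("311_Service_Requests.csv", "potholes2016.csv")
```

Writing the trimmed rows back out to a file means the slow download only has to happen once.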

Assignment

The work to be submitted differs by whether you are enrolled in the computer science or mathematics course.

CMP 464 Homework: MAT 456 Homework:

#1-3 The file landmarkDistances.txt contains a distance matrix for 10 landmarks in New York City. The file landmarks.txt contains the names of 9 of the landmarks. The last name is "mystery". Use multidimensional scaling to display all 10 points on the screen with landmarks labeled. Using your displayed information, identify the last mystery landmark.
(Inspired by a homework from Dasgupta's machine learning course at UCSD.)

Make sure to include in the title of your plot the date plotted.

#1: Submit your Python program as a .py file.
#2: Submit a screen shot of the graphics window containing the plot.
#3: Does the plot look right? If not, how can you re-orient it so that it does? What is the mystery landmark? Include your answer in a .txt file or a scan of a neatly handwritten paper.
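
As a reminder of what multidimensional scaling computes, here is a minimal sketch of classical (eigenvalue-based) MDS written with numpy. It is one common variant, not necessarily the one used in class, and it runs on four made-up points rather than the landmark files; the names classical_mds, points, and toy_dist are hypothetical.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Embed n points in k dimensions from an n-by-n matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to recover an inner-product matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy check with four made-up points (not the landmark data).
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
coords = classical_mds(toy_dist)
print(coords)
```

The recovered coordinates reproduce the pairwise distances, but a distance matrix carries no information about rotation or reflection, so the picture may come out turned or mirrored relative to a map. That is the point of question #3.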

#4-5

Using the 311 pothole data (described above), use logistic regression of status against time to create a model that predicts whether a pothole has been repaired, given the date it was reported. Treat status as "Closed" versus "Open", placing any non-"Closed" status value in the "Open" category.

You may want to use the textbook's code (described above). If you do, modify it to take only 1 input parameter (instead of the 2 it currently takes). This is a very small change: just change the starting beta_0 to match the dimension of the data, that is, to be [1,1].

#4: Submit your Python program as a .py file.
#5: Submit a screen shot of the graphics window containing a plot of the data points (a scatterplot of (date,status) where status is 0 if closed and 1 otherwise) with the logistic curve that you fit to the data.
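
Before any fitting, each record has to become a numeric (date, status) pair. The sketch below shows one way to do that encoding, assuming the rows are dictionaries with "Created Date" and "Status" keys and the same date layout as the CSV export (both assumptions). Measuring time in days since January 1, 2016 is a choice made here for illustration, and the name encode is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"   # assumed layout of "Created Date"
START = datetime(2016, 1, 1)

def encode(rows):
    """Return (days, status) lists: days since START, and 0 if Closed else 1."""
    days, status = [], []
    for row in rows:
        created = datetime.strptime(row["Created Date"], DATE_FORMAT)
        days.append((created - START).total_seconds() / 86400.0)
        # Any status other than "Closed" falls into the "Open" category.
        status.append(0 if row["Status"].strip() == "Closed" else 1)
    return days, status
```

The two lists can then be paired with a constant term and handed to whatever logistic regression code you use.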

#6-7

Using bokeh, display the unemployment data (described above) as a scatter plot of the data with a 6-month running average.

Note: The bokeh part is straightforward and very similar to the stock example: you need to modify the "# prepare some data" section, as well as the titles and labels. The work here is setting up your data to be plotted. The dates need to be datetime objects for the stretching and scaling of the x-axis to work.

Make sure to include a title in your plot.

#6: Submit your Python program as a .py file.
#7: Submit a screen shot of your program's output as a .png or .jpg file. (This homework originally asked for the .html file, but Blackboard is stripping the HTML formatting out of submitted files.)
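
The data preparation for the unemployment plot could be sketched as follows. It assumes the CSV has one row per year with a "Year" column and month columns named "Jan" through "Dec"; check the header of your copy, since that layout is an assumption here. The running average is a trailing one (the current month and the five before it), which is one reasonable reading of "6-month running average". The function names are hypothetical.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def monthly_series(rows):
    """Flatten one-row-per-year records into parallel lists of dates and rates."""
    dates, rates = [], []
    for row in rows:
        year = int(row["Year"])
        for number, name in enumerate(MONTHS, start=1):
            dates.append(datetime(year, number, 1))
            rates.append(float(row[name]))
    return dates, rates

def running_average(values, window=6):
    """Trailing mean over the last `window` values; shorter at the start of the series."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages
```

Rows read with csv.DictReader can be passed straight to monthly_series. Because the dates come back as datetime objects, they can be used as the x-values of a bokeh figure created with x_axis_type="datetime".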

Given the data:

x:  0  1  2  4  5  7  8  9  12  15
y:  0  0  0  1  1  0  1  0   1   1

and two possible logistic functions:

f(x) = 1/(1 +e^{-(6+.75x)})
g(x) = 1/(1 +e^{-(-8+x)})

#6: Which function predicts more of the data correctly? That is, of the 10 input values, how many of those predicted to be 1 (with greater than 50% probability) were actually 1 (true positives), and how many were actually 0 (false positives)? How many of those predicted to be 0 (with greater than 50% probability) were actually 0 (true negatives), and how many were actually 1 (false negatives)?

Submit a typeset or neatly handwritten answer.
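
To check a hand count, the tally can be sketched in a few lines of Python. The functions f and g are copied as written above, and the data lists are the x and y values given. The helper confusion_counts and its "tie" bucket (for a predicted probability of exactly 50%, which favors neither class) are additions made here for illustration.

```python
import math

xs = [0, 1, 2, 4, 5, 7, 8, 9, 12, 15]
ys = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]

def f(x):
    return 1 / (1 + math.exp(-(6 + 0.75 * x)))

def g(x):
    return 1 / (1 + math.exp(-(-8 + x)))

def confusion_counts(model, xs, ys):
    """Count outcomes, treating p > 0.5 as a predicted 1 and p < 0.5 as a predicted 0."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0, "tie": 0}
    for x, y in zip(xs, ys):
        p = model(x)
        if p > 0.5:
            counts["tp" if y == 1 else "fp"] += 1
        elif p < 0.5:
            counts["tn" if y == 0 else "fn"] += 1
        else:
            counts["tie"] += 1   # exactly 50%: neither class is favored
    return counts

for name, model in [("f", f), ("g", g)]:
    print(name, confusion_counts(model, xs, ys))
```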

#7: Use the book's logistic regression code to fit a logistic function to the data above.
Note: The book's code assumes 3-dimensional data, while we have only 2 dimensions. The code is well written and can be easily modified to handle only 2 dimensions by entering 2-dimensional data for data and changing the starting value of beta_0 to [1,1].

Submit a screenshot of a graph with the true points, your predicted curve (i.e., the logistic function with your computed beta_hat), and the f and g functions above. Include a legend or labels for your curves.
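
For comparison with the book's code, here is a stand-in sketch that fits the same kind of model with plain gradient ascent in numpy. It is not the book's implementation. Each data row is [1, x], so beta has two entries and the starting value [1,1] matches the note above; the step size and iteration count are choices made here, not values from the book.

```python
import numpy as np

x = np.array([0, 1, 2, 4, 5, 7, 8, 9, 12, 15], dtype=float)
y = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1], dtype=float)

# Each row is [1, x_i]: a constant term plus the single input.
X = np.column_stack([np.ones_like(x), x])

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta):
    p = logistic(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(beta_0, step=0.05, iterations=5000):
    """Gradient ascent on the average log-likelihood."""
    beta = np.array(beta_0, dtype=float)
    for _ in range(iterations):
        gradient = X.T @ (y - logistic(X @ beta)) / len(y)
        beta += step * gradient
    return beta

beta_hat = fit([1, 1])
print("beta_hat =", beta_hat)
```

The fitted curve to plot is then 1/(1 + e^{-(beta_hat[0] + beta_hat[1]*x)}), drawn over the same x-range as f and g.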

Homework #12

CMP 464-C401/MAT 456-01:
Topics Course: Data Science
Spring 2016

Textbook's Code

Data

Assignment

Submitting Homework
