The computer science assignment uses the Logistic Regression code developed by the textbook's author and available at:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code/logistic_regression.py
For this assignment, you will need to download the following data sets:
https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this assignment, we will focus on calls that reported potholes from January until April. You can use the filtering options on the website (filter for Descriptor = "Pothole"), use Excel, or write a quick Python script to trim the data set to just the potholes reported from January 1, 2016 onward.
https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/unemployment1948.csv
We will use these data sets in later homework assignments. Since scraping the data takes time, save them so you can reuse them in future programs.
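The "quick Python script" option mentioned above for trimming the 311 data might look something like the sketch below. It uses only the standard library. The column names `Created Date` and `Descriptor` and the timestamp format are assumptions about the CSV export, so check them against the header row of your own download. The function names `is_recent_pothole` and `trim_potholes` are illustrative, not part of any library.

```python
import csv
from datetime import datetime

# Assumed column names and timestamp format of the 311 CSV export;
# verify them against the header row of your own download.
DATE_COLUMN = "Created Date"
DESCRIPTOR_COLUMN = "Descriptor"
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"
CUTOFF = datetime(2016, 1, 1)


def is_recent_pothole(row, cutoff=CUTOFF):
    """Return True for a pothole report created on or after the cutoff."""
    if row[DESCRIPTOR_COLUMN].strip() != "Pothole":
        return False
    created = datetime.strptime(row[DATE_COLUMN], DATE_FORMAT)
    return created >= cutoff


def trim_potholes(in_file, out_file):
    """Copy matching rows from in_file to out_file; return how many were kept."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if is_recent_pothole(row):
            writer.writerow(row)
            kept += 1
    return kept
```

To use it, open the downloaded file for reading and a new file for writing (both with `newline=""`, as the `csv` module recommends) and pass the two file objects to `trim_potholes`. Working on file objects rather than file names also makes the function easy to try out on a few hand-written rows first.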
The work to be submitted differs by whether you are enrolled in the computer science or mathematics course.
CMP 464 Homework: | MAT 456 Homework: | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
#1-3 |
The file landmarkDistances.txt contains a distance matrix for 10 landmarks in New York City.
The file landmarks.txt contains the names of 9 of the landmarks. The last name is "mystery". Use multidimensional scaling to display all 10 points on the screen with the landmarks labeled. Using the displayed information, identify the mystery landmark. (Inspired by a homework from Dasgupta's machine learning course at UCSD.) Make sure to include the date plotted in the title of your plot.
#1: Submit your Python program as a .py file.
#2: Submit a screenshot of the graphics window containing the plot.
#3: Does the plot look right? If not, how can you re-orient it so that it does? What is the mystery landmark? Include your answer in a .txt file or a scan of a neatly handwritten paper. |
|||||||||
#4-5 |
Using the 311 pothole data (described above), run a logistic regression of status against time, where status is "Closed" versus "Open" (treat any status other than "Closed" as "Open"), to create a model that predicts whether a pothole has been repaired, given the date it was reported.
You may want to use the textbook's code (described above). If you do, modify it to take only 1 input parameter (instead of the 2 it currently takes). This is a very small change: just change the starting beta_0 to [1,1] so that it matches the dimension of the data.
|
#6-7 |
Using bokeh, display the unemployment data (described above) as a scatter plot of the data with a 6-month running average.
Note: The bokeh part is straightforward (very similar to the stock example; you need to modify the "#prepare some data" section, as well as the titles and labels). The work here is setting up your data to be plotted. The dates need to be datetime objects for the stretching and scaling of the x-axis to work.
|
Given the data:
and two possible logistic functions:
Submit a typeset or neatly handwritten answer. #7: Use the book's logistic regression code to fit a logistic function to the data above. Note: The book's code assumes 3-dimensional data, while we have only 2 dimensions. The code is well written and can easily be modified to handle 2 dimensions by entering 2-dimensional data for data and changing the starting value of beta_0 to [1,1]. Submit a screenshot of a graph with the true points, your predicted curve (i.e., the logistic function with your computed beta_hat), and the f and g functions above. Include a legend or labels for your curves. |
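The dimension change described above (each data row becomes [1, x] and the starting beta_0 becomes [1,1]) can be illustrated with a small standalone sketch. This is not the textbook's code: it is a plain batch gradient ascent on the log-likelihood, written only to show the shapes involved. The names `fit_logistic`, `step`, and `iterations`, their default values, and the toy data are assumptions made for illustration, and the toy data is not the assignment's data.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, beta_0=(1.0, 1.0), step=0.1, iterations=5000):
    """Fit p(y=1 | x) = logistic(b0 + b1 * x) by batch gradient ascent.

    Each input x is turned into the 2-dimensional row [1, x], so beta has
    two entries: the intercept b0 and the slope b1.
    """
    rows = [[1.0, x] for x in xs]
    beta = list(beta_0)
    n = len(rows)
    for _ in range(iterations):
        gradient = [0.0, 0.0]
        for row, y in zip(rows, ys):
            predicted = logistic(sum(b * r for b, r in zip(beta, row)))
            error = y - predicted
            for j, r in enumerate(row):
                gradient[j] += error * r
        # Average the gradient so the step size does not depend on n.
        beta = [b + step * g / n for b, g in zip(beta, gradient)]
    return beta


# Toy data invented for illustration (not the assignment's data).
toy_xs = [0, 1, 2, 3, 4, 5, 6, 7]
toy_ys = [0, 0, 1, 0, 1, 1, 1, 1]
beta_hat = fit_logistic(toy_xs, toy_ys)
print("beta_hat =", beta_hat)
```

The predicted curve to plot is then `logistic(beta_hat[0] + beta_hat[1] * x)` evaluated over a range of x values. For inputs on a much larger scale (for example, raw timestamps), rescaling x or reducing `step` would likely be needed for the iteration to behave well.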