Homework #10

CMP 464-C401/MAT 456-01:
Topics Course: Data Science
Spring 2016

Topics: Nearest Neighbors
Deadline: Thursday, 14 April 2016, 10:30am

Textbook's Code

For this assignment, the following code from the textbook will be useful:

Data

For this assignment, you will need to download two different data sets:

  1. Bachelor's Degrees Awarded: The bokeh sample data repository includes a CSV file with the percentage of bachelor's degrees awarded to women, by major and by year:
    https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/percent-bachelors-degrees-women-usa.csv
  2. NYC Trash Cans: The NYC open data project keeps an inventory of all public litter basket locations in the city:
    https://data.cityofnewyork.us/Environment/Litter-Basket-Inventory/es7t-6u8y

We will use these data sets in later homework assignments. Since scraping the data takes time, save both data sets so that you can reuse them in future programs.
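
One way to avoid repeated downloads is to fetch a file only when no local copy exists. The following is a minimal sketch under that assumption; the helper name fetch_once and its opener parameter are hypothetical, introduced here only for illustration.

```python
import os
from urllib.request import urlopen


def fetch_once(url, path, opener=urlopen):
    """Download url to path unless a local copy already exists; return path.

    fetch_once and opener are illustrative names, not part of any course code.
    """
    if not os.path.exists(path):
        # No cached copy yet: read the whole response and write it to disk.
        with opener(url) as response:
            payload = response.read()
        with open(path, "wb") as out:
            out.write(payload)
    return path
```

Note that the GitHub link above opens the file in GitHub's web viewer, so the raw-file address (or the export link on the NYC open data page) is the one to pass as url. Later programs can then open the saved file directly.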

Assignment

The work to be submitted differs by whether you are enrolled in the computer science or mathematics course.

CMP 464 Homework: MAT 456 Homework:
#1-2 Using the bokeh data set on the percentage of bachelor's degrees awarded to women, create a single display with the data for computer science and for mathematics/statistics majors from 1970 to 2011.

#1: Submit your Python program as a .py file.
#2: Submit the .html file produced by your program.

Hint: Use the stock example covered in class as a template. There is no need to include the running average in your plot, since there are not enough data points to warrant it.
Note: The link for the stocks example above is the one used in class. A newer version is available here.
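
Before plotting, the two majors have to be pulled out of the CSV file as separate series. The sketch below shows one way to do that with the standard csv module. The column names "Year", "Computer Science" and "Math and Statistics" are assumptions to check against the file's header row, and the sample rows are made-up placeholder values, not real data.

```python
import csv
import io


def read_columns(csv_file, x_name, y_names):
    """Return (xs, {name: ys}) with numeric values from the named columns."""
    reader = csv.DictReader(csv_file)
    xs = []
    series = {name: [] for name in y_names}
    for row in reader:
        xs.append(int(row[x_name]))
        for name in y_names:
            series[name].append(float(row[name]))
    return xs, series


# Placeholder rows with invented values, only to show the expected shape.
sample = io.StringIO(
    "Year,Computer Science,Math and Statistics\n"
    "1970,10.0,30.0\n"
    "1971,12.5,31.5\n"
)
years, series = read_columns(
    sample, "Year", ["Computer Science", "Math and Statistics"]
)
```

With the real file, pass an open file object instead of the in-memory sample. Each list in series can then be drawn against years as one line of the plot, following the stocks template.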
#3-4 The textbook's nearest_neighbor.py program plots the favorite programming languages by city, but does not draw the state or country borders. Modify this program to include the state and country borders.

#3: Submit your Python program as a .py file.
#4: Submit a screen shot of the graphics window containing the plot.

Hint: You may use packages such as basemap. Other than import statements, you should only need to modify the plot_state_borders() function.
#5-6 The OpenData project keeps track of the public trash can locations. Where are the 10 loneliest trash cans in New York City? For each trash can location in the file, compute and store the distance to its nearest neighbor (i.e. store the location of each trash can, and then loop through that list for each trash can to build up a second list of minimum distances). Sort your list and plot the 10 trash cans with the largest distance to their nearest neighbor.

Make sure to include in the title in your plot.
#5: Submit your Python program as a .py file.
#6: Submit a screen shot of the graphics window containing the plot.
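
The nested-loop computation described in the trash can problem can be sketched as follows. This is a minimal illustration, not a full solution: it assumes each location is already an (x, y) pair and uses straight-line distance on those coordinates, whereas for latitude/longitude values a geographic distance may be more appropriate. The function names and the toy coordinates are made up for the example.

```python
import math


def nearest_neighbor_distances(points):
    """For each point, return the distance to its closest other point."""
    distances = []
    for i, (x1, y1) in enumerate(points):
        best = math.inf
        for j, (x2, y2) in enumerate(points):
            if i != j:  # skip the point itself
                best = min(best, math.hypot(x1 - x2, y1 - y2))
        distances.append(best)
    return distances


def loneliest(points, k=10):
    """Return the k (distance, point) pairs with the largest nearest-neighbor distance."""
    paired = zip(nearest_neighbor_distances(points), points)
    return sorted(paired, reverse=True)[:k]


# Made-up coordinates, not taken from the litter basket file.
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
print(loneliest(toy, k=2))
```

The double loop compares every pair of points, so the running time grows with the square of the number of trash cans.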
A metric on a set X is a function (called the distance function or simply the distance), d: X × X → R+, where R+ is the set of non-negative real numbers (distances cannot be negative, so the codomain is R+ rather than all of R), such that for all x, y, z in X the following conditions are satisfied:
  1. d(x,y) >= 0
  2. d(x,y) = 0 if and only if x=y
  3. d(x,y) = d(y,x)
  4. d(x,z) <= d(x,y) + d(y,z)
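
The four conditions can also be spot-checked numerically on a finite set of sample points. The sketch below is only an illustration: the helper name violated_axioms is hypothetical, and passing the check on a sample is evidence, not a proof. It is applied here to the L-infinity (maximum) distance.

```python
from itertools import product


def violated_axioms(d, points, tol=1e-12):
    """Return the set of condition numbers (1-4) that fail on the sample points."""
    failed = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            failed.add(1)  # non-negativity
        if (abs(d(x, y)) <= tol) != (x == y):
            failed.add(2)  # zero distance exactly when x = y
        if abs(d(x, y) - d(y, x)) > tol:
            failed.add(3)  # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            failed.add(4)  # triangle inequality
    return failed


def chebyshev(p, q):
    """L-infinity (maximum) distance between two equal-length tuples."""
    return max(abs(a - b) for a, b in zip(p, q))


# A small integer grid in the plane serves as the sample.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
print(violated_axioms(chebyshev, grid))  # → set()
```

A non-empty result pinpoints which condition to look at when hunting for a counter-example.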

#5: The L-infinity distance (maximum distance) is a metric. Is the following also a metric?

d_m(p,q) = min{|p_1-q_1|, |p_2-q_2|, ..., |p_n-q_n|}

Why or why not? If yes, show that it satisfies the 4 properties above. If not, give a counter-example to one of the 4 properties.
Submit a .png or .pdf file of your typeset or neatly handwritten answer.

#6: Show that the unit ball under the L-1 norm is contained in the unit ball under the L-2 norm. That is, any point that is within distance 1 of the origin under the L-1 norm is also within distance 1 of the origin under the L-2 norm.
Submit a .png or .pdf file of your typeset or neatly handwritten answer.
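
Before writing the proof for #6, a numerical sanity check can help build intuition. The sketch below samples points whose L-1 norm is 1 and records the largest L-2 norm observed. The sampling scheme and helper names are assumptions made for illustration, and a check on random samples does not replace the proof.

```python
import math
import random


def l1_norm(v):
    """Sum of absolute values of the coordinates."""
    return sum(abs(c) for c in v)


def l2_norm(v):
    """Square root of the sum of squared coordinates."""
    return math.sqrt(sum(c * c for c in v))


def random_point_with_l1_norm_one(n, rng):
    """Random n-dimensional point rescaled to L-1 norm 1 (not uniformly distributed)."""
    raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    scale = l1_norm(raw)
    return [c / scale for c in raw]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [random_point_with_l1_norm_one(3, rng) for _ in range(1000)]
worst = max(l2_norm(p) for p in samples)
print(worst <= 1.0 + 1e-9)
```

Comparing a corner such as (1, 0, 0) with a point such as (0.5, 0.5, 0) shows where the two norms agree and where the L-2 norm is strictly smaller.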

Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.