Homework #5

CMP 464-C401/MAT 456-01:
Topics Course: Data Science
Spring 2016

Topics: Bayes' Theorem, Simpson's Paradox, & Regular Expressions
Deadline: Thursday, 10 March 2016, 10:30am

Textbook's Code

This assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code/naive_bayes.py

Datasets

This assignment uses the following datasets:

The Social Security Administration keeps track of the most popular names given each year as well as by state. For this assignment, you will need 21 years of state name data. You can use the nystate.tar.zip file for 1990 to 2010 for New York state, or you may download a different state (or time range) from the SSA data page.

Spam Data

This assignment uses data collected and made publicly available by Apache, which can be found at:

http://spamassassin.apache.org/publiccorpus/

For this assignment, you will need to download three different data sets:

  1. 20021010_easy_ham.tar.bz2
  2. 20021010_hard_ham.tar.bz2
  3. 20021010_spam.tar.bz2
(If you are on a Windows machine, you might need a program like 7-Zip to decompress and extract the data files.)
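As an alternative to a separate extraction program, the archives can also be unpacked with Python's standard tarfile module. The sketch below assumes Python 3 and that the archives have already been downloaded; the function name extract_corpus and the destination folder are illustrative choices, not part of the assignment.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Extract a .tar.bz2 archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Example call (hypothetical destination folder):
# extract_corpus("20021010_spam.tar.bz2", "spam_data")
```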

We will use these data sets in later homework assignments. Since scraping the data takes time, save them so you can reuse them in future programs.

Assignment

The work to be submitted depends on whether you are enrolled in the computer science or the mathematics course.

CMP 464 Homework: MAT 456 Homework:
#1-2 Use regular expressions to search for name occurrences in the Social Security Administration data. Choose a name that can be spelled 3 or more ways (for example, "Katherine" has alternative spellings of "Catherine", "Katharine", "Catharine", "Kathryn", etc.). Use regex to combine the totals from different spellings and graph over 21 years of state data.

#1: Submit your Python program that computes correlation and produces the graph as a .py file.
#2: Submit a screenshot of the graphics window containing the plot. Your graph should contain both datasets shown on the same plot, using a different axis for each. Include the correlation in the label of the graph.
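One way to combine spellings is to write a single pattern that covers the variants and sum the matching counts per year. The sketch below assumes each line of the state file has the layout state,sex,year,name,count; check this against the file you downloaded. The pattern, the function name totals_by_year, and the sample counts are made up for illustration.

```python
import re
from collections import defaultdict

# Matches Katherine, Catherine, Katharine, Catharine, Kathryn, Cathryn.
# [CK] covers the first letter; (?:[ae]rine|ryn) covers the endings.
NAME_PATTERN = re.compile(r"[CK]ath(?:[ae]rine|ryn)")


def totals_by_year(lines, pattern=NAME_PATTERN):
    """Sum counts per year over every name the pattern matches in full."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed record layout: state,sex,year,name,count
        state, sex, year, name, count = line.strip().split(",")
        if pattern.fullmatch(name):
            totals[int(year)] += int(count)
    return dict(totals)


# Made-up sample records, not real SSA figures.
sample = [
    "NY,F,1990,Katherine,10",
    "NY,F,1990,Catherine,7",
    "NY,F,1990,Kathryn,3",
    "NY,F,1990,Kathleen,50",
    "NY,F,1991,Katharine,2",
]
print(totals_by_year(sample))  # {1990: 20, 1991: 2}
```

Using fullmatch rather than search keeps longer names that merely contain one of the spellings from being counted.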
#3-4 Are collisions correlated to temperature? Limit your zipcode data set to dates in January 2016 only (either write a quick filter program or download the data again with limited dates). Using a dictionary structure of your choice, count the number of collisions that occurred in your zipcode on each day. On the same plot, plot the number of collisions and the daily temperature (see twoPlots.py for graphing plots with different scales on the same image).

#3: Submit your Python program as a .py file.
#4: Submit a screen shot of the graphics window containing the plot.
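The per-day counting step can be sketched with a Counter, which is a dictionary subclass. The example assumes the collision dates have already been read from the data file as strings in MM/DD/YYYY form; that format, the function name collisions_per_day, and the sample dates are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def collisions_per_day(date_strings, year=2016, month=1):
    """Count collisions per calendar day, keeping only the given month."""
    counts = Counter()
    for text in date_strings:
        # Assumed date format: MM/DD/YYYY
        day = datetime.strptime(text, "%m/%d/%Y").date()
        if day.year == year and day.month == month:
            counts[day] += 1
    return counts


# Made-up sample dates, one string per collision record.
sample_dates = ["01/01/2016", "01/01/2016", "01/02/2016", "02/01/2016", "12/31/2015"]
for day, count in sorted(collisions_per_day(sample_dates).items()):
    print(day, count)
```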
MAT 456: Submit your files as scans of (neatly) written answers or as PDF files produced with LaTeX:

#3: Say the number of students applying to graduate programs over the last 5 years has decreased by 2%, but the number of students applying to computer science is up by 3.9%, mathematics up by 4.5%, statistics up by 10.1%, and psychology up by 1%. How is this possible? Give population numbers for now and 5 years ago that could support these numbers.
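When testing candidate population numbers, it helps to compute the percent change for each field and for the total. The sketch below uses two hypothetical fields with made-up counts (deliberately not the figures from the problem) to show that a total can fall while a listed field rises, because the total also includes fields that shrank.

```python
def percent_change(old, new):
    """Percent change from old to new."""
    return (new - old) / old * 100


# Hypothetical applicant counts: {field: (five years ago, now)}
counts = {
    "field A": (100, 110),   # grows
    "field B": (900, 800),   # shrinks
}

for field, (old, new) in counts.items():
    print(field, round(percent_change(old, new), 1))

total_old = sum(old for old, _ in counts.values())
total_new = sum(new for _, new in counts.values())
print("total", round(percent_change(total_old, total_new), 1))
```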

#4: Assume a rare disease occurs in 1 out of every 10,000 people. A test has been developed that is 99.5% accurate. That is, if you have the disease, it comes back positive 99.5% of the time and negative 0.5% of the time. Similarly, if you do not have the disease, it comes back positive 0.5% of the time and negative 99.5% of the time. You have just tested positive for the disease. With what probability do you have the disease? Justify your answer.
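Bayes' Theorem for a test of this kind can be written as a small function, which is useful for checking a hand calculation. This is a minimal sketch; the function name posterior and the illustrative inputs (which are not the numbers from the problem) are assumptions.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive test), by Bayes' Theorem."""
    true_positive = p_pos_given_disease * prior
    false_positive = p_pos_given_healthy * (1 - prior)
    return true_positive / (true_positive + false_positive)


# Illustrative numbers only: 1% prevalence, 90% true-positive rate,
# 10% false-positive rate.
print(posterior(0.01, 0.9, 0.1))
```

The denominator is the total probability of a positive test, split into the cases where the disease is present and where it is absent.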

#5: You have two bags of candy: one is half reds and half blues. The other is two-thirds reds and one-third blues. You choose a bag at random, and then a candy at random. The candy is red. What is the probability it came from the first bag? Justify your answer.
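A quick simulation can serve as a sanity check on a worked answer, though it does not replace the justification. The sketch below assumes each bag is equally likely to be chosen and that draws are independent; the function name and default parameters are illustrative.

```python
import random


def estimate_first_bag_given_red(red_fractions=(1 / 2, 2 / 3), trials=100_000, seed=0):
    """Estimate P(first bag | red candy) by repeated random draws."""
    rng = random.Random(seed)
    red_draws = 0
    red_from_first = 0
    for _ in range(trials):
        bag = rng.randrange(len(red_fractions))   # pick a bag uniformly
        if rng.random() < red_fractions[bag]:     # the drawn candy is red
            red_draws += 1
            if bag == 0:
                red_from_first += 1
    return red_from_first / red_draws


print(estimate_first_bag_given_red())
```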

#6: Assume you flip a coin repeatedly until 2 tosses in a row are tails. Assume the probability that the coin comes up heads is some number p with 0 < p < 1. What is the probability that the experiment ends on the nth toss?

Hints:
  • The coin tosses can be viewed as Bernoulli trials with probability of success p.
  • Work out the probabilities for small cases to see the pattern.
  • Think about how to break down the (n+1)st case in terms of previous cases.
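The small cases suggested in the hints can also be tabulated numerically. The sketch below tracks, after each toss, the probability that the experiment is still running with the last toss heads or tails, and derives the probability of ending on the next toss from those. It gives numbers to compare against, not a closed form; the function name end_probabilities is an illustrative choice.

```python
def end_probabilities(p, max_n):
    """Return {n: P(experiment ends on toss n)} for n = 2..max_n, with p = P(heads)."""
    q = 1 - p
    # After one toss the experiment cannot have ended yet.
    last_heads, last_tails = p, q
    result = {}
    for n in range(2, max_n + 1):
        # Ending on toss n needs a tail on toss n-1 followed by a tail.
        result[n] = last_tails * q
        # Still running after toss n: a head after anything, or a tail after a head.
        last_heads, last_tails = (last_heads + last_tails) * p, last_heads * q
    return result


for n, prob in end_probabilities(0.5, 6).items():
    print(n, prob)
```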
#5-6 (CMP 464) Extend the textbook's analysis of the spam data set so that plurals (all words of 4 or more characters that end in a single s) and -est words (all words of 5 or more letters that end in -est) count as the base word. See the discussion in the textbook.

#5: Submit your Python program as a .py file.
#6: Submit a text file with your results and conclusion: how much more spam did you find with this extension? Did it increase the amount of real mail ('ham') that was identified as spam?
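The two word rules can be expressed as regular expressions applied to each token before counting. This is a minimal sketch under stated assumptions: the tokenizer here is a simple lowercase word splitter and may not match the textbook's exactly, only one rule is applied per word, and the names normalize and tokenize are illustrative.

```python
import re

PLURAL = re.compile(r"(.{2,}[^s])s")   # 4+ characters ending in a single s
EST = re.compile(r"([a-z]{2,})est")    # 5+ letters ending in -est


def normalize(word):
    """Reduce a plural or -est word to its (crudely approximated) base word."""
    match = PLURAL.fullmatch(word)
    if match:
        return match.group(1)
    match = EST.fullmatch(word)
    if match:
        return match.group(1)
    return word


def tokenize(message):
    """Lowercase the message, split it into words, and normalize each word."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    return {normalize(word) for word in words}


print(sorted(tokenize("Cheapest pills, best deals")))  # ['best', 'cheap', 'deal', 'pill']
```

The rules are deliberately crude: a word such as "interest" also loses its ending, which is worth keeping in mind when interpreting the results.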

Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.