Homework #6

CMP 464-C401/MAT 456-01:
Topics Course: Data Science
Spring 2016

Topics: A/B Testing, Simple Classifiers
Deadline: Thursday, 17 March 2016, 10:30am

Textbook's Code

The computer science assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code/naive_bayes.py
You may also find the hypothesis and inference code useful:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code/hypothesis_and_inference.py

Datasets

This assignment uses the following datasets:

Social Security Name Dataset

We will use these datasets in later homework assignments. Since scraping the data takes time, save them for use in future programs.

Assignment

The work to be submitted differs by whether you are enrolled in the computer science or mathematics course.

CMP 464 Homework and MAT 456 Homework: problem #1 comes first, followed by CMP 464 problems #2-7 and then MAT 456 problems #2-7.
#1: The Department of Transportation (DOT), as part of Vision Zero, is interested in reducing accidents and speeding on roadways across the city. They want to know which signage has a larger effect on speeding. They collected data with two different messages: the first sign says "Speeding Kills" and the second sign displays the speed at which the car is moving. Data were collected for both signs:
  • For the first sign ("Speeding Kills"), 140 out of 1200 cars were observed going the speed limit.
  • For the second sign (showing the current speed), 150 out of 1100 cars were observed going the speed limit.
The second seems more effective. Could this have happened by chance? What is the probability that you would see such a difference if the signs were equally effective at slowing traffic? Justify your answer.

#1: Submit a .pdf or .png file of your neatly handwritten or typed answer.
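
As a reminder of how the A/B test statistic discussed in lecture can be computed, here is a minimal Python 3 sketch in the spirit of the textbook's hypothesis and inference code. The function names are illustrative, and the counts (200 of 1000 versus 180 of 1000) are hypothetical values chosen for the example, not the counts from problem #1.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def estimated_parameters(N, n):
    """Estimated success rate and its standard error for n successes in N trials."""
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    """Difference in estimated rates, measured in standard errors."""
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

def two_sided_p_value(z):
    """Probability of a statistic at least this extreme under a standard normal."""
    return 2 * (1 - normal_cdf(abs(z)))

# Hypothetical counts for illustration only.
z = a_b_test_statistic(1000, 200, 1000, 180)
p_value = two_sided_p_value(z)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # z = -1.14, p-value = 0.254
```

A p-value this large would not be grounds for rejecting the hypothesis that the two rates are equal. The sketch relies on the normal approximation, which is only reasonable when the counts are not too small.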
#2-7: Write a classifier program that predicts whether a name is a boy's or a girl's name based on the last letters of the name.
#2: Write a Python program that takes as input a Social Security Administration name file (see above for files and format) and outputs three files. The first file should have 26 lines (one for each letter of the alphabet). Each line contains three values: the letter, the fraction of boys' names that end in that letter in the training set (the input file), and the fraction of girls' names that end in that letter in the training set. For example, a possible file could start:
a, 0.023, 0.451
b, 0.010, 0.008
...
The second file gives the fraction of boys' and girls' names that have each letter as the second-to-last letter.
The third file gives the fraction of boys' and girls' names that have each letter as the third-to-last letter.
#2: Submit your Python program as a .py file.
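
One possible way to organize the letter-fraction computation is sketched below in Python 3. It rests on several assumptions: each line of the name file is taken to have the form name,sex,count (for example Emma,F,17830, with made-up numbers), every distinct name is counted once regardless of its count, and names shorter than the requested position are skipped. The function names and the toy name lists are hypothetical.

```python
from collections import Counter
import string

def read_names(path):
    """Read a name file, assumed to hold lines of the form name,sex,count."""
    boys, girls = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            name, sex, _count = line.split(",")
            (boys if sex == "M" else girls).append(name)
    return boys, girls

def letter_fractions(names, position):
    """Fraction of names with each letter at `position` from the end (1 = last)."""
    counts = Counter(name[-position].lower()
                     for name in names if len(name) >= position)
    total = sum(counts.values())
    return {letter: (counts[letter] / total if total else 0.0)
            for letter in string.ascii_lowercase}

def write_fraction_file(path, boys, girls, position):
    """Write 26 lines of: letter, boys' fraction, girls' fraction."""
    boy_frac = letter_fractions(boys, position)
    girl_frac = letter_fractions(girls, position)
    with open(path, "w") as f:
        for letter in string.ascii_lowercase:
            f.write(f"{letter}, {boy_frac[letter]:.3f}, {girl_frac[letter]:.3f}\n")

# Toy lists for illustration only (not taken from the dataset).
toy_boys = ["Noah", "Liam", "Ethan"]
toy_girls = ["Emma", "Olivia", "Ava"]
print(letter_fractions(toy_girls, 1)["a"])  # all three toy girls' names end in "a"
```

Whether to weight each name by its count is a modeling choice that this sketch does not settle; state whichever choice you make in your program's comments.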

#3: Run your program with yob2009.txt and submit the first file generated by your program (that is, the fractions for the last letter of the name). Submit your file as a .txt file.

#4: Run your program with yob2009.txt and submit the second file generated by your program (that is, the fractions for the second-to-last letter of the name). Submit your file as a .txt file.

#5: Run your program with yob2009.txt and submit the third file generated by your program (that is, the fractions for the third-to-last letter of the name). Submit your file as a .txt file.

#6: Modify the Naive Bayes spam filter program to read in the three files generated above as well as a fourth file of test data (use another year's data, such as yob2010.txt). Instead of multiplying together word occurrences, your program should multiply the fractions for the last letter, second-to-last letter, and third-to-last letter that you computed above. Your program should classify each name in the test data (similar to the Naive Bayes filter from the book) and report the percentage of names you predicted correctly as well as the names you predicted incorrectly.
Submit your Python program as a .py file.
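
The classification step in #6 could look roughly like the Python 3 sketch below. It is not the textbook's code: the function names, the equal prior for the two classes, the small floor used in place of zero fractions, and the toy fraction tables are all assumptions made for illustration. It adds logarithms rather than multiplying the fractions directly, which gives the same ranking while avoiding underflow.

```python
import math

def log_score(name, fractions_by_position, prior, floor=1e-6):
    """Log of prior times the fractions for the last, 2nd-to-last, 3rd-to-last letters."""
    name = name.lower()
    score = math.log(prior)
    for position, fractions in enumerate(fractions_by_position, start=1):
        if len(name) >= position:
            # A floor stands in for zero or missing fractions (an assumption).
            score += math.log(max(fractions.get(name[-position], 0.0), floor))
    return score

def classify(name, boy_fractions, girl_fractions, prior_boy=0.5):
    """Return "M" or "F", whichever class has the larger score."""
    boy = log_score(name, boy_fractions, prior_boy)
    girl = log_score(name, girl_fractions, 1 - prior_boy)
    return "M" if boy > girl else "F"

def evaluate(test_rows, boy_fractions, girl_fractions):
    """Return the fraction classified correctly and the misclassified rows."""
    wrong = [(name, sex) for name, sex in test_rows
             if classify(name, boy_fractions, girl_fractions) != sex]
    return 1 - len(wrong) / len(test_rows), wrong

# Toy fraction tables (last, second-to-last, third-to-last); values are made up.
toy_boy = [{"a": 0.02, "n": 0.35}, {"a": 0.15, "m": 0.05}, {"h": 0.10, "m": 0.05}]
toy_girl = [{"a": 0.45, "n": 0.10}, {"a": 0.05, "m": 0.05}, {"h": 0.02, "m": 0.05}]
toy_rows = [("Emma", "F"), ("Ethan", "M"), ("Dana", "M")]

accuracy, wrong = evaluate(toy_rows, toy_boy, toy_girl)
print(f"{100 * accuracy:.1f}% correct; misclassified: {wrong}")
```

In a full program the three tables per class would be read from the files produced in #2, and the test rows from the second year's name file.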

#7: Submit the output of your program (the percentage you correctly predicted along with the names you predicted incorrectly) as a .txt file.

MAT 456 Homework:
#2: A popular business analytics company, unbounce, provides this example of an A/B test:
  • A: 385 visitors, 9 conversions, and
  • B: 385 visitors, 20 conversions.
Assuming the null hypothesis that the underlying success rates for variations A and B are equal, what is the probability that you would observe this data? Show your work.
Submit a .pdf or .png image of your neatly handwritten or typed answer.

#3: What if you saw similar proportions as in #2 but ran the experiment for only a tenth as many pages:
  • A: 39 visitors, 1 conversion, and
  • B: 39 visitors, 2 conversions.
Assuming the null hypothesis that the underlying success rates for variations A and B are equal, what is the probability that you would observe this data? Does your answer differ from #2? Explain.
Submit a .pdf or .png image of your neatly handwritten or typed answer.

#4: What if you saw similar proportions as in #2 but ran the experiment for ten times as many pages:
  • A: 3850 visitors, 90 conversions, and
  • B: 3850 visitors, 200 conversions.
Assuming the null hypothesis that the underlying success rates for variations A and B are equal, what is the probability that you would observe this data? Does your answer differ from #2 and #3? Explain.
Submit a .pdf or .png image of your neatly handwritten or typed answer.

#5: VWO, a popular web service for business analytics, describes A/B testing of showing visitors two different webpages:
  • Variation A has 23% conversion (23% of the visitors clicked on the page), and
  • Variation B has 11% conversion (11% of the visitors clicked on the page).
For this experiment, 50% of the visitors were shown Variation A and 50% were shown Variation B. What sample size, N (i.e., the number of pages shown), is needed to say that this was unlikely to happen by chance (i.e., with p-value < 0.05)? Use the A/B test statistic (discussed in lecture and in the book) to compute N.
Submit a .pdf or .png image of your neatly handwritten or typed answer.
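
A handwritten sample-size calculation can be sanity-checked numerically. The Python 3 sketch below searches for the smallest per-variation sample size at which fixed observed rates give a two-sided p-value below a threshold. The rates used (10% and 12%) are hypothetical and are not those of problem #5; the function names are illustrative; and the sketch assumes an equal split between the variations and the normal approximation behind the A/B test statistic.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return (1 + math.erf(x / math.sqrt(2))) / 2

def p_value_for(n_per_group, rate_a, rate_b):
    """Two-sided p-value if each variation is shown n_per_group times."""
    standard_error = math.sqrt(rate_a * (1 - rate_a) / n_per_group
                               + rate_b * (1 - rate_b) / n_per_group)
    z = (rate_b - rate_a) / standard_error
    return 2 * (1 - normal_cdf(abs(z)))

def smallest_group_size(rate_a, rate_b, alpha=0.05):
    """Smallest per-variation size with p-value below alpha (linear search)."""
    if rate_a == rate_b:
        raise ValueError("equal rates never reach significance")
    n = 1
    while p_value_for(n, rate_a, rate_b) >= alpha:
        n += 1
    return n

# Hypothetical rates for illustration only.
n = smallest_group_size(0.10, 0.12)
print(f"per variation: {n}, total N: {2 * n}")
```

Because half of the visitors see each variation, the total N is twice the per-variation size returned by the search.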

#6: The Obama campaign used A/B testing to redesign the basic elements of their campaign website. A major goal was to increase the number of visitors who signed up with their emails. Replacing the default "Sign Up" with "Learn More" increased the number of signups per visitor by 18.6 percent. What sample size was needed to say that this was unlikely to occur by chance? That is, if the null hypothesis is that the two slogans are equally effective, what sample size yields a p-value < 0.05? Show your work.
Submit a .pdf or .png image of your neatly handwritten or typed answer.

#7: Exploring the class website, you find the "central limit theorem, visualized" demonstration and decide to run it with bins = 3. After 100 iterations, you have
  • 35% in the first bin (the ball went "left" then "left"),
  • 42% in the second bin (the ball went "left, then right" or "right, then left"), and
  • 23% in the third bin (the ball went "right, then right").
Assuming that each left is a 0 and each right is a 1, the first bin has value 0, the second bin value 1, and the third bin value 2:
  • What is the average value of the sample you collected?
  • What is the standard deviation of the sample?
  • Assuming that going left or right is equally likely, what is the probability of seeing the sample that you collected?
Submit a .pdf or .png image of your neatly handwritten or typed answer.
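
For checking arithmetic of this kind, the Python 3 sketch below computes the mean and standard deviation of binned values and, under one possible reading of "the probability of seeing the sample", the multinomial probability of the exact bin counts when left and right are equally likely. The bin counts (30, 45, 25) are hypothetical rather than those of problem #7, the function names are illustrative, and the standard deviation uses the n - 1 denominator, which is itself a choice.

```python
import math

def sample_mean_and_std(counts):
    """Mean and sample standard deviation when bin i holds counts[i] values equal to i."""
    total = sum(counts)
    mean = sum(value * count for value, count in enumerate(counts)) / total
    squared_dev = sum(count * (value - mean) ** 2
                      for value, count in enumerate(counts))
    return mean, math.sqrt(squared_dev / (total - 1))

def multinomial_probability(counts, probabilities):
    """Probability of exactly these bin counts given the per-bin probabilities."""
    coefficient = math.factorial(sum(counts))
    for count in counts:
        coefficient //= math.factorial(count)
    probability = float(coefficient)
    for count, p in zip(counts, probabilities):
        probability *= p ** count
    return probability

# Hypothetical bin counts out of 100 drops, for illustration only.
toy_counts = [30, 45, 25]
# Two fair left/right choices: P(0) = 1/4, P(1) = 1/2, P(2) = 1/4.
fair_bins = [0.25, 0.5, 0.25]

mean, std = sample_mean_and_std(toy_counts)
prob = multinomial_probability(toy_counts, fair_bins)
print(f"mean = {mean:.2f}, std = {std:.4f}, probability = {prob:.3e}")
```

Another reasonable reading compares the sample mean with its expected value using the normal approximation; whichever interpretation you use, state it in your answer.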

Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.