The computer science assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code/naive_bayes.py
You may also find the hypothesis and inference code useful:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code/hypothesis_and_inference.py
We will use these data sets in later homework assignments. Since scraping the data takes time, save the data sets so you can reuse them in future programs.
The work to be submitted depends on whether you are enrolled in the computer science or the mathematics course.
CMP 464 Homework: | MAT 456 Homework: | |
---|---|---|
#1 |
The Department of Transportation (DOT), as part of Vision Zero, wants to reduce accidents and speeding on roadways across the city and is interested in which signage has a larger effect on speeding. It collected data with two different messages: the first sign says "Speeding Kills" and the second sign displays the speed at which the car is moving. Data was collected for both signs:
#1: Submit a .pdf or .png file of your neatly handwritten or typed answer. | |
#2-7 |
Write a classifier program that predicts whether a name is a boy's or a girl's name based on the last letters of the name.
#2: Write a Python program that takes as input a Social Security Administration name file (see above for files and format) and outputs three files. The first file should have 26 lines (one for each letter of the alphabet). Each line contains three values: the letter, the fraction of boys' names in the training set (the input file) that end in that letter, and the fraction of girls' names in the training set that end in that letter. For example, a possible file could start:
a, 0.023, 0.451
b, 0.010, 0.008
...
The second file gives the fraction of boys' and girls' names that have the specified letter as their second-to-last letter. The third file gives the same fractions for the third-to-last letter. Submit your Python program as a .py file.
#3: Run your program with yob2009.txt and submit the first file generated by your program (that is, the fractions for the last letter of the name). Submit your file as a .txt file.
#4: Run your program with yob2009.txt and submit the second file generated by your program (that is, the fractions for the second-to-last letter of the name). Submit your file as a .txt file.
#5: Run your program with yob2009.txt and submit the third file generated by your program (that is, the fractions for the third-to-last letter of the name). Submit your file as a .txt file.
#6: Modify the Naive Bayes spam filter program to read in the three files generated above as well as a fourth file of test data (use another year's data, such as yob2010.txt). Instead of multiplying together word-occurrence probabilities, your program should multiply the fractions you computed above for the last, second-to-last, and third-to-last letters. Your program should classify each name in the test data (similar to the Naive Bayes filter from the book) and report the percentage of names you predicted correctly as well as the names you predicted incorrectly. Submit your Python program as a .py file.
#7: Submit the output of your program (the percentage you correctly predicted along with the names you predicted incorrectly) as a .txt file. |
#2: A popular business analytics company, Unbounce, provides this example of an A/B test:
Submit a .pdf or .png image of your neatly handwritten or typed answer.
#3: What if you saw proportions similar to those in #2 but ran the experiment for only a tenth as many pages:
Submit a .pdf or .png image of your neatly handwritten or typed answer.
#4: What if you saw proportions similar to those in #2 but ran the experiment for ten times as many pages:
Submit a .pdf or .png image of your neatly handwritten or typed answer.
#5: VWO, a popular web service for business analytics, describes A/B testing of showing visitors two different webpages:
Submit a .pdf or .png image of your neatly handwritten or typed answer.
#6: The Obama campaign used A/B testing to redesign the basic elements of its campaign website. A major goal was to increase the number of visitors who signed up with their email addresses. Replacing the default "Sign Up" with "Learn More" increased the number of signups per visitor by 18.6 percent. What sample size was needed to say that this was unlikely to occur by chance? That is, if the null hypothesis is that the two slogans are equally effective, what sample size yields a p-value < 0.05? Show your work.
Submit a .pdf or .png image of your neatly handwritten or typed answer.
#7: Exploring the class website, you find the central limit theorem, visualized. You decide to run it with bins = 3. After 100 iterations, you have
|
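For CMP 464 #2, the per-letter fractions could be computed along the lines of the sketch below. It is only one possible reading of the assignment, under several assumptions: each input line is assumed to look like `Name,Sex,Count` with sex `M` or `F`; each name is counted once regardless of its count; and names too short to have the requested letter are skipped. The function names `parse_line`, `letter_fractions`, and `format_table` are made up for illustration.

```python
import string
from collections import Counter


def parse_line(line):
    """Split one record into (name, sex). ASSUMED format: 'Name,Sex,Count'."""
    name, sex, _count = line.strip().split(",")
    return name, sex


def letter_fractions(records, position):
    """Return {letter: (boy_fraction, girl_fraction)} for the letter that is
    `position` places from the end of each name (1 = last letter).

    Assumption: every name counts once, and names shorter than `position`
    letters are left out of both the numerator and the denominator.
    """
    counts = {"M": Counter(), "F": Counter()}
    totals = {"M": 0, "F": 0}
    for name, sex in records:
        if sex not in counts or len(name) < position:
            continue
        letter = name[-position].lower()
        if letter not in string.ascii_lowercase:
            continue  # ignore anything that is not a plain a-z letter
        counts[sex][letter] += 1
        totals[sex] += 1
    return {
        letter: (
            counts["M"][letter] / totals["M"] if totals["M"] else 0.0,
            counts["F"][letter] / totals["F"] if totals["F"] else 0.0,
        )
        for letter in string.ascii_lowercase
    }


def format_table(fractions):
    """Render the 26 output lines in the 'letter, boys, girls' layout."""
    return [
        f"{letter}, {boy:.3f}, {girl:.3f}"
        for letter, (boy, girl) in sorted(fractions.items())
    ]
```

Calling `letter_fractions` with positions 1, 2, and 3 and writing each `format_table` result to its own file would give the three output files.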
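For CMP 464 #6, the classification step might look like the sketch below. It is not the textbook's code: it sums log-probabilities instead of multiplying raw fractions (to avoid underflow), and it replaces zero fractions with a small `floor` value, which is an assumption standing in for whatever smoothing you choose. The names `classify` and `evaluate`, the equal prior, and the tie-breaking rule are also illustrative choices.

```python
import math


def classify(name, tables, prior_boy=0.5, floor=1e-6):
    """Return 'M' or 'F' for `name`.

    tables[k - 1] maps a letter to (boy_fraction, girl_fraction) for the
    k-th-to-last letter. `floor` (an assumption) keeps log() away from zero.
    """
    log_boy = math.log(prior_boy)
    log_girl = math.log(1 - prior_boy)
    for position, table in enumerate(tables, start=1):
        if len(name) < position:
            break  # short names simply use fewer letters
        boy_frac, girl_frac = table.get(name[-position].lower(), (0.0, 0.0))
        log_boy += math.log(max(boy_frac, floor))
        log_girl += math.log(max(girl_frac, floor))
    # Ties go to 'M'; this is an arbitrary choice.
    return "M" if log_boy >= log_girl else "F"


def evaluate(test_records, tables):
    """Return (percent_correct, misclassified_names) for (name, sex) pairs."""
    wrong = [name for name, sex in test_records if classify(name, tables) != sex]
    correct = len(test_records) - len(wrong)
    percent = 100.0 * correct / len(test_records) if test_records else 0.0
    return percent, wrong
```

Here `tables` would be built by reading back the three files from #2, and `test_records` from another year's file such as yob2010.txt.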
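For the MAT 456 A/B testing questions (#2 through #6), a pooled two-proportion z-test is one common way to get a p-value. The sketch below is self-contained rather than imported from the textbook's hypothesis and inference code, and it may differ in detail from that code. The helper names are illustrative, and the baseline signup rate used in the usage note is purely hypothetical, since the assignment does not state one.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


def two_proportion_test(p_a, n_a, p_b, n_b):
    """Return (z, two_sided_p_value) for observed rates p_a and p_b.

    Uses the pooled standard error, which is appropriate under the null
    hypothesis that both versions share the same underlying rate.
    """
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate of 0 or 1 gives no usable standard error")
    z = (p_b - p_a) / se
    return z, 2 * (1 - normal_cdf(abs(z)))


def smallest_significant_n(p_a, relative_lift, alpha=0.05, max_n=10_000_000):
    """Smallest equal group size n with p-value < alpha, assuming the observed
    rates are exactly p_a and p_a * (1 + relative_lift). Returns None if no
    n up to max_n qualifies."""
    p_b = p_a * (1 + relative_lift)
    for n in range(1, max_n + 1):
        _, p_value = two_proportion_test(p_a, n, p_b, n)
        if p_value < alpha:
            return n
    return None
```

For example, `two_proportion_test(0.20, 1000, 0.18, 1000)` uses made-up rates to show the mechanics; keeping the rates fixed and multiplying or dividing both sample sizes by ten shows how the p-value moves in #3 and #4. For #6, `smallest_significant_n(0.08, 0.186)` assumes a hypothetical 8 percent baseline signup rate; the answer depends strongly on that baseline.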