
## Coursework

CSci 39542: Introduction to Data Science
Department of Computer Science, Hunter College, City University of New York
Fall 2021


## Quizzes

Unless otherwise noted, quizzes are available on Blackboard for the 24 hours after lecture. Blackboard quizzes are 15 minutes long and can be repeated up to the deadline. The highest score earned on a Blackboard quiz will be reported. Blackboard access is generated automatically from the registrar. See the ICIT Blackboard page for resources and tutorials for using the system.

Five of the quizzes assess your programming skill using HackerRank. These quizzes are 30 minutes long and cannot be repeated. Links will be available on Blackboard to access the quiz.

There are no make-up quizzes. Instead, your score on the final exam will replace missing quiz grades (the final exam will also replace a quiz grade when you take the quiz but do better on the final exam). See the syllabus for additional information on how grades are calculated.

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.

Quiz 2: Due 4pm, Tuesday, 31 August.   The second quiz focuses on the Python Recap: basics and standard packages (pandas, numpy, matplotlib, & seaborn), zips, and list comprehensions from Lecture 1.

Quiz 3: Due 4pm, Friday, 3 September.  The quiz covers data sampling from the third lecture and the reading: DS 100: Chapter 2 (Theory for Data Design) and includes Python review questions.

Quiz 4: Due 4pm, Friday, 10 September.  The quiz covers Python string methods and Python data types from the second and third lectures and the reading: DS 100: Section 13.1 (Python String Methods) and subsetting dataframes from DS 100: Chapter 7 (Data Tables in Python).

Quiz 7: Due 4pm, Friday, 24 September.  This quiz covers SQL from Lectures #4, 5 & 6 and DS 100: Chapter 7 (Relational Databases & SQL).

Quiz 9: Due 4pm, Friday, 1 October.  The focus is functions in Python, covered in the code demos in Lectures #5 and #6.

Quiz 13: Due 4pm, Tuesday, 19 October.  The focus of this quiz is probability and risk, covered in Lecture #12, DS 8: Chapter 9 (Randomness), and DS 100: Chapter 16 (Probability & Generalization).

Quiz 15: Due 4pm, Tuesday, 26 October.  This is a coding quiz on HackerRank. You will be sent an invitation to the email you use for Gradescope for this quiz. You have 30 minutes to complete the quiz, and the quiz cannot be repeated.

More to come...

## Homework

Unless otherwise noted, programs are submitted on the course's Gradescope site and are written in Python. Also, to receive full credit, the code should be compatible with Python 3.6 (the default for the Gradescope autograders).

All students registered by Monday, 23 August were sent a registration invitation to the email on record on their Blackboard account. If you did not receive the email or would like to use a different account, post to Help::Individual Questions (on the left-hand menu when logged into the course site on Blackboard). Include in your post that you did not receive a Gradescope invitation and your preferred email, and we will manually generate an invitation. By default, we use your name as it appears in Blackboard/CUNYFirst (to update CUNYFirst, see changing your personal information). If you prefer a different name for Gradescope, include it in your post, and we will update the Gradescope registration.

To get full credit for a program, the file must include in the opening comment:

• The email you are using for Gradescope.
• A list of any resources you used for the program. Include classmates and tutors that you worked with, along with any websites or tutorials that you used. If you used no resources (other than the class notes and textbooks), then you should include the line: "No resources used."
For example, for the student, Thomas Hunter, the opening comment of his first program might be:

"""
Name:  Thomas Hunter
Email: thomas.hunter.1870@hunter.cuny.edu
Resources:  Used python.org as a reminder of Python 3 print statements.
"""

and then followed by his Python program.

Set 1: The first set of programs recaps familiar Python constructs and packages. None are challenging; instead, their purpose is to review and to ensure that your Python IDE is functional, has the basic libraries, and that you can submit programs to Gradescope.

Program 1: Hello, world. Due noon, Friday, 27 August.
(Learning Objective: students are able to use a Python IDE on their computer and successfully submit the work to the Gradescope system.)

Submit a Python program that prints: Hello, world

Program 2: Senators' Names.

Write a program, using the pandas package, that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Next, the program should select rows where the field senate_class is non-empty and write the first_name and last_name columns to a file with the output file name provided by the user.

For example, if the file was legislators-current.csv, with the first data rows being:


Brown,Sherrod,,,,Sherrod Brown,1952-11-09,M,sen,OH,,1,Democrat,https://www.brown.senate.gov,503 Hart Senate Office Building Washington DC 20510,202-224-2315,http://www.brown.senate.gov/contact/,http://www.brown.senate.gov/rss/feeds/?type=all&,SenSherrodBrown,SenatorSherrodBrown,SherrodBrownOhio,UCgy8jfERh-t_ixkKKoCmglQ,B000944,00136,N00003535,S307,"H2OH13033,S6OH00163",5051,400050,27018,Sherrod Brown,,29389,Sherrod Brown
Cantwell,Maria,,,,Maria Cantwell,1958-10-13,F,sen,WA,,1,Democrat,https://www.cantwell.senate.gov,511 Hart Senate Office Building Washington DC 20510,202-224-3441,http://www.cantwell.senate.gov/public/index.cfm/email-maria,http://www.cantwell.senate.gov/public/index.cfm/rss/feed,SenatorCantwell,senatorcantwell,SenatorCantwell,UCN52UDqKgvHRk39ncySrIMw,C000127,00172,N00007836,S275,"S8WA00194,H2WA01054",26137,300018,27122,Maria Cantwell,,39310,Maria Cantwell

Then a sample run of the program:
Enter input file name: legislators-current.csv
Enter output file name:  senatorNames.csv

And the first three lines of senatorNames.csv would be:

first_name,last_name
Sherrod,Brown
Maria,Cantwell

Note: if you use the legislators CSV file above, your output file should have 101 lines: 1 line of header information and 100 rows of data.
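One possible sketch of the filtering step, written as a helper so it can be tested separately (the name select_senators is ours, not part of the assignment):

```python
import pandas as pd

def select_senators(df):
    """Return first_name/last_name for rows where senate_class is non-empty."""
    senators = df[df['senate_class'].notna()]   # empty CSV fields load as NaN
    return senators[['first_name', 'last_name']]

if __name__ == '__main__':
    infile = input('Enter input file name: ')
    outfile = input('Enter output file name: ')
    select_senators(pd.read_csv(infile)).to_csv(outfile, index=False)
```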

Program 3: Senators' Ages. Due noon, Wednesday, 1 September.
(Learning Objective: to refresh students' knowledge of Pandas' functionality to create new columns from existing columns of formatted data.)

Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Next, the program should select rows where the field senate_class is non-empty and write the first_name and compute the age based on the birthday field as of the first of the year. Your program should write out a new CSV file (with the name provided by the user) with the two columns: first_name and age.

For example, if the file was legislators-current.csv, with the first data rows being:


Brown,Sherrod,,,,Sherrod Brown,1952-11-09,M,sen,OH,,1,Democrat,https://www.brown.senate.gov,503 Hart Senate Office Building Washington DC 20510,202-224-2315,http://www.brown.senate.gov/contact/,http://www.brown.senate.gov/rss/feeds/?type=all&,SenSherrodBrown,SenatorSherrodBrown,SherrodBrownOhio,UCgy8jfERh-t_ixkKKoCmglQ,B000944,00136,N00003535,S307,"H2OH13033,S6OH00163",5051,400050,27018,Sherrod Brown,,29389,Sherrod Brown
Cantwell,Maria,,,,Maria Cantwell,1958-10-13,F,sen,WA,,1,Democrat,https://www.cantwell.senate.gov,511 Hart Senate Office Building Washington DC 20510,202-224-3441,http://www.cantwell.senate.gov/public/index.cfm/email-maria,http://www.cantwell.senate.gov/public/index.cfm/rss/feed,SenatorCantwell,senatorcantwell,SenatorCantwell,UCN52UDqKgvHRk39ncySrIMw,C000127,00172,N00007836,S275,"S8WA00194,H2WA01054",26137,300018,27122,Maria Cantwell,,39310,Maria Cantwell

Then a sample run of the program:
Enter input file name: legislators-current.csv
Enter output file name:  senatorAge.csv

And the first three lines of senatorAge.csv would be:

first_name,age
Sherrod,68
Maria,62

since that was their ages as of the start of the year: January 1, 2021.

Note: if you use the legislators CSV file above, your output file should have 101 lines: 1 line of header information and 100 rows of data.
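A sketch of the age computation (the helper name ages_on and its default date are ours; the birthday column format, YYYY-MM-DD, follows the sample rows above). The key step is subtracting a year when the birthday falls after the reference date:

```python
import pandas as pd

def ages_on(df, as_of='2021-01-01'):
    """Age of each senator (senate_class non-empty) as of the given date."""
    sen = df[df['senate_class'].notna()].copy()
    born = pd.to_datetime(sen['birthday'])
    ref = pd.Timestamp(as_of)
    # Start from the year difference, then subtract 1 for anyone whose
    # birthday has not yet occurred by the reference month/day.
    sen['age'] = ref.year - born.dt.year - (
        (born.dt.month > ref.month) |
        ((born.dt.month == ref.month) & (born.dt.day > ref.day))
    ).astype(int)
    return sen[['first_name', 'age']]
```

With the sample rows above, Sherrod (born 1952-11-09) comes out as 68 and Maria (born 1958-10-13) as 62 on January 1, 2021.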

Program 4: ELA Proficiency. Due noon, Thursday, 2 September.
(Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting data.)

Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Next, the program should select rows where the field Grade is equal to 3 and the Year is equal to 2019 and write all rows that match that criteria to a new CSV file.

Then a sample run of the program:

Enter input file name: school-ela-results-2013-2019.csv
Enter output file name:  ela2013.csv

where the file school-ela-results-2013-2019.csv is extracted from NYC Schools Test Results (a truncated version of roughly the first 1000 lines is available for testing). The first lines of the output file would be:

School,Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
01M015,P.S. 015 ROBERTO CLEMENTE,3,2019,All Students,27,606,1,3.7,7,25.9,18,66.7,1,3.7,19,70.4
01M019, P.S. 019 ASHER LEVY,3,2019,All Students,24,606,0,0.0,8,33.3,15,62.5,1,4.2,16,66.7
01M020,P.S. 020 ANNA SILVER,3,2019,All Students,57,593,13,22.8,24,42.1,18,31.6,2,3.5,20,35.1
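The selection itself is a standard Pandas filter. One wrinkle worth noting (the helper name filter_grade_year is ours): because the Grade column in these files mixes numbers with labels like "All Grades", it loads as strings, so the comparison should be done as strings:

```python
import pandas as pd

def filter_grade_year(df, grade=3, year=2019):
    """Rows where Grade equals grade and Year equals year."""
    # Grade mixes numbers and labels, so compare string to string.
    match = df['Grade'].astype(str) == str(grade)
    return df[match & (df['Year'] == year)]
```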


Set 2: The second set of programs focuses on incorporating and analyzing rectangular data, in terms of relational databases and data frames. The goal is familiarity with these canonical representations to use as building blocks for future analysis, programs, and your project.

Program 5: URL Collection. Due noon, Friday, 3 September.
(Learning Objective: to use regular expressions with simple patterns to filter column data in a canonical example: scraping a website of URLs.)

Write a program that asks the user for the name of an input HTML file and the name of an output CSV file. Your program should use regular expressions (see Chapter 12.4 for using the re package in Python) to find all links in the input file and store the link text and URL as columns, Title and URL, in the CSV file specified by the user. For the URL, strip off the leading https:// or http:// and any trailing slashes (/).

For example, if the input file is:


<html>

<body>
<p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>.  </p>

</body>
</html>

Then a sample run of the program:
Enter input file name: simple.html
Enter output file name: links.csv

And links.csv would be:

Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
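A sketch of the extraction (the pattern and helper name are ours; a simple href pattern like this is enough for well-formed input such as the sample file, though not for arbitrary HTML):

```python
import re

# Capture the quoted href value and the link text between the tags.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"\s*>(.*?)</a>', re.DOTALL)

def extract_links(html):
    """Return (title, url) pairs with scheme and trailing slashes stripped."""
    pairs = []
    for url, title in LINK_RE.findall(html):
        url = re.sub(r'^https?://', '', url)   # drop leading http:// or https://
        url = re.sub(r'/+$', '', url)          # drop trailing slash(es)
        pairs.append((title, url))
    return pairs
```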

Program 6 is cancelled. See announcement on Blackboard.

Program 6: Regex on Restaurant Inspection Data. Due noon, Thursday, 9 September.
(Learning Objective: The two learning objectives of this exercise are a) to give the students an opportunity to practice their newfound regular expression (regex) skills and b) to familiarize them with the restaurant inspection dataset, which will be used again in the later SQL programs.)

Use regular expressions (covered in Lecture 3 & DS 100: Sections 12.2-3) to clean restaurant inspection datasets that we will use in later SQL programs. Your program should:

• Ask the user for names of an input file and output file. You can assume that the input file is a CSV file with column names from the restaurant inspection datasets. Your program should open the named CSV input file and save the results in the named output file.
• There are many invalid phone numbers in the data set. Instead of 10 digits, e.g. 2125551212, some entries have extra characters and not enough digits, e.g. 212-555-12. First, use regex to check that exactly 10 digits were entered. If not, set the value to "" (empty string) since the current data is corrupt. If the phone number does have 10 digits, use regex to parse out the 10-digit phone number DDDDDDDDDD and re-concatenate it into an 11-digit format: 1-digit country code (as 1), 3-digit area code, and 7-digit telephone number, so that the final output looks like: +1-DDD-DDD-DDDD.
• With the inspection date column, use regex to break out the year, month, and day components, then re-concatenate the MM/DD/YYYY date as YYYY/MM/DD.
• With the restaurant name column DBA, create a new column called "restaurant_name" that reformats the data in the original column from all upper case to title case (e.g. CAFE 57 becomes Cafe 57). Separately, create a new Boolean column that flags True if the word Thai (not case sensitive) appears in the original restaurant name.

For example, if the file was restaurants30July.csv, with the first lines of:


CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
41178124,CAFE 57,Manhattan,300,WEST   57 STREET,10019,2126492729,American,7/30/2021,Violations were cited in the following area(s).,09C,Food contact surface not properly maintained.,Not Critical,4,A,7/30/2021,8/1/2021,Cycle Inspection / Initial Inspection,40.76643902,-73.98332508,104,3,13900,1025451,1010477502,MN15
50111450,CASTLE CHICKEN,Bronx,5987A,BROADWAY,10471,9178562047,Chicken,7/30/2021,Violations were cited in the following area(s).,05D,Hand washing facility not provided in or near food preparation area and toilet room. Hot and cold running water at adequate pressure to enable cleanliness of employees not provided at facility. Soap and an acceptable hand-drying device not provided.,Critical,41,N,,8/1/2021,Pre-permit (Operational) / Initial Inspection,40.88993027,-73.89805316,208,11,28500,2084208,2058011033,BX29
40699339,NICK GARDEN COFFEE SHOP,Bronx,2953,WEBSTER AVENUE,10458,7183652277,Coffee/Tea,7/30/2021,Violations were cited in the following area(s).,08A,Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.,Not Critical,31,,,8/1/2021,Cycle Inspection / Initial Inspection,40.86759042,-73.88308647,207,11,41500,2016446,2032800061,BX05

Then a sample run of the program:

Enter input file name: restaurants30July.csv
Enter output file name:  july30filtered.csv

And the first three lines of july30filtered.csv would be:

CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,restaurant_name,thai_boolean
41178124,CAFE 57,Manhattan,300,WEST  57 STREET,10019,+1-212-649-2729,American,2021/07/30,Violations were cited in the following area(s).,09C,Food contact surface not properly maintained.,Not Critical,4,A,7/30/2021,8/1/2021,Cycle Inspection / Initial Inspection,40.76643902,-73.98332508,104,3,13900,1025451,1010477502,MN15,Cafe 57 ,False
50111450,CASTLE CHICKEN,Bronx,5987A,BROADWAY,10471,+1-917-856-2047,Chicken,2021/07/30,Violations were cited in the following area(s).,05D,Hand washing facility not provided in or near food preparation area and toilet room. Hot and cold running water at adequate pressure to enable cleanliness of employees not provided at facility. Soap and an acceptable hand-drying device not provided.,Critical,41,N,,8/1/2021,Pre-permit (Operational) / Initial Inspection,40.88993027,-73.89805316,208,11,28500,2084208,2058011033,BX29,Castle Chicken ,False
40699339,NICK GARDEN COFFEE SHOP,Bronx,2953,WEBSTER AVENUE,10458,+1-718-365-2277,Coffee/Tea,2021/07/30,Violations were cited in the following area(s).,08A,Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.,Not Critical,31,,,8/1/2021,Cycle Inspection / Initial Inspection,40.86759042,-73.88308647,207,11,41500,2016446,2032800061,BX05,Nick Garden Coffee Shop ,False
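The per-value cleaning rules above might be sketched as small helpers, to be applied to the relevant columns (e.g. with .apply()); the helper names here are ours, not required by the assignment:

```python
import re

def clean_phone(phone):
    """Return +1-DDD-DDD-DDDD if phone is exactly 10 digits, else ""."""
    m = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', str(phone))
    return '+1-{}-{}-{}'.format(*m.groups()) if m else ''

def clean_date(date):
    """Rewrite M/D/YYYY (or MM/DD/YYYY) as YYYY/MM/DD."""
    m = re.fullmatch(r'(\d{1,2})/(\d{1,2})/(\d{4})', str(date))
    if not m:
        return ''
    month, day, year = m.groups()
    return '{}/{}/{}'.format(year, month.zfill(2), day.zfill(2))

def restaurant_name(dba):
    """Title-cased version of the all-caps DBA value."""
    return str(dba).title()

def mentions_thai(dba):
    """True if 'thai' appears in the name, ignoring case."""
    return 'thai' in str(dba).lower()
```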


Program 7: Neighborhood Tabulation Areas. Due noon, Friday, 10 September.
(Learning Objective: The learning objective of this exercise is to give the students an opportunity to practice their newfound SQL skills.)

The package pandasql provides an easy way to use SQL queries directly on a Pandas DataFrame. (You may need to install it first, e.g. pip install pandasql.)

Once installed, you can run queries via the function sqldf(query). For example, you could filter for all students in roster.csv on the waitlist by:


import pandas as pd
import pandasql as psql

roster = pd.read_csv('roster.csv')   # sqldf finds DataFrames by variable name

q = 'SELECT * FROM roster WHERE Role = "Waitlist Student"'
waitList = psql.sqldf(q)

print(waitList)



For this program, ask the user for the input and output file names. You should assume that the input file contains the New York City Neighborhood Tabulation Areas, such as nynta.csv. Use sqldf() to filter the dataset to return the NTACode and NTAName columns, labeled as NTA and NTA_Name, respectively. Save the result as a CSV in the output file named by the user.

Program 8: Restaurant SQL Queries. Due noon, Monday, 13 September.

Your program should ask for the input file name (must include .csv) and then for an output file prefix (must not include any extension). For example, with restaurantJuly2020.csv for the input and selected for the output prefix, the program should create 4 files: selectedA.csv, selected70.csv, selectedZIP.csv, and selectedAll.csv.

Using SQL (see DS 100: Section 6.2), extract the following information from a restaurant inspection dataset (a small file of inspections from 30 July is available: restaurants30July.csv):

• Save all rows in table where Grade = A to the output file prefix+"A.csv" where prefix holds the value specified by the user.
• Save all rows in table where SCORE > 70 to the output file prefix+"70.csv".
• Save all rows in table where ZIPCODE IN (10002, 10027, 10036) to the output file prefix+"ZIP.csv"
• Save columns DBA renamed as restaurant_name, CUISINE DESCRIPTION renamed as cuisine_description, BORO renamed as borough, and GRADE and select only those entries with: GRADE = A and ZIPCODE IN (10002, 10027, 10036) and save to the output file prefix+"All.csv".

Note: The file names are case-sensitive, so the autograder will not accept ...ALL.csv in place of ...All.csv.
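Since pandasql runs its queries through SQLite, the query strings are what matter; here they are sketched and exercised with the standard library's sqlite3 module (in your program you would pass the same strings to psql.sqldf). The helper run_query and the table name restaurants are ours. Note the backticks around CUISINE DESCRIPTION, needed because the column name contains a space:

```python
import sqlite3
import pandas as pd

def run_query(df, query):
    """Load df into an in-memory SQLite table and run the query against it."""
    conn = sqlite3.connect(':memory:')
    df.to_sql('restaurants', conn, index=False)
    result = pd.read_sql_query(query, conn)
    conn.close()
    return result

qA   = "SELECT * FROM restaurants WHERE GRADE = 'A'"
q70  = "SELECT * FROM restaurants WHERE SCORE > 70"
qZIP = "SELECT * FROM restaurants WHERE ZIPCODE IN (10002, 10027, 10036)"
qAll = ("SELECT DBA AS restaurant_name, "
        "`CUISINE DESCRIPTION` AS cuisine_description, "
        "BORO AS borough, GRADE FROM restaurants "
        "WHERE GRADE = 'A' AND ZIPCODE IN (10002, 10027, 10036)")
```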

Program 9: Aggregating Restaurant Data (SQL). Due noon, Tuesday, 14 September.
(Learning Objective: The learning objective of this exercise is to give the students an opportunity to practice more advanced SQL skills (e.g. GROUP BY's) on a familiar dataset.)

Using the more advanced SQL commands from DS 100: Section 5.1 (e.g. GROUP BY's), this program finds distinct restaurant names and distinct cuisines by locale. For testing, a small file of inspections from 1 August is available: brooklynJuly2021.csv.

Your program should ask for the input file name (must include .csv) and then for an output file prefix (must not include any extension).

• Save a unique column of distinct restaurants (note: return the restaurant names, not the count) to the output file prefix+"Restaurants.csv" where prefix holds the value specified by the user.
• Save a unique column of distinct cuisines (note: return the cuisine names, not the count) where ZIPCODE = 11224 to the output file prefix+"Cuisines11224.csv" where prefix holds the value specified by the user.
• Save 2 columns, the cuisine and the count of unique restaurants per cuisine, where ZIPCODE = 11224, to the output file prefix+"Counts11224.csv" where prefix holds the value specified by the user.
• Save 3 columns: the borough name, the unique count of cuisines per borough, and the unique count of restaurants per borough. The results should be sorted by the borough name (in ascending order: ASC) and saved to the output file prefix+"Boro.csv" where prefix holds the value specified by the user.

For example, if you entered brooklynJuly2021.csv and selected for the output prefix, the program should create 4 files: selectedRestaurants.csv, selectedCuisines11224.csv, selectedCounts11224.csv, and selectedBoro.csv. The first several lines of selectedRestaurants.csv are:

DBA
1 HOTEL BROOKLYN BRIDGE
14 OLD FULTON STREET
98K
99 CENT PIZZA
ABURI SUSHI BAR


The file selectedCuisines11224.csv is:

cnt
American

(since our test file only has restaurants that serve American food in the 11224 zipcode)

The file selectedCounts11224.csv is:

CUISINE DESCRIPTION,COUNT(DISTINCT DBA)
American,3


The file selectedBoro.csv is:

borough,cnt_cuisine,cnt_restaurants
Brooklyn,50,384
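As with Program 8, the GROUP BY queries can be sketched against an in-memory SQLite table (pandasql uses SQLite underneath; in your program the same strings would go to psql.sqldf). The helper name and table name are ours:

```python
import sqlite3
import pandas as pd

def run_query(df, query):
    """Run query against df loaded as an in-memory SQLite table."""
    conn = sqlite3.connect(':memory:')
    df.to_sql('restaurants', conn, index=False)
    result = pd.read_sql_query(query, conn)
    conn.close()
    return result

qRestaurants = "SELECT DISTINCT DBA FROM restaurants ORDER BY DBA"
qCuisines = ("SELECT DISTINCT `CUISINE DESCRIPTION` "
             "FROM restaurants WHERE ZIPCODE = 11224")
qCounts = ("SELECT `CUISINE DESCRIPTION`, COUNT(DISTINCT DBA) "
           "FROM restaurants WHERE ZIPCODE = 11224 "
           "GROUP BY `CUISINE DESCRIPTION`")
qBoro = ("SELECT BORO AS borough, "
         "COUNT(DISTINCT `CUISINE DESCRIPTION`) AS cnt_cuisine, "
         "COUNT(DISTINCT DBA) AS cnt_restaurants "
         "FROM restaurants GROUP BY BORO ORDER BY BORO ASC")
```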


Program 10: Extracting Districts. Due noon, Monday, 20 September.
(Learning Objective: successfully write and apply functions to DataFrames to clean data.)

Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Your program should include a function, extractDistrict(), that takes a string as input and returns the number represented by the first two characters of the string:


def extractDistrict(name):
    '''
    Extracts the district (first two characters) as an integer.
    Input:  Character string containing district + school num (e.g. "01M015")
    Returns:  The first two characters as an integer (e.g. 1)
    '''

    #### Your code goes here ####


Your program should apply the function to each row, taking the first two characters of the School field, converting them into a number, and storing the results in a new column, District. That is,
df['District'] = df['School'].apply(extractDistrict)

For example, if the School is "01M015", the entry in the new column would be 1 (stored as a number, not a string).

The results should be written to a new CSV file, name provided by the user.
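One way the stub might be completed (a sketch, not necessarily the intended solution; int() takes care of dropping the leading zero):

```python
def extractDistrict(name):
    '''
    Extracts the district (first two characters) as an integer.
    Input:  Character string containing district + school num (e.g. "01M015")
    Returns:  The first two characters as an integer (e.g. 1)
    '''
    # Slice off the first two characters and convert, so "01" becomes 1.
    return int(str(name)[:2])
```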

Program 11: Restaurants and NTAs (SQL).

For testing, a small file of inspections from 30 July is available: restaurants30July.csv, along with the Neighborhood Tabulation Areas (NTA): nta.csv.

Your program should ask for two input file names (each must include .csv) and then for an output file prefix (must not include any extension). For example, with restaurantJuly2020.csv and nta.csv for the input and selected for the output prefix, the program should create 6 files: selected1.csv, selected2.csv, selected3.csv, selected4.csv, selected5.csv, and selected6.csv.

1. Save the NTA column from the restaurant inspection table to the output file prefix+"1.csv" where prefix holds the value specified by the user.
2. Save the count of unique NTAs in the restaurant health inspection table to the output file prefix+"2.csv" where prefix holds the value specified by the user. (Note this will have a single column and a single value.)
3. Save the NTA column and the count of the distinct restaurants from the restaurant inspection table to the output file prefix+"3.csv" where prefix holds the value specified by the user. (Hint: how can you use GROUP BY to organize the output?)
4. Save the number of rows in the NTA table and the number of unique NTAs in the NTA table to the output file prefix+"4.csv" where prefix holds the value specified by the user. (Note this will have two rows and two columns.)
5. Save the name of each restaurant and its NTA, which can be found via a LEFT JOIN of the restaurant inspection table and the NTA table. Save the results to the output file prefix+"5.csv" where prefix holds the value specified by the user. (Hint: join on the NTA code found in both tables, under slightly different names. Your output should have two columns.)
6. Building on the result from 5) above, keep the LEFT JOIN as is, but do one more level of aggregation, so that the end result contains 3 columns (unique NTA code, unique NTA description, and the count of distinct restaurants, grouped by the first 2 columns). Save the result to the output file prefix+"6.csv" where prefix holds the value specified by the user.
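The two join queries (items 5 and 6) can be sketched with two in-memory SQLite tables; in the assignment the same query strings would be passed to psql.sqldf with both DataFrames in scope. The helper and alias names are ours, and the column names (NTA in the inspection table, NTACode/NTAName in the NTA table) follow the datasets described above:

```python
import sqlite3
import pandas as pd

def run_join(restaurants, nta, query):
    """Load both DataFrames into an in-memory SQLite db and run the query."""
    conn = sqlite3.connect(':memory:')
    restaurants.to_sql('restaurants', conn, index=False)
    nta.to_sql('nta', conn, index=False)
    result = pd.read_sql_query(query, conn)
    conn.close()
    return result

q5 = ("SELECT r.DBA, n.NTAName FROM restaurants r "
      "LEFT JOIN nta n ON r.NTA = n.NTACode")
q6 = ("SELECT n.NTACode, n.NTAName, COUNT(DISTINCT r.DBA) AS cnt "
      "FROM restaurants r LEFT JOIN nta n ON r.NTA = n.NTACode "
      "GROUP BY n.NTACode, n.NTAName")
```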

Program 12: MTA Ridership. Due noon, Thursday, 23 September.
(Learning Objective: to reinforce Pandas skills via use for data aggregating and data cleaning.)

In the next lecture, we will be summarizing time-series data and using a cleaned version of MTA subway and bus ridership, inspired by Oldenburg's NYC Transit Turnstile Data.

Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user, which you can assume will include the column names: date, entries, and exits. You should create a new file that has one row for each date, consisting of the sum of all entries and the sum of all exits on that date. This aggregate data should be stored in the output CSV, which should contain only the three columns: date, entries, and exits, even if there are additional columns in the input CSV file.

For example, if the file was the 2020 data for Staten Island, mta_trunc_staten_island.csv, with the first lines of:

stop_name,daytime_routes,division,line,borough,structure,gtfs_longitude,gtfs_latitude,complex_id,date,entries,exits
St George,SIR,SIR,Staten Island,SI,Open Cut,-74.073643,40.643748,501,2020-01-01,2929,0
St George,SIR,SIR,Staten Island,SI,Open Cut,-74.073643,40.643748,501,2020-01-02,13073,0
St George,SIR,SIR,Staten Island,SI,Open Cut,-74.073643,40.643748,501,2020-01-03,11857,23

Then a sample run of the program:
Enter input file name: mta_trunc_staten_island.csv
Enter output file name:  filteredSI.csv
And the first three lines of filteredSI.csv would be:
date,entries,exits
2020-01-01,3128,0
2020-01-02,13707,0
2020-01-03,12507,23
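The aggregation is a single groupby; the helper name daily_totals is ours. Selecting only the entries and exits columns inside the groupby drops the station-level columns automatically:

```python
import pandas as pd

def daily_totals(df):
    """Sum entries and exits per date, keeping only the three columns."""
    out = df.groupby('date', as_index=False)[['entries', 'exits']].sum()
    return out[['date', 'entries', 'exits']]
```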

Set 3: The third set of programs integrates visualization techniques with analyzing structured data sets. While the programs do not cover every visualization technique, the practice these programs provide will be directly relevant to your project.

Program 13: Column Summaries. Due noon, Friday, 24 September.
(Learning Objective: to strengthen function-writing skills and examine alternate ways to summarize time-series data.)

In lecture, we used the Pandas function rolling() to compute a 7-day average of subway ridership for the visualization of ridership in 2020. For this program, write three functions that take as input a Pandas Series (e.g. a column of a DataFrame) and that highlight different patterns in the data:

• cumulativeAverage(column): Assumes the input is a Series of numerical data. Returns a Series with the cumulative (running) average of the values. For example, if the first 5 entries are 10,20,30,40,20, the Series that is returned would start out as 10,15,20,25,24 (since 10/1 = 10, (10+20)/2 = 15, (10+20+30)/3 = 20, (10+20+30+40)/4 = 25, (10+20+30+40+20)/5 = 24). Note that it is different from the rolling() we used in class, since this function creates a value for all entries and averages across all values seen.
• cyclicAverage(column): Assumes the input is a Series of numerical data. Returns a Series with the average of the current day with, if they exist, the values from 7 days previously and 14 days previously. That is, for the entry at index i and offset = 7, take the average of the values at indices i, i-1*offset, and i-2*offset, using only those indices that exist. Since ridership is highly dependent on the day of the week, this averages the values of the same day in past weeks.
• exponentialSmoothing(column): Assumes the input is a Series of numerical data. Returns a Series with a weighted average of the previous values, with the most recent values having higher weight and older ones lower weights. The value for the first entry, newCol[0], is column[0]. The value for subsequent entries is newCol[t+1] = 0.5*column[t+1] + 0.5*newCol[t]. For example, if we had the same Series as above, 10,20,30,40,20, our new column would start 10,15,22.5,31.25,25.625 (since the first entry is the same, 0.5*20+0.5*10 = 15, 0.5*30+0.5*15 = 22.5, 0.5*40+0.5*22.5 = 31.25, 0.5*20 + 0.5*31.25 = 25.625).
Note: you should submit a file with only the standard comments at the top, and these three functions. The grading scripts will then import the file for testing.
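One way the three behaviors can be sketched with built-in Pandas operations (a sketch under the stated assumptions, not necessarily the intended solution; check edge cases against the examples above):

```python
import pandas as pd

def cumulativeAverage(column):
    # Running mean over all values seen so far.
    return column.expanding().mean()

def cyclicAverage(column, offset=7):
    # Mean of the value at i with those at i-offset and i-2*offset,
    # when they exist (shift() introduces NaN, which mean() skips).
    stacked = pd.concat(
        [column, column.shift(offset), column.shift(2 * offset)], axis=1)
    return stacked.mean(axis=1)

def exponentialSmoothing(column):
    # newCol[t+1] = 0.5*column[t+1] + 0.5*newCol[t]; ewm with
    # adjust=False and alpha=0.5 computes exactly this recurrence.
    return column.ewm(alpha=0.5, adjust=False).mean()
```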

Program 14: Cleaning Library Data.

Write two functions that will be used to clean the OpenData NYC dataset of Libraries in New York City (downloaded as a CSV file). The first three lines of the CSV file look like:


the_geom,NAME,STREETNAME,HOUSENUM,CITY,ZIP,URL,BIN,BBL,X,Y,SYSTEM,BOROCODE
POINT (-73.95353074430393 40.80297988196676),115th Street,West 115th Street,203,New York,10026,http://www.nypl.org/locations/115th-street,1055236,1018310026,997115.12977,231827.652864,NYPL,1
POINT (-73.9348475633247 40.80301816141575),125th Street,East 125th Street,224,New York,10035,http://www.nypl.org/locations/125th-street,1054674,1017890037,1002287.604,231844.894956,NYPL,1

Each function takes as input a row of the table:
• extractLatLon(row): This function takes the value from the column the_geom, extracts the longitude and latitude from the string (they are surrounded by parentheses and separated by a space), and returns the two as numerical values. For example, the function would return -73.95353074430393, 40.80297988196676 when applied to the first row of data.
• extractTitle(row): This function concatenates the values from the columns NAME, CITY, and ZIP into a single string, separated by a comma and a space, and returns the string (to be used as the title for our visualizations). For example, when applying this function to the first data row, the return value would be: 115th Street, New York, 10026.

Note: you should submit a file with only the standard comments at the top, and these two functions. The grading scripts will then import the file for testing. A sample test program that assumes your program is called p14.py and the CSV file is called LIBRARY.csv is test14.py.
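A sketch of the two functions (under the assumption, from the sample rows above, that the_geom always has the form POINT (lon lat); string splitting is enough here, though a regex would work too):

```python
def extractLatLon(row):
    # the_geom looks like "POINT (-73.95353 40.80298)": take the text
    # inside the parentheses and split on the space.
    inside = row['the_geom'].split('(')[1].rstrip(')')
    lon, lat = inside.split()
    return float(lon), float(lat)

def extractTitle(row):
    # NAME, CITY, ZIP joined with ", " for use as a plot title.
    return '{}, {}, {}'.format(row['NAME'], row['CITY'], row['ZIP'])
```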

Program 15: Plotting Challenge. Due noon, Tuesday, 28 September.
(Learning Objective: to explore and master matplotlib.pyplot commands to create data visualizations.)

The goal is to create a plot of the NYC OpenData Motor Vehicle Collisions data that follows this style. For example, here is the plot for the January 2020 dataset:

Your program should begin by asking the user for input and output files. It should be written to take any dataset from the NYC OpenData Motor Vehicle Collisions and produce an image that matches this style. The resulting image should be saved to the output file specified by the user.

Hint: to transform the data into separate columns (i.e. "unstack"/pivot the groups to be columns) for the daily number of collisions for each borough:

boroDF = df.groupby(['BOROUGH','CRASH DATE']).count()['CRASH TIME'].unstack().transpose()

where df is the DataFrame with the collisions data.

Program 16: Choropleth Attendance Cleaning. Due noon, Thursday, 30 September.
(Learning Objective: to gain competency cleaning data using pandas functions.)

In lecture, we wrote a program, schoolsChoropleth.py, using the school district files from Programs 10 & 11 to make a choropleth map of top English Language Arts scores, by district, in New York City:

For this program, write a program that cleans district school attendance data so that we can use the same visualization to see attendance for different districts.

Your stand-alone program should ask the user for the input file name, the output file name, as well as the grade and school year to use as filters. For example, a sample run of the program on public-district-attendance-results-2014-2019.csv:


Enter input file name: public-district-attendance-results-2014-2019.csv
Enter year: 2018-19

If the input file starts as:

District,Grade,Year,Category,# Total Days,# Days Absent,# Days Present,% Attendance,# Contributing 20+ Total Days,# Chronically Absent,% Chronically Absent

then the output file would start:

District,Grade,Year,Category,# Total Days,# Days Absent,# Days Present,% Attendance,# Contributing 20+ Total Days,# Chronically Absent,% Chronically Absent
1,3,2018-19,All Students,149871,10601,139270,92.9,876,228,26.0
2,3,2018-19,All Students,491432,21170,470262,95.7,2844,278,9.8
3,3,2018-19,All Students,254506,15395,239111,94.0,1488,274,18.4


Hints:

• In the CSV file, Grade can contain both numbers and strings. Since "3" is different from the number 3, you may want to convert to strings before comparison.
• The output CSV should not contain the internal index (i.e. save with index=False).
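The filtering step might look like the following sketch, which works on an already-loaded DataFrame; the full program would read the CSV, apply the filter, and save with index=False. The helper name filterAttendance is invented here for illustration:

```python
import pandas as pd

def filterAttendance(df, grade, year):
    # Compare Grade as strings, since the column mixes numbers (3)
    # and strings ("All Grades").
    mask = (df['Grade'].astype(str) == str(grade)) & (df['Year'] == year)
    return df[mask]
```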

Program 17: Grouping ELA/Math by Districts. Due noon, Friday, 1 October.
(Learning Objective: to successfully combine information from multiple input files and display the results using a pivot table.)

Your program should build on the classwork from Lectures #6 and #9 to build a pivot table, grouped by district and test subject, of the percentage of students that are proficient in each (i.e. score 3 or 4 on the exam). Your program should ask the user for two input CSV files and print a pivot table.

Then a sample run of the program with files truncated to a few schools per district for testing (ela_trunc.csv and math_trunc.csv) starts as:
Enter file containing ELA scores: ela_trunc.csv
Enter file containing MATH scores: math_trunc.csv
Proficiency                      School Name
District Subject
01       ELA        91.891892  THE EAST VILLAGE COMMUNITY SCHOOL
MATH       84.615385               P.S. 184M SHUANG WEN
02       ELA        96.825397           P.S. 77 LOWER LAB SCHOOL
MATH       98.412698           P.S. 77 LOWER LAB SCHOOL

and continues with top scoring schools for each test and each district printed.

Hints:

• Work first with the math scores and get those scores selected. Then, repeat with the ELA scores. Once both have the correct values, combine for the final answer.
• Look at the names used in the row and column indices, and use similar names for your dataframes to avoid having to rename them later.
• A very useful aggregate function is idxmax() which returns the index of the row where the maximum occurs, instead of just the value itself. For example, if you had a DataFrame mathdf with the proficiency column already computed, the following will list all the columns for the school with the maximum proficiency for each district:
mathdf = mathdf.loc[ mathdf.groupby('District')['Proficiency'].idxmax() ]
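A toy example of this idxmax pattern (the data here is invented for illustration):

```python
import pandas as pd

mathdf = pd.DataFrame({'District':    ['01', '01', '02', '02'],
                       'Proficiency': [91.9, 84.6, 96.8, 98.4],
                       'School Name': ['A', 'B', 'C', 'D']})

# Keep, for each district, the row with the highest proficiency.
mathdf = mathdf.loc[mathdf.groupby('District')['Proficiency'].idxmax()]
```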

Program 18: Log Scale. Due noon, Monday, 4 October.
(Learning Objective: gain competency in scaling data via log transformations.)

In Lecture #9 and Section 11.5, we used log scale to visualize data. Since the logarithm function is not defined on non-positive data, we are first going to write a function that removes any tuple that has a 0 or negative value. Our second function transforms the cleaned data to its log values.

Write two functions to be used to display data on a log scale. Each function takes, and returns, two iterables of numeric values (e.g. a Series, np.array, or list restricted to numeric values):

• dropNeg(xS,yS): This function takes two iterables, xS and yS of numeric values. If any entry is not positive in either iterable, that indexed value is dropped from both series. The results are returned as two separate iterables. To do this, first zip the series together, drop all the pairs with zero or negative values, and then unzip to return series with only positive values.

For example, if xS contains [1,2,0,3,4] and yS contains [0,-1.5,4,3,9], then zip(xS,yS) has entries [(1,0),(2,-1.5),(0,4),(3,3),(4,9)]. Dropping all tuples that contain non-positive values yields [(3,3),(4,9)], and the unzipped results, [3,4] and [3,9], are returned.

• logScale(xS,yS): This function assumes that the input iterables contain positive, non-null numeric values, and returns the np.log of each. For example, when applying this function to the inputs [3,4] and [3,9], the function returns [1.098612, 1.386294] and [1.098612, 2.19722458].
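One possible sketch of the two functions, following the zip/unzip approach described above (returning lists; any iterable type would work equally well):

```python
import numpy as np

def dropNeg(xS, yS):
    # Pair up the series and keep only pairs where both values are positive.
    pairs = [(x, y) for x, y in zip(xS, yS) if x > 0 and y > 0]
    if not pairs:
        return [], []
    xs, ys = zip(*pairs)          # "unzip" back into two sequences
    return list(xs), list(ys)

def logScale(xS, yS):
    # Assumes all values are positive (run dropNeg first).
    return np.log(xS), np.log(yS)
```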

Note: you should submit a file with only the standard comments at the top, and these two functions. The grading scripts will then import the file for testing. A sample test program that assumes your program is called p18.py is test18.py.

Program 19: Smoothing with Gaussians. Due noon, Tuesday, 5 October.
(Learning Objective: increase understanding of smoothing and gain fluidity with using distributions for smoothing.)

In Lecture #9 and Section 11.5, we used smoothing to visualize data. For this program, write a function that takes two arguments, a NumPy array of x-axis coordinates and a list of numeric values, and returns the corresponding y-values for the sum of the Gaussian probability density functions (pdf's) centered at each point in the list.

• computeSmoothing(xes,points): This function takes a numpy array, xes, and a list, points, of numeric values. For each p in points, the function should compute the normal probability density function (scipy.stats.norm.pdf) centered at loc = p with standard deviation scale = 0.5 for all values in xes. The return value is a numpy array of the sum of these at each point.
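A minimal sketch, assuming scipy.stats.norm is available:

```python
import numpy as np
from scipy.stats import norm

def computeSmoothing(xes, points):
    # Sum a Gaussian bump (scale = 0.5) centered at each data point,
    # evaluated across the whole x-grid.
    density = np.zeros_like(xes, dtype=float)
    for p in points:
        density += norm.pdf(xes, loc=p, scale=0.5)
    return density
```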

For example, calling the function:

xes = np.linspace(0, 10, 1000)
density = computeSmoothing(xes,[5])
plt.plot(xes,density)
plt.show()
would give the plot:

Since there is only one point given (namely 5), the returned value is the probability density function centered at 5 (with scale = 0.5) computed for each of the xes.

For example, calling the function:

pts = [2,2,5,5,2,3,4,6,7,9]
xes = np.linspace(0, 10, 1000)
density = computeSmoothing(xes,pts)
plt.plot(xes,density)
plt.fill_between(xes,density)
plt.show()
would give the plot:

Since there are 10 points given, the function computes the probability density function centered at each of the points, across all the values in xes. It then sums these contributions and returns an array of the same length as xes.

Note: you should submit a file with only the standard comments at the top, and this function. The grading scripts will then import the file for testing.

Hint: Include only the libraries you need (such as numpy and scipy.stats) and none of the ones for plotting (such as matplotlib.pyplot and seaborn), since this function is computing and not plotting.

Set 4: The fourth set of programs introduces modeling and estimation, focusing on loss functions and linear modeling.

Program 20: Loss Functions for Tips. Due noon, Thursday, 7 October.
(Learning Objective: strengthen competency with loss functions by applying the techniques to a dataset of tips.)

In Lecture #10 and Section 4.2, we introduced loss functions to measure how well our estimates fit the data.

Using the mean squared loss function mse_loss and the mean absolute loss function abs_loss (Section 4.2), write two separate functions that take in estimates and tip data and return the respective loss for each of the estimates against the data.

• mse_estimates(thetas,tips): This function takes two iterables of numeric values:
• thetas: estimates for the population parameter for the percent tips in values, and
• tips: the tips observed, assumed to be a positive percentage, ranging from 0 to 100.
For each theta in thetas, it should compute the mean squared error between theta and tips. Return an iterable of the values computed.
• mae_estimates(thetas,tips): This function takes two iterables of numeric values:
• thetas: estimates for the population parameter for the percent tips in values, and
• tips: the tips observed, assumed to be a positive percentage, ranging from 0 to 100.
For each theta in thetas, it should compute the mean absolute error between theta and tips. Return an iterable of the values computed.
Note: for each of these functions, your returned value will be an iterable with the same length as thetas.
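These two functions can be sketched with numpy, computing one loss value per candidate theta:

```python
import numpy as np

def mse_estimates(thetas, tips):
    tips = np.asarray(tips, dtype=float)
    # Mean squared error of each candidate theta against the observed tips.
    return np.array([np.mean((tips - theta) ** 2) for theta in thetas])

def mae_estimates(thetas, tips):
    tips = np.asarray(tips, dtype=float)
    # Mean absolute error of each candidate theta.
    return np.array([np.mean(np.abs(tips - theta)) for theta in thetas])
```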

For example, calling the function:

thetas = np.array([12, 13, 14, 15, 16, 17])
y_vals = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
mse_losses = p20.mse_estimates(thetas,y_vals)
abs_losses = p20.mae_estimates(thetas,y_vals)
plt.scatter(thetas, mse_losses, label='MSE')
plt.scatter(thetas, abs_losses, label='MAE')
plt.title(r'Loss vs. $\theta$ when $\bf{y}$$= [ 12.1, 12.8, 14.9, 16.3, 17.2 ]$')
plt.xlabel(r'$\theta$ Values')
plt.ylabel('Loss')
plt.legend()
plt.show()
would give the plot:

For example, calling the function:

thetas = np.arange(30)
tips_df = sns.load_dataset('tips')
tipsPercent = (tips_df['tip']/tips_df['total_bill'])*100
mse_losses = p20.mse_estimates(thetas, tipsPercent)
abs_losses = p20.mae_estimates(thetas, tipsPercent)
plt.plot(thetas, mse_losses, label='MSE')
plt.plot(thetas, abs_losses, label='MAE')
plt.title(r'Loss vs. $\theta$ for sns tips data')
plt.xlabel(r'$\theta$ Values')
plt.ylabel('Loss')
plt.legend()
plt.show()
would give the plot:

Note: you should submit a file with only the standard comments at the top, and these two functions. The grading scripts will then import the file for testing.

Hint: Include only the libraries you need (such as numpy) and none of the ones for plotting (such as matplotlib.pyplot and seaborn) since this function is computing and not plotting.

Program 21: Taxi Cleaning. Due noon, Friday, 8 October.
(Learning Objective: To build up (or refresh) skills at manipulating tabular data, in particular, to use arithmetic operations on columns to create new columns.)

Write a program, tailored to the NYC OpenData Yellow Taxi Trip Data, that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Next, the program should copy the input file and create two new columns: percent_tip, which is 100*tip_amount/fare_amount and percent_fare, which is 100*fare_amount/total_amount. Your program should write out a new CSV file (with the name provided by the user) with the original columns as well as the two newly computed ones.

For example, if the file, taxi_new_years_day_2020.csv, was the first of January 2020 entries downloaded from 2020 Yellow Taxi Trip Data (about 170,000 entries) with the first 3 lines of:

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
1,01/01/2020 12:00:00 AM,01/01/2020 12:13:03 AM,1,2.2,1,N,68,170,1,10.5,3,0.5,2.85,0,0.3,17.15,2.5
2,01/01/2020 12:00:00 AM,01/01/2020 01:08:55 AM,5,1.43,1,N,48,239,2,6.5,0.5,0.5,0,0,0.3,10.3,2.5

Then a sample run of the program:
Enter input file name: taxi_new_years_day_2020.csv
Enter output file name:  taxi_Jan2020_with_percents.csv

And the first three lines of taxi_Jan2020_with_percents.csv would be:
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,percent_tip,percent_fare
1.0,01/01/2020 12:00:00 AM,01/01/2020 12:13:03 AM,1.0,2.2,1.0,N,68,170,1.0,10.5,3.0,0.5,2.85,0.0,0.3,17.15,2.5,27.1,61.2
2.0,01/01/2020 12:00:00 AM,01/01/2020 01:08:55 AM,5.0,1.43,1.0,N,48,239,2.0,6.5,0.5,0.5,0.0,0.0,0.3,10.3,2.5,0.0,63.1


You should round the values stored in your new columns to the nearest tenth and save your CSV file without the indexing (i.e. index=False).
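The column arithmetic can be sketched as a helper that works on an already-loaded DataFrame (the helper name addPercents is invented here; the actual program wraps it with the input/output prompts and pd.read_csv/to_csv):

```python
import pandas as pd

def addPercents(df):
    df = df.copy()
    # New columns, rounded to the nearest tenth as required.
    df['percent_tip'] = (100 * df['tip_amount'] / df['fare_amount']).round(1)
    df['percent_fare'] = (100 * df['fare_amount'] / df['total_amount']).round(1)
    return df
```

The full program would then finish with df.to_csv(outfile, index=False).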

Program 22: Dice Simulator. Due noon, Thursday, 14 October.
(Learning Objective: students will be able to apply their knowledge of the built-in random package to generate simulations of simple phenomena.)

Write a function, diceSim(S1,S2,trials), that simulates rolling two dice with S1 and S2 sides, respectively, trials times, and returns an array of the fraction of rolls for each possible sum from 0 to S1+S2 (the entries for the impossible sums 0 and 1 will always be 0).
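A sketch under the assumption that diceSim(S1, S2, trials) rolls two dice with S1 and S2 sides trials times and returns the fraction of rolls for each sum from 0 to S1+S2:

```python
import random
import numpy as np

def diceSim(S1, S2, trials):
    # counts[s] tallies how many rolls summed to s; sums 0 and 1 never occur.
    counts = np.zeros(S1 + S2 + 1)
    for _ in range(trials):
        counts[random.randint(1, S1) + random.randint(1, S2)] += 1
    return counts / trials
```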

Since the numbers are chosen at random, the fractions will differ some from run to run. One run of the function print(p22.diceSim(6,6,10000)) resulted in:


[0.     0.     0.0259 0.0615 0.0791 0.1086 0.139  0.1633 0.1385 0.114  0.0833 0.0587 0.0281]
or, displayed using the code from Section 16.1.1:

Note: you should submit a file with only the standard comments at the top and the function. The grading scripts will then import the file for testing.

Program 23: Correlation Coefficients. Due noon, Friday, 15 October.
(Learning Objective: to refresh students' knowledge of Pearson's correlation coefficient and to increase fluidity with using statistical functions in Python.)

Write a function that will find the column with the highest absolute correlation coefficient with a given column in a DataFrame. Your function should take as inputs the column of interest, a list of possible correlated columns, and the DataFrame. The function should return the name of that column and Pearson's r correlation coefficient (which can be computed using the Pandas method series1.corr(series2), where series1 and series2 are Pandas Series).

For example, assuming your function findHighestCorr() is in p23.py:

simpleDF = pd.DataFrame({'c1': [1,2,3,4],\
'c2': [0,1,0,1],\
'c3': [1,10,3,20],\
'c4': [-10,-20,-30,-40],})
print('Testing with c1 and [c3,c4]:')
print(p23.findHighestCorr('c1',['c3','c4'],simpleDF))
print(f'c1 has highest absolute r with {p23.findHighestCorr("c1",simpleDF.columns, simpleDF)}.')

Would give output:
Testing with c1 and [c3,c4]:
('c4', -1.0)
c1 has highest absolute r with ('c1', 1.0)

since the correlation coefficients between simpleDF['c1'] and the other three columns are 0.4472135954999579, 0.7520710469952336, and -1.0, respectively, and the largest absolute correlation is with simpleDF['c4'].
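A compact sketch of the function (note that the column of interest may itself appear in the candidate list, as in the second call above):

```python
import pandas as pd

def findHighestCorr(col, candidates, df):
    # Pick the candidate whose correlation with col has the largest magnitude,
    # then return its name together with the (signed) coefficient.
    best = max(candidates, key=lambda c: abs(df[col].corr(df[c])))
    return best, df[col].corr(df[best])
```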

Using the function on the seaborn tips dataset:

import seaborn as sns
tips = sns.load_dataset('tips')
print(f"Correlation coefficient between tips and size is \
{tips['tip'].corr(tips['size'])}")
print(f"For tip, the highest correlation is \
{p23.findHighestCorr('tip',['total_bill','size'],tips)}.")

will print
Correlation coefficient between tips and size is         0.4892987752303577
For tip, the highest correlation is         ('total_bill', 0.6757341092113641).


Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import your file for testing.

Write a function, computeEnrollments(), that takes a DataFrame that contains students' names, number of credits completed, and current courses (a string with the course names separated by spaces), and returns a DataFrame that

1. Includes only students taking 3 or more courses, and
2. Replaces the column of current courses with three new columns: the first with the total number of courses the student is taking, the second with the number of computer science courses (all courses whose names start with 'csci'), and the third with the number of other courses the student is taking.

For example, assuming your function computeEnrollments() is in p24.py:

classDF = pd.DataFrame({'Name': ["Ana","Bao","Cara","Dara","Ella","Fatima"],\
'# Credits': [45,50,80,115,30,90],\
'Current Courses': ["csci160 csci235 math160 jpn201",\
"csci160 csci235 cla101 germn241",\
"csci265 csci335 csci39542 germn241",\
"csci49362 csci499",\
"csci150 csci235 math160",\
"csci335 csci39542 cla101 dan102"]})
print(f'Starting df:\n {classDF}')
print(f'Ending df:\n {p24.computeEnrollments(classDF)}')

Would give output:
Starting df:
Name  # Credits                     Current Courses
0     Ana         45      csci160 csci235 math160 jpn201
1     Bao         50     csci160 csci235 cla101 germn241
2    Cara         80  csci265 csci335 csci39542 germn241
3    Dara        115                   csci49362 csci499
4    Ella         30             csci150 csci235 math160
5  Fatima         90     csci335 csci39542 cla101 dan102

Ending df:
Name  # Credits  NumCourses  CS  Other
0     Ana         45           4   2      2
1     Bao         50           4   2      2
2    Cara         80           4   3      1
4    Ella         30           3   2      1
5  Fatima         90           4   2      2


The resulting DataFrame has only 5 students, since Dara is taking fewer than 3 courses and that row is dropped.

Hints:

• While straightforward, this problem is much easier to do in pieces, debugging/printing, as you go. Add lots of print statements (or work inside a debugger) to make sure each part is correct before building on the next one.
• To compute the number of courses taken, count the number of spaces and add 1.
• Using the string (.str) methods is an excellent approach, but you may get a SettingWithCopyWarning since pandas is concerned that the new column is being set based on a subsetted value (i.e. the particular column). To explicitly assign the values, use df.assign. For example, if you have computed the number of courses and stored them in a Series, totalCourses, you can then write df = df.assign(NumCourses = totalCourses) to create a new column, NumCourses, in the DataFrame.
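Putting the hints together, one possible sketch:

```python
import pandas as pd

def computeEnrollments(df):
    # Split each course string into a list of course names.
    courses = df['Current Courses'].str.split()
    total = courses.str.len()
    cs = courses.apply(lambda cl: sum(c.startswith('csci') for c in cl))
    df = df.assign(NumCourses=total, CS=cs, Other=total - cs)
    # Keep students taking 3+ courses; the course string itself is dropped.
    return df[df['NumCourses'] >= 3].drop(columns='Current Courses')
```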

Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import the file for testing.

Section 16.1 (Random Variables) of the textbook has a small example in which the probability mass function of a data set of ages is computed by hand. Write a function that automates this process:

• pmf(vals): This function takes one parameter:
• vals: an iterable of numeric values, assumed to be positive numbers.
• The function should return two iterables of positive numbers:
• uniqueVals: a sorted list of the unique values from the input parameter vals, and
• mass: the fraction that each uniqueVals occurs in vals.
• The order of mass should follow that of uniqueVals. That is, the ith element of mass should be the fraction computed for the ith value of uniqueVals.
• The sum(mass) should equal 1.
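A direct sketch (returning lists; the textbook example prints tuples, and either kind of iterable satisfies the spec):

```python
def pmf(vals):
    vals = list(vals)
    uniqueVals = sorted(set(vals))
    # Fraction of occurrences of each unique value, in the same order.
    mass = [vals.count(v) / len(vals) for v in uniqueVals]
    return uniqueVals, mass
```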

For example, calling the function on the example from the textbook:

x, y = p25.pmf([50,50,52,54])
print(f'The values are: {x}')
print(f'The pmf is: {y}')
print(f'The sum of the pmf is: {sum(y)}.')
plt.bar(x,y)
plt.show()
would print:
The values are: (50, 52, 54)
The pmf is: (0.5, 0.25, 0.25)
The sum of the pmf is: 1.0.
and would give the plot:

For example, calling the function on the senators' ages from Program 3:

senators = pd.read_csv('senatorsAges.csv')
xSen,ySen = p25.pmf(senators['age'])
plt.bar(xSen,ySen)
plt.show()
would give the plot:

Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import the file for testing.

Hint: Include only the libraries you need (such as numpy) and none of the ones for plotting (such as matplotlib.pyplot and seaborn) since this function is computing and not plotting.

Program 26: Weekday Entries. Due noon, Thursday, 21 October.
(Learning Objective: to strengthen data cleaning skills and familiarity with standard date/time formats.)

Use the date time functionality of Pandas to write the following functions:

• tripTime(start,end): This function takes two variables of type datetime and returns the difference between them.
• weekdays(df,col): This function takes a DataFrame, df, containing the column name, col, and returns a DataFrame containing only times that fall on a weekday (i.e. Monday through Friday).
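Both functions can be sketched with the pandas datetime tools (dayofweek values 0-4 correspond to Monday-Friday):

```python
import pandas as pd

def tripTime(start, end):
    # Cast in case the values arrive as strings.
    return pd.to_datetime(end) - pd.to_datetime(start)

def weekdays(df, col):
    days = pd.to_datetime(df[col]).dt.dayofweek
    return df[days < 5]          # 0-4 = Monday through Friday
```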

For example, using Seaborn's Green Taxi Data Set and assuming your functions are in p26.py:

taxi = sns.load_dataset('taxis')
print(taxi.iloc[0:10])  #Print first 10 lines:
taxi['tripTime'] = taxi.apply(lambda x: p26.tripTime(x['pickup'], x['dropoff']), axis=1)
print(taxi.iloc[0:10])

Would give output:
                pickup              dropoff  ...  pickup_borough  dropoff_borough
0  2019-03-23 20:21:09  2019-03-23 20:27:24  ...       Manhattan        Manhattan
1  2019-03-04 16:11:55  2019-03-04 16:19:00  ...       Manhattan        Manhattan
2  2019-03-27 17:53:01  2019-03-27 18:00:25  ...       Manhattan        Manhattan
3  2019-03-10 01:23:59  2019-03-10 01:49:51  ...       Manhattan        Manhattan
4  2019-03-30 13:27:42  2019-03-30 13:37:14  ...       Manhattan        Manhattan
5  2019-03-11 10:37:23  2019-03-11 10:47:31  ...       Manhattan        Manhattan
6  2019-03-26 21:07:31  2019-03-26 21:17:29  ...       Manhattan        Manhattan
7  2019-03-22 12:47:13  2019-03-22 12:58:17  ...       Manhattan        Manhattan
8  2019-03-23 11:48:50  2019-03-23 12:06:14  ...       Manhattan        Manhattan
9  2019-03-08 16:18:37  2019-03-08 16:26:57  ...       Manhattan        Manhattan

[10 rows x 14 columns]
pickup              dropoff  ...  dropoff_borough        tripTime
0  2019-03-23 20:21:09  2019-03-23 20:27:24  ...        Manhattan 0 days 00:06:15
1  2019-03-04 16:11:55  2019-03-04 16:19:00  ...        Manhattan 0 days 00:07:05
2  2019-03-27 17:53:01  2019-03-27 18:00:25  ...        Manhattan 0 days 00:07:24
3  2019-03-10 01:23:59  2019-03-10 01:49:51  ...        Manhattan 0 days 00:25:52
4  2019-03-30 13:27:42  2019-03-30 13:37:14  ...        Manhattan 0 days 00:09:32
5  2019-03-11 10:37:23  2019-03-11 10:47:31  ...        Manhattan 0 days 00:10:08
6  2019-03-26 21:07:31  2019-03-26 21:17:29  ...        Manhattan 0 days 00:09:58
7  2019-03-22 12:47:13  2019-03-22 12:58:17  ...        Manhattan 0 days 00:11:04
8  2019-03-23 11:48:50  2019-03-23 12:06:14  ...        Manhattan 0 days 00:17:24
9  2019-03-08 16:18:37  2019-03-08 16:26:57  ...        Manhattan 0 days 00:08:20

[10 rows x 15 columns]


Using our second function:

taxi = sns.load_dataset('taxis')
weekdays = p26.weekdays(taxi,'pickup')
print(weekdays.iloc[0:10])

will give output:

pickup              dropoff  ...  pickup_borough  dropoff_borough
1   2019-03-04 16:11:55  2019-03-04 16:19:00  ...       Manhattan        Manhattan
2   2019-03-27 17:53:01  2019-03-27 18:00:25  ...       Manhattan        Manhattan
5   2019-03-11 10:37:23  2019-03-11 10:47:31  ...       Manhattan        Manhattan
6   2019-03-26 21:07:31  2019-03-26 21:17:29  ...       Manhattan        Manhattan
7   2019-03-22 12:47:13  2019-03-22 12:58:17  ...       Manhattan        Manhattan
9   2019-03-08 16:18:37  2019-03-08 16:26:57  ...       Manhattan        Manhattan
11  2019-03-20 19:39:42  2019-03-20 19:45:36  ...       Manhattan        Manhattan
12  2019-03-18 21:27:14  2019-03-18 21:34:16  ...       Manhattan        Manhattan
13  2019-03-19 07:55:25  2019-03-19 08:09:17  ...       Manhattan        Manhattan
14  2019-03-27 12:13:34  2019-03-27 12:25:48  ...       Manhattan        Manhattan

[10 rows x 14 columns]

Note that rows 0, 3, 4, 8, and 10 have been dropped from the original DataFrame, since those correspond to weekend days.

Hints:
• When read in from a CSV, the columns may be stored as strings. Cast them to datetime (e.g. pd.to_datetime(start)) to use the date/time functionality.
• For datetime objects, you can access properties such as the day of the week using the .dt prefix (similar to using .str for string methods), e.g. dt.dayofweek. See the Python Docs: date time functionality for more details.

Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import the file for testing.

More to come...

## Project

The required final project synthesizes the skills acquired in the course to analyze and visualize data on a topic of your choosing. It is your chance to demonstrate what you have learned, your creativity, and a project that you are passionate about. The intended audience for your project is your classmates as well as tech recruiters and potential employers.

### Milestones

The project is broken down into smaller pieces that must be submitted by the deadlines below. For details of each milestone, see the links. The project is worth 25% of the final grade. The point breakdown is listed in the right hand column.

Wednesday, 6 October, noon Pre-Proposal 15 Gradescope
Wednesday, 20 October, noon Title & Proposal 20 Blackboard
Wednesday, 27 October, noon Peer Review #1 15 Blackboard
Wednesday, 3 November, noon Check-in #1 (Data Collection) 20
Wednesday, 10 November, noon Check-in #2 (Analysis) 20
Wednesday, 17 November, noon Check-in #3 (Visualization) 20
Wednesday, 1 December, noon Draft Abstract & Website 25
Monday, 6 December, noon Peer Review #2 15 Blackboard
Thursday, 9 December, noon Abstract 25
Friday, 10 December, noon Complete Project Website 50
Monday, 13 December, noon Project Video 25 Blackboard
Total Points: 250

### Pre-Proposal

This pre-proposal is meant to guide you as you brainstorm about your project. It will also lead up to a more formal and structured project proposal later on. The window for submitting pre-proposals opens Wednesday, 29 September. If you would like feedback and the opportunity to resubmit for a higher grade, submit early in the window. Feel free to re-submit as many times as you like, up until the assignment deadline. The instructing team will work hard to give feedback on your submission as quickly as possible, and we will grade them in the order they were received.

In the pre-proposal, answer each question with 1 to 2 sentences:

• Overview: Describe your project in layman terms. Think of this as an elevator pitch.
• Importance: Describe why this project has personal significance to you.
• Originality: Describe why you believe this project idea is unique and original.
• Methods: Describe how you are planning to apply the data science skills you learned in this class to this project.
• Data: What dataset(s) are you thinking of using?

### Title & Proposal

The title and proposal serve multiple purposes: they provide a framework to structure the proposed work, can form the overview on your project website, and make the basis of an elevator pitch when asked in interviews to explain your project from your digital portfolio.

The structure echoes that of the pre-proposal; an excellent way to start is to expand your pre-proposal into the template below, incorporating the feedback you received. Submission is via Blackboard Turnitin; accepted file formats include PDF, HTML, DOC, and RTF as well as plain text. Your project proposal will be evaluated by three other students in the course using the rubric below.

Your file should include the following:

• Title: a title that accurately captures your project.
• Objective: A paragraph concisely describing the problem you need to solve and your recommended solution. A tl;dr, if you will. Keep it short: 2 - 3 sentences. More background information and details of the solution belong in subsequent sections. Feel free to use portions of the pre-proposal here.
• Background: Why is it important to solve this problem? Whom is it for? Give some general context for the reader.
• What are the existing solutions, if any, and what are their drawbacks?
• Key Term Definitions: What are some terms specific to your project that someone might not know? ELI5 them here.
• Background resources: Include a link to any research you have done here.
• Solution Overview: Give a high-level overview of your proposed solution. Diagrams are very useful in this section to describe the solution (e.g. how the data flows through the model) but not required. Also include:
• Dependencies and assumptions: What version of Python is being used and what additional libraries?
• Inputs: Include a list of your planned data sources with URL's. (Note all data must be publicly available.)
• Expected Outputs: What outputs do you plan to generate? Include planned data visualizations, descriptive statistics, models, and predictions.
• Success Metric: What is defined as a success? How is that measured?
• Security and Privacy Considerations: Will you be working with personal identifiable information (PII)? Can your model be mis-used for evil, not good? How do you plan to mitigate that?
• Appendix: (Optional): Include miscellaneous action items, decisions, links, research, and any open questions still not addressed.

### Peer Review #1

The proposal and title will be graded by three other students, following the rubric below.

#### Grading Rubric for First Peer Review:

1. Does the title accurately capture the planned project?
2. Read through the proposal and describe it in your own words in 2 sentences.
3. Does the objective section clearly describe the project? What would you add to make it clearer or more reflective of the project?
4. Why is this project important or interesting? Include the reasons from the proposal. If you found none, provide two reasons.
5. Was the explanation of key terms sufficient?
6. Did the links provided work? Was the data chosen well and sufficient to accomplish the objectives above?
7. Are the libraries and dependencies appropriate for the project?
8. Are the planned outputs appropriate for the project?
9. Were security and privacy considerations handled sufficiently?
10. Based on their success metric, do you think the proposed solution will be successful? Why or why not?

More to come...