Unless otherwise noted, classwork is submitted via Gradescope. Access information is given during the corresponding lecture.
Due to the internet issues in the lecture hall, for Classwork 2 onward, the classwork will be available until midnight. If you attended class that day, there is an option to earn 0.5 points for attendance and space to include the row and seat number. If you were not able to attend a given lecture, you can still work through the classwork at home and we will replace the fractional point for that classwork with the grade you earned on the final exam. Do not say you were in the room if you did not attend.
Classwork 0: Due midnight, Monday, 31 January.
Available on Gradescope, this classwork focuses on the course syllabus.
If you do not have access to the course on Gradescope, write to datasci@hunter.cuny.edu. Include in your email that you did not receive a Gradescope invitation and your preferred email, and we will manually generate an invitation.
Classwork 1: Due 4pm, Monday, 31 January.
Available during Lecture 1 on Gradescope (paper version also available for those without a phone or laptop at lecture), this classwork complements the exploratory data analysis of names and foreshadows the sampling of data in Lecture 2.
Classwork 2: Due midnight, Thursday, 3 February.
Available during Lecture 2 on Gradescope, this classwork introduces the autograder that is used for the programming assignments. The structure of the sample program mirrors the structure and content of the upcoming Program 1. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture. Write a function that takes the name of a file and makes a dictionary of the lines of the file.
For example, assuming the function is in a file, cw2.py, it can be run on files such as contacts.txt and nick_names.txt.
If you attended lecture, include the last three lines in the introductory comment:
Classwork 3: Due midnight, Monday, 7 February.
Available during Lecture 3 on Gradescope, this classwork asks that you write a program using Pandas and its file I/O. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.
Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user.
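Next, the program should select rows where the field Grade is equal to 3 and the Year is equal to 2019, and write all rows that match those criteria to a new CSV file.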
Then a sample run of the program:
Hints:
If you attended lecture, include the last three lines in the introductory comment:
Classwork 4: Due midnight, Thursday, 10 February.
Available during Lecture 4 on HackerRank, this classwork introduces the timed coding environment used for quizzes. This classwork mirrors the structure and content of the upcoming Quiz 1.
To get the most out of this exercise, bring an electronic device on which you can easily type into a web-based IDE (possible on a phone, but much easier with the bigger screen and keyboards on some tablets and most laptops).
Classwork 5: Due midnight, Monday, 14 February.
Available during Lecture 5 on Gradescope, this classwork focuses on the structure and topics for the optional project, based on the project overview in lecture.
Classwork 6: Due midnight, Thursday, 17 February.
Available during Lecture 6 on Gradescope, this on-line assignment reviews the different ways to merge DataFrames in Pandas.
Classwork 7: Due midnight, Thursday, 24 February.
Available during Lecture 7 on Gradescope, this classwork introduces regular expressions for data cleaning. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.
Write a program that asks the user for the name of an input HTML file and the name of an output CSV file. Your program should use regular expressions (see Chapter 12.4 for using the re package in Python) to find all links in the input file.
If you attended lecture, include the last three lines in the introductory comment:
Classwork 8: Due midnight, Monday, 28 February.
Available during Lecture 8 on Gradescope, this classwork focuses on the GeoJSON format, including a hands-on activity with a GeoJSON visual editor. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.
Classwork 9: Due midnight, Thursday, 3 March.
Available during Lecture 9 on Gradescope, this classwork is modeled on an analytic reasoning challenge of efficiently computing catchment areas (Voronoi diagrams) for NYC libraries. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 10: Due midnight, Monday, 7 March.
Available during Lecture 10 on Gradescope, this classwork builds intuition about how correlated datasets are. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 11: Due midnight, Thursday, 10 March.
Available during Lecture 11 on Gradescope, this classwork reviews probability distributions and sampling. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 12: Due midnight, Monday, 14 March.
Available during Lecture 12 on Gradescope, this classwork focuses on empirical analysis of random variables. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.
Write a function:
Since the numbers are chosen at random, the fractions will differ somewhat from run to run.
If you attended lecture, include the last three lines in the introductory comment:
Classwork 13: Due midnight, Thursday, 17 March.
Available during Lecture 13 on Gradescope, this classwork focuses on gradient descent. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 14: Due 4pm, Monday, 21 March.
Available during Lecture 14 on Gradescope, this classwork reviews the first half of the course, following the topics in DS100 Fall 19 Midterm. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 15: Due 4pm, Thursday, 24 March.
Available during Lecture 15 on Gradescope, this classwork focuses on regularization. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 16: Due 4pm, Monday, 28 March.
Available during Lecture 16 on Gradescope, this classwork focuses on computing loss functions. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 17: Due 4pm, Thursday, 31 March.
Available during Lecture 17 on Gradescope, this classwork focuses on nominal data. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 18: Due 4pm, Monday, 4 April.
Available during Lecture 18 on Gradescope, this classwork focuses on linear separability and classification. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 19: Due 4pm, Thursday, 7 April.
Available during Lecture 19 on Gradescope, this classwork focuses on linear algebra. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 20: Due 4pm, Monday, 11 April.
Available during Lecture 20 on Gradescope, this classwork focuses on intrinsic dimensionality. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 21: Due 4pm, Thursday, 14 April.
Available during Lecture 21 on Gradescope, this classwork focuses on distance metrics. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Using Google Maps API, we generated the amount of time it would take to travel between the following landmarks:
Of the three, which best estimates the (aerial) distance?
Classwork 22: Due 4pm, Monday, 25 April.
Available during Lecture 22 on Gradescope, this classwork focuses on supervised vs. unsupervised learning. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 23: Due 4pm, Thursday, 28 April.
Available during Lecture 23 on Gradescope, this classwork focuses on clustering via K-means. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 24: Due midnight, Wednesday, 4 May.
Available during Lecture 24 on Gradescope, this classwork asks about your final examination plans (e.g. Do you need a laptop for the coding exam? Would you prefer a left-handed desk for the exam? Do you have accommodations from the Office of AccessABILITY? Do you need to take the exam(s) early?). If no survey is submitted, the default is that you will take the exams during the regularly assigned times at a right-handed desk and will bring your own laptop to the coding exam.
Classwork 25: Due 4pm, Thursday, 5 May.
Available during Lecture 25 on Gradescope, this classwork focuses on SQL. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 26: Due 4pm, Monday, 9 May.
Available during Lecture 26 on Gradescope, this classwork is a review for the final examinations. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.
Classwork 27: Due 4pm, Thursday, 12 May.
Available during Lecture 27 on Gradescope, this classwork is on paper at your assigned seat for the exams (or in the front section, if you have requested to take either exam at an alternate time).
Quiz 1: Core Python. Due 4pm, Friday, 11 February.
Link to access HackerRank available at the end of Lecture 4 (posted on Blackboard).
Quiz 2: Pandas Basics. Due 4pm, Friday, 18 February.
Link to access HackerRank available at the end of Lecture 6 (posted on Blackboard).
Quiz 3: Aggregating in Pandas. Due 4pm, Friday, 25 February.
Link to access HackerRank available at the end of Lecture 7 (posted on Blackboard).
Quiz 4: Datetime. Due 4pm, Friday, 4 March.
Link to access HackerRank available at the end of Lecture 9 (posted on Blackboard).
Quiz 5: Regular Expressions. Due 4pm, Friday, 11 March.
Link to access HackerRank available at the end of Lecture 11 (posted on Blackboard).
Quiz 6: Summary Statistics. Due 4pm, Friday, 18 March.
Link to access HackerRank available at the end of Lecture 13 (posted on Blackboard).
Quiz 7: Loss Functions. Due 4pm, Friday, 25 March.
Link to access HackerRank available at the end of Lecture 15 (posted on Blackboard).
Quiz 8: Imputing Numerical Values & Fitting Models. Due 4pm, Friday, 1 April.
Link to access HackerRank available at the end of Lecture 17 (posted on Blackboard).
Quiz 9: Feature Engineering & Categorical Encoding. Due 4pm, Friday, 8 April.
Link to access HackerRank available at the end of Lecture 19 (posted on Blackboard).
Quiz 10: Linear Separability & Classifiers. Due 4pm, Friday, 15 April.
Due to the holiday on April 15, the link to access HackerRank will be available at 8am on Thursday, 14 April to allow students who are observing the holiday to be able to complete the quiz on Thursday.
Quiz 11: Dimensionality Reduction. Due 4pm, Friday, 29 April.
Link to access HackerRank available at the end of Lecture 23 (posted on Blackboard).
Quiz 12: Clustering. Due 4pm, Friday, 6 May.
Link to access HackerRank available at the end of Lecture 25 (posted on Blackboard).
Quiz 13: End-of-Semester Survey. Due 4pm, Friday, 13 May.
Available on Blackboard at the end of Lecture 27.
All students registered by Monday, 26 January are sent a registration invitation to the email on record on their Blackboard account. If you did not receive the email or would like to use a different account, write to datasci@hunter.cuny.edu.
To encourage starting early on programs, bonus points are given for early submission. A point a day, up to a total of 3 bonus points (10% of the program grade), is possible. The points are prorated by hour. For example, if you turn in the program 36 hours early, then the bonus points are: (36 hours/3 days)*3 points = (36 hours/72 hours)*3 points = 1.5 points.
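The proration described above can be sketched as a small function. This is only an illustration of the arithmetic, not the grading code: the name early_bonus and its parameters are hypothetical, and it assumes that hours beyond 3 days earn no additional bonus.

```python
def early_bonus(hours_early, max_days=3, points_per_day=1.0):
    """Prorated bonus points for submitting hours_early hours before the deadline.

    Hypothetical helper: one point per 24 hours, prorated by hour,
    capped at max_days days (an assumption based on the 3-point maximum).
    """
    max_hours = max_days * 24
    # Clamp to the range [0, max_hours] so late or very early submissions behave sensibly.
    capped = max(0.0, min(float(hours_early), max_hours))
    return capped / 24 * points_per_day


# The example from the text: 36 hours early.
print(early_bonus(36))  # 1.5
```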
To get full credit for a program, the file must include in the opening comment:
Program 1: Popular Names. Due noon, Thursday, 10 February.
Program 2: Parking Tickets. Due noon, Thursday, 17 February.
Program 3: Restaurant Rankings. Due noon, Thursday, 24 February.
Program 4: Restaurant Cleaning. Due noon, Thursday, 3 March.
Program 5: Regex Logs. Due noon, Thursday, 10 March.
Program 6: Housing Units. Due noon, Thursday, 17 March.
Program 7: Housing Model. Due noon, Thursday, 24 March.
Program 8: Yellow Taxi Data. Due noon, Thursday, 31 March.
Program 9: Logistic Taxi. Due noon, Thursday, 7 April.
Program 10: Classifying Digits. Due noon, Thursday, 14 April.
Program 11: Digit Dimensions. Due noon, Thursday, 28 April.
Program 12: EMS Stations. Due noon, Thursday, 5 May.
Program 13: EMS Queries. Due noon, Thursday, 12 May.
The grade for the project is a combination of grades earned on the milestones (i.e. deadlines during the semester to keep the projects on track) and the overall submitted program. If you choose not to complete the project, your final exam grade will replace its portion of the overall grade.
Note: Hunter College is committed to all students having the technology needed for their courses. If you are in need of technology, see
Student Life's Support & Resources Page.
make_dict(file_name, sep=': '): Takes a name of a file, file_name, and a delimiter sep. The default value is ': '. If a line of the file does not include sep, the line should be ignored. Otherwise, for each line, the string preceding the delimiter sep is the key, and the string after sep is the value. Your function returns the dictionary.
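One way the make_dict description above could be realized is sketched here. It is a minimal version under stated assumptions, not the reference solution: it assumes a line is split at the first occurrence of sep only, and that the trailing newline is not part of the value.

```python
def make_dict(file_name, sep=': '):
    """Build a dictionary from the lines of file_name.

    Lines without sep are ignored. Assumption: each line is split at the
    FIRST occurrence of sep, so the value may itself contain sep.
    """
    result = {}
    with open(file_name) as infile:
        for line in infile:
            line = line.rstrip('\n')  # drop the newline so it is not stored in the value
            if sep not in line:
                continue  # skip lines that do not contain the delimiter
            key, value = line.split(sep, 1)
            result[key] = value
    return result
```

If a key appears on more than one line, this sketch keeps the value from the last such line.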
With the function in a file, cw2.py, and run on contacts.txt, a file containing names that start with 'A':
contacts = cw2.make_dict('contacts.txt')
who = 'CS Department'
print(f'Contact info for {who} is {contacts[who]}.')
will print:
Contact info for CS Department is 10th Floor HN, x5213.
Another example, with nick_names.txt:
nick_names = cw2.make_dict('nick_names.txt', sep = ' ')
names = ['Beth','Lisa','Meg','Greta','Amy','Mia']
for n in names:
    print(f'Full name for {n} is {nick_names[n]}.')
will print:
Full name for Beth is Elizabeth.
Full name for Lisa is Elizabeth.
Full name for Meg is Margaret.
Full name for Greta is Margaret.
Full name for Amy is Amelia.
Full name for Mia is Amelia.
"""
Name: YOUR_NAME
Email: YOUR_EMAIL
Resources: RESOURCES USED
I attended lecture today.
Row: YOUR_ROW
Seat: YOUR_SEAT
"""
If you did not attend lecture, do not include the last three lines.
A sample run that selects rows where Grade is equal to 3 and the Year is equal to 2019, writing all rows that match those criteria to a new CSV file:
Enter input file name: school-ela-results-2013-2019.csv
Enter output file name: ela2013.csv
The file school-ela-results-2013-2019.csv is extracted from NYC Schools Test Results (a truncated version of roughly the first 1000 lines, for testing). The first lines of the output file would be:
School,Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
01M015,P.S. 015 ROBERTO CLEMENTE,3,2019,All Students,27,606,1,3.7,7,25.9,18,66.7,1,3.7,19,70.4
01M019,P.S. 019 ASHER LEVY,3,2019,All Students,24,606,0,0.0,8,33.3,15,62.5,1,4.2,16,66.7
01M020,P.S. 020 ANNA SILVER,3,2019,All Students,57,593,13,22.8,24,42.1,18,31.6,2,3.5,20,35.1
Hint: Since the Grade column contains a mixture of numbers (e.g. 3) and strings ("All Grades"), the column is stored as strings.
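The point about the Grade column being stored as strings can be illustrated with a short Pandas sketch. The function name select_grade_year and its parameters are hypothetical. The sketch assumes the Year column holds only integers and forces Grade to be read as a string, so the comparison is against '3' rather than the number 3.

```python
import pandas as pd


def select_grade_year(in_name, out_name, grade='3', year=2019):
    """Write the rows of in_name with the given Grade and Year to out_name.

    Hypothetical helper. Grade is read as a string because the column mixes
    numbers and text such as "All Grades"; Year is assumed to be numeric.
    """
    df = pd.read_csv(in_name, dtype={'Grade': str})
    # Combine the two conditions with &, each wrapped in parentheses.
    selected = df[(df['Grade'] == grade) & (df['Year'] == year)]
    selected.to_csv(out_name, index=False)  # index=False keeps the row index out of the file
    return selected


# Possible use with names typed in by the user:
#   in_name = input('Enter input file name: ')
#   out_name = input('Enter output file name: ')
#   select_grade_year(in_name, out_name)
```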
"""
Name: YOUR_NAME
Email: YOUR_EMAIL
Resources: RESOURCES USED
I attended lecture today.
Row: YOUR_ROW
Seat: YOUR_SEAT
"""
If you did not attend lecture, do not include the last three lines.
Using the re package in Python, find all links in the input file and store the link text and URL as columns, Title and URL, in the CSV file specified by the user. For the URL, strip off the leading https:// or http:// and any trailing slashes (/).
For example, if the input file is:
<html>
<head><title>Simple HTML File</title></head>
<body>
<p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>. </p>
<p> And for <a href="https://www.google.com/">google</a>
</body>
</html>
A sample run of the program on this file:
Enter input file name: simple.html
Enter output file name: links.csv
And the links.csv would be:
Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
google,www.google.com
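The link extraction in this example could be approached along the following lines. This is a sketch under assumptions, not the expected solution: the pattern only handles simple anchors of the form <a href="...">text</a> with double-quoted URLs, and the names LINK_PATTERN, extract_links, and write_links are made up for illustration.

```python
import csv
import re

# Assumed shape of a link: <a href="URL">TITLE</a>, with the URL in double quotes.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]*)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return a list of (title, url) pairs found in the HTML text."""
    rows = []
    for url, title in LINK_PATTERN.findall(html):
        url = re.sub(r'^https?://', '', url)  # strip a leading http:// or https://
        url = url.rstrip('/')                  # strip any trailing slashes
        rows.append((title, url))
    return rows


def write_links(rows, out_name):
    """Write the (title, url) pairs to a CSV file with a Title,URL header."""
    with open(out_name, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['Title', 'URL'])
        writer.writerows(rows)
```

Regular expressions are adequate for small, regular files like this one; HTML with nested tags or single-quoted attributes would need a more careful pattern or a parser.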
"""
Name: YOUR_NAME
Email: YOUR_EMAIL
Resources: RESOURCES USED
I attended lecture today.
Row: YOUR_ROW
Seat: YOUR_SEAT
"""
If you did not attend lecture, do not include the last three lines.
diceSim(D1,D2,trials) that takes as input the number of sides on die 1 (D1) and die 2 (D2) and the number of trials. Your function should repeatedly sum pairs of random numbers between 1 and D1 and 1 and D2 and keep track of how many times each sum occurs. The function returns a numpy array with the fraction of trials in which each sum of rolls occurred.
One run of the function, print(p22.diceSim(6,6,10000)), resulted in:
[0. 0. 0.0259 0.0615 0.0791 0.1086 0.139 0.1633 0.1385 0.114 0.0833 0.0587 0.0281]
The result can also be displayed using the code from Section 16.1.1.
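A simulation of this kind might look like the sketch below. It is one possible vectorized approach, not the required implementation. It assumes the returned array is indexed by the sum itself, so it has D1 + D2 + 1 entries and positions 0 and 1 are always 0, which matches the 13-entry sample output for two six-sided dice.

```python
import numpy as np


def diceSim(D1, D2, trials):
    """Return the fraction of trials in which each sum of two dice occurred.

    Index i of the result holds the fraction of rolls that summed to i,
    so the array has D1 + D2 + 1 entries (an assumption based on the sample output).
    """
    # randint's upper bound is exclusive, hence the + 1.
    rolls1 = np.random.randint(1, D1 + 1, size=trials)
    rolls2 = np.random.randint(1, D2 + 1, size=trials)
    # bincount tallies how often each sum occurs; minlength pads unseen sums with 0.
    counts = np.bincount(rolls1 + rolls2, minlength=D1 + D2 + 1)
    return counts / trials
```

Generating all rolls at once avoids an explicit Python loop; a loop that rolls and tallies one pair at a time would give the same distribution.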
"""
Name: YOUR_NAME
Email: YOUR_EMAIL
Resources: RESOURCES USED
I attended lecture today.
Row: YOUR_ROW
Seat: YOUR_SEAT
"""
If you did not attend lecture, do not include the last three lines.
The travel times are by driving, transit, and walking (files: nyc_landmarks_driving.csv, nyc_landmarks_transit.csv, nyc_landmarks_walking.csv).
Quizzes
Unless otherwise noted, quizzes focus on the corresponding programming assignment. The quizzes are 30 minutes long and cannot be repeated. They are available for the 24 hours after lecture and assess your programming skill using HackerRank. Access information for each quiz will be available under the Quizzes menu on Blackboard.
This first coding challenge focuses on reading and processing data from a file using core Python 3.6+ as in Program 1.
This quiz uses Pandas and focuses on manipulating and creating new columns in DataFrames as in Program 2.
This quiz focuses on aggregating in Pandas as in Program 3.
This quiz focuses on aggregating in Pandas as in Program 4.
This quiz focuses on regular expressions in Python as in Program 5.
This quiz focuses on computing statistical properties in Python as in Program 6.
This quiz focuses on computing errors with loss functions in Python as in Program 7.
This quiz focuses on computing errors with loss functions in Python as in Program 8.
This quiz focuses on categorical encoding and fitting models in Python as in Program 9.
This quiz focuses on linear separability and classifiers from Classwork 18 and Program 10.
This quiz focuses on intrinsic dimensions and dimensionality reduction techniques in Python as in Program 11.
This quiz focuses on clustering techniques in Python as in Program 12.
This quiz is an end-of-semester survey.
Homework
Unless otherwise noted, programs are submitted on the course's Gradescope site and are written in Python. The autograders expect a .py file and do not accept iPython notebooks.
Also, to receive full credit, the code should be compatible with Python 3.6 (the default for the Gradescope autograders).
For Gradescope registration issues, write to datasci@hunter.cuny.edu. Include in your email that you did not receive a Gradescope invitation and your preferred email, and we will manually generate an invitation. As a default, we use your name as it appears in Blackboard/CUNYFirst (to update CUNYFirst, see changing your personal information). If you prefer a different name for Gradescope, include it, and we will update the Gradescope registration.
For example, for the student, Thomas Hunter, the opening comment of his first program might be:
"""
Name: Thomas Hunter
Email: thomas.hunter1870@hunter.cuny.edu
Resources: Used python.org as a reminder of Python 3 print statements.
"""
and then followed by his Python program.
Learning Objective: to build competency with string and file I/O functionality of core Python.
Learning Objective: to refresh students' knowledge of Pandas' functionality to manipulate and create columns from formatted data.
Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting and joining data.
Learning Objective: to use regular expressions (pattern matching) with simple patterns to filter data from files.
Learning Objective: to use regular expressions to parse from log data.
Learning Objective: to reinforce Pandas skills by aggregating and cleaning data to use in map visualization, and summary statistics methods in Pandas.
Learning Objective: to enhance statistical skills and understanding by computing linear regression and loss functions.
Learning Objective: to give students practice implementing a model from start to finish and to strengthen understanding of model drift.
Learning Objective: to train and validate models, given quantitative and qualitative data, as well as assessing model quality.
Learning Objective: to train and validate models, given quantitative and qualitative data, as well as assessing model quality.
Learning Objective: to increase facility with standard linear algebra approaches and strengthen understanding of intrinsic dimensions of data sets via exploration of the classic digits dataset.
Learning Objective: to enhance data cleaning skills and build understanding of clustering algorithms.
Available Libraries: pandas, datetime, numpy, sklearn, and core Python 3.6+.
Learning Objective: To reinforce new SQL skills to query and aggregate data.
Available Libraries: pandas, pandasql, and core Python 3.6+.
Project
A final project is optional for this course.
Projects should synthesize the skills acquired in the course to analyze and visualize data on a topic of your choosing. It is your chance to demonstrate what you have learned, your creativity, and a project that you are passionate about. The intended audience for your project is your classmates as well as tech recruiters and potential employers.
Milestones
The project is broken down into smaller pieces that must be submitted by the deadlines below. For details of each milestone, see the links. The project is worth 20% of the final grade. The point breakdown is listed as well as the submission windows and deadlines. All components of the project are submitted via Gradescope unless otherwise noted.
Deadline: | Deliverables: | Points: | Submission Window Opens: |
---|---|---|---|
Monday, 28 February, noon | Opt-In | | 14 February |
Monday, 7 March, noon | Proposal | 50 | 1 March |
Monday, 4 April, noon | Interim Check-In | 25 | 14 March |
Friday, 29 April | Complete Project & Website | 100 | 5 April |
Monday, 9 May, noon | Presentation Slides | 25 | 14 April |
Total Points: | | 200 | |
(50 points)
The window for submitting proposals opens 1 March. If you would like feedback and the opportunity to resubmit for a higher grade, submit early in the window. Feel free to re-submit as many times as you like, up until the assignment deadline. The instructing team will work hard to give feedback on your submission as quickly as possible, and we will grade them in the order they were received.
The proposal is split into the following sections:
The following questions will guide you through some criteria you should be using to assess if the data you have is enough for a successful project.
Hint: Look ahead in the textbook at the chapters on "Linear Modeling" and "Multiple Linear Modeling" for the running examples of models.
Thus, a major part of this final project will center around making the following three types of visualizations with the data you choose. If your data cannot support all three types of visualizations, then please consider choosing another dataset.
(25 points)
Instructions can be found on Gradescope, under "2. Project Interim Check-In".
(100 points)
Submission Instructions:
For example, for the student, Thomas Hunter, the opening comment of his project might be:
"""
Name: Thomas Hunter
Email: thomas.hunter1870@hunter.cuny.edu
Resources: Used python.org as a reminder of Python 3 print statements.
Title: My project
URL: https://www.myproject.com
"""
and then followed by the rest of the Python scripts.
The Gradescope Autograder will check for the Python file and that it includes the title, the resources, and the URL of your website. After the submission deadline, the code and the website will be graded manually for code quality and data science inference.
For manual grading of the project, we are grading for the following:
(25 points)
For the last part of the project, include two slides that serve as a graphical overview ("lightning talk" slides) of your project. You should submit to Gradescope, under "4. Project Presentation Slides" a pdf file that contains two slides that summarize your project:
The final exam has two parts:
Logistics:
Exam Rules:
Preparing: The exam covers the material covered in lecture and classwork, programming assignments and quizzes, as well as the reading. The coding exam follows the same style as the quizzes. To prepare:
Logistics:
Exam Rules:
Format and Preparing: