


Coursework
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022


Classwork    Quizzes    Homework    Project    Final Exam   

Classwork

Unless otherwise noted, classwork is submitted via Gradescope. Access information is given during the corresponding lecture.

Due to the internet issues in the lecture hall, for Classwork 2 onward, the classwork will be available until midnight. If you attended class that day, there is an option to earn 0.5 points for attendance and space to include the row and seat number. If you were not able to attend a given lecture, you can still work through the classwork at home and we will replace the fractional point for that classwork with the grade you earned on the final exam. Do not say you were in the room if you did not attend.

Classwork 0: Due midnight, Monday, 31 January.   Available on Gradescope, this classwork focuses on the course syllabus.
If you do not have access to the course on Gradescope, write to datasci@hunter.cuny.edu. Include in your email that you did not receive a Gradescope invitation and your preferred email, and we will manually generate an invitation.

Classwork 1: Due 4pm, Monday, 31 January.   Available during Lecture 1 on Gradescope (paper version also available for those without a phone or laptop at lecture), this classwork complements the exploratory data analysis of names and foreshadows the sampling of data in Lecture 2.

Classwork 2: Due midnight, Thursday, 3 February.   Available during Lecture 2 on Gradescope, this classwork introduces the autograder that is used for the programming assignments. The structure of the sample program mirrors the structure and content of the upcoming Program 1. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.
Note: Hunter College is committed to all students having the technology needed for their courses. If you are in need of technology, see Student Life's Support & Resources Page.

Write a function that takes the name of a file and makes a dictionary of the lines of the file.

For example, assuming these functions are in a file, cw2.py and run on a file containing names that start with 'A', contacts.txt:

contacts = cw2.make_dict('contacts.txt')
who = 'CS Department'
print(f'Contact info for {who} is {contacts[who]}.')
will print:
Contact info for CS Department is 10th Floor HN, x5213.

Another example with nick_names.txt:

nick_names = cw2.make_dict('nick_names.txt', sep = ' ')
names = ['Beth','Lisa','Meg','Greta','Amy','Mia']
for n in names:
    print(f'Full name for {n} is {nick_names[n]}.')
will print:
Full name for Beth is Elizabeth.
Full name for Lisa is Elizabeth.
Full name for Meg is Margaret.
Full name for Greta is Margaret.
Full name for Amy is Amelia.
Full name for Mia is Amelia.
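
As a rough illustration of the behavior shown above, here is a minimal sketch of one possible make_dict. It assumes each line of the file has the form key, separator, value, and that the default separator is ': '. Both are assumptions, since the exact file format and default are not specified here.

```python
def make_dict(file_name, sep=': '):
    """Return a dictionary built from the lines of file_name.

    Assumptions (not from the assignment text): each line looks like
    key<sep>value, and the default separator is ': '.
    """
    entries = {}
    with open(file_name) as in_file:
        for line in in_file:
            line = line.strip()
            if sep not in line:
                continue  # skip blank or malformed lines
            # Split only on the first separator so values may contain it.
            key, value = line.split(sep, 1)
            entries[key] = value
    return entries
```

Splitting only on the first separator keeps values such as '10th Floor HN, x5213' intact even when they contain spaces or commas.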

If you attended lecture, include the last three lines in the introductory comment:

"""
Name:  YOUR_NAME
Email: YOUR_EMAIL
Resources:  RESOURCES USED
I attended lecture today.
Row:  YOUR_ROW
Seat:  YOUR_SEAT
"""
If you did not attend lecture, do not include the above lines.

Classwork 3: Due midnight, Monday, 7 February.   Available during Lecture 3 on Gradescope, this classwork asks that you write a program using Pandas and its file I/O. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.

Write a program that asks the user for the name of an input CSV file and the name of an output CSV file. The program should open the file name provided by the user. Next, the program should select rows where the field Grade is equal to 3 and the Year is equal to 2019 and write all rows that match those criteria to a new CSV file.

Then a sample run of the program:

Enter input file name: school-ela-results-2013-2019.csv
Enter output file name:  ela2013.csv
where the file school-ela-results-2013-2019.csv is extracted from NYC Schools Test Results (a truncated version of roughly the first 1000 lines, for testing). The first lines of the output file would be:
School,Name,Grade,Year,Category,Number Tested,Mean Scale Score,# Level 1,% Level 1,# Level 2,% Level 2,# Level 3,% Level 3,# Level 4,% Level 4,# Level 3+4,% Level 3+4
01M015,P.S. 015 ROBERTO CLEMENTE,3,2019,All Students,27,606,1,3.7,7,25.9,18,66.7,1,3.7,19,70.4
01M019, P.S. 019 ASHER LEVY,3,2019,All Students,24,606,0,0.0,8,33.3,15,62.5,1,4.2,16,66.7
01M020,P.S. 020 ANNA SILVER,3,2019,All Students,57,593,13,22.8,24,42.1,18,31.6,2,3.5,20,35.1
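
The selection step described above could be sketched with Pandas as follows. The function names select_rows and main are hypothetical, and comparing the columns as strings is an assumption made in case the Grade column also holds non-numeric labels.

```python
import pandas as pd

def select_rows(in_name, out_name, grade=3, year=2019):
    """Copy rows with the given Grade and Year from in_name to out_name."""
    df = pd.read_csv(in_name)
    # Assumption: compare as strings, so a Grade column that mixes
    # numbers with text labels still matches '3'.
    grade_match = df['Grade'].astype(str) == str(grade)
    year_match = df['Year'].astype(str) == str(year)
    matches = df[grade_match & year_match]
    # index=False keeps the output columns identical to the input columns.
    matches.to_csv(out_name, index=False)
    return matches

def main():
    # Prompts mirror the sample run; call main() to run interactively.
    in_name = input('Enter input file name: ')
    out_name = input('Enter output file name: ')
    select_rows(in_name, out_name)
```

If the Grade column were read as floating point (for example, because of missing values), the string comparison would need adjusting.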

Hints:

If you attended lecture, include the last three lines in the introductory comment:

"""
Name:  YOUR_NAME
Email: YOUR_EMAIL
Resources:  RESOURCES USED
I attended lecture today.
Row:  YOUR_ROW
Seat:  YOUR_SEAT
"""
If you did not attend lecture, do not include the above lines.

Classwork 4: Due midnight, Thursday, 10 February.   Available during Lecture 4 on HackerRank, this classwork introduces the timed coding environment used for quizzes. This classwork mirrors the structure and content of the upcoming Quiz 1. To get the most out of this exercise, bring an electronic device on which you can easily type into a web-based IDE (possible on a phone, but much easier with the bigger screen and keyboards on some tablets and most laptops).
Note: Hunter College is committed to all students having the technology needed for their courses. If you are in need of technology, see Student Life's Support & Resources Page.

Classwork 5: Due midnight, Monday, 14 February.   Available during Lecture 5 on Gradescope, this classwork focuses on the structure and topics for the optional project, based on the project overview in lecture.

Classwork 6: Due midnight, Thursday, 17 February.   Available during Lecture 6 on Gradescope, this on-line assignment reviews the different ways to merge DataFrames in Pandas.

Classwork 7: Due midnight, Thursday, 24 February.   Available during Lecture 7 on Gradescope, this classwork introduces regular expressions for data cleaning. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.

Write a program that asks the user for the name of an input HTML file and the name of an output CSV file. Your program should use regular expressions (see Chapter 12.4 for using the re package in Python) to find all links in the input file and store the link text and URL as columns: Title and URL in the CSV file specified by the user. For the URL, strip off the leading https:// or http:// and any trailing slashes (/):

For example, if the input file is:


<html>
<head><title>Simple HTML File</title></head>

<body>
  <p> Here's a link for <a href="http://www.hunter.cuny.edu/csci">Hunter CS Department</a>
  and for <a href="https://stjohn.github.io/teaching/data/fall21/index.html">CSci 39542</a>.  </p>

  <p> And for <a href="https://www.google.com/">google</a>
</body>
</html>
  
Then a sample run of the program:
Enter input file name: simple.html
Enter output file name:  links.csv
And the links.csv would be:
Title,URL
Hunter CS Department,www.hunter.cuny.edu/csci
CSci 39542,stjohn.github.io/teaching/data/fall21/index.html
google,www.google.com
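
One way the link extraction above might be sketched with the re package is shown below. The helper names extract_links and write_links are hypothetical, and the pattern assumes simple, well-formed links with double-quoted href values, as in the sample file. It is not a general HTML parser.

```python
import csv
import re

# Assumption: links look like <a href="...">text</a> with a double-quoted href.
LINK_PATTERN = re.compile(r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                          re.IGNORECASE | re.DOTALL)

def extract_links(html):
    """Return a list of (title, url) pairs found in the HTML text."""
    rows = []
    for url, title in LINK_PATTERN.findall(html):
        # Strip the leading protocol and any trailing slashes.
        url = re.sub(r'^https?://', '', url).rstrip('/')
        rows.append((title.strip(), url))
    return rows

def write_links(rows, out_name):
    """Write the (title, url) pairs to a CSV file with a header row."""
    with open(out_name, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(['Title', 'URL'])
        writer.writerows(rows)
```

Reading the input file into a string, passing it to extract_links, and then calling write_links with the user's output file name would reproduce the sample run.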

If you attended lecture, include the last three lines in the introductory comment:

"""
Name:  YOUR_NAME
Email: YOUR_EMAIL
Resources:  RESOURCES USED
I attended lecture today.
Row:  YOUR_ROW
Seat:  YOUR_SEAT
"""
If you did not attend lecture, do not include the above lines.

Classwork 8: Due midnight, Monday, 28 February.   Available during Lecture 8 on Gradescope, this classwork focuses on the GeoJSON format, including hands-on activity with GeoJSON visual editor. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.

Classwork 9: Due midnight, Thursday, 3 March.   Available during Lecture 9 on Gradescope, this classwork is modeled on an analytic reasoning challenge of efficiently computing catchment areas (Voronoi diagrams) for NYC libraries. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 10: Due midnight, Monday, 7 March.   Available during Lecture 10 on Gradescope, this classwork builds intuition about how correlated datasets are. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 11: Due midnight, Thursday, 10 March.   Available during Lecture 11 on Gradescope, this classwork reviews probability distributions and sampling. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 12: Due midnight, Monday, 14 March.   Available during Lecture 12 on Gradescope, this classwork focuses on empirical analysis of random variables. To get the most out of this exercise, bring a laptop with you to lecture with a development environment (IDE) that has Python 3+ to work through in lecture.

Write a function:

Since the numbers are chosen at random, the fractions will differ some from run to run. One run of the function print(p22.diceSim(6,6,10000)) resulted in:


  [0.     0.     0.0259 0.0615 0.0791 0.1086 0.139  0.1633 0.1385 0.114  0.0833 0.0587 0.0281]
or displayed using the code from Section 16.1.1:
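
Separately from the display code in the text, the simulation itself could be sketched as below. The parameter names D1, D2, and trials are assumptions inferred from the sample call diceSim(6,6,10000) and its 13-entry output; the intended specification may differ.

```python
import numpy as np

def diceSim(D1, D2, trials):
    """Simulate rolling two dice and return the fraction of each sum.

    Assumptions: D1 and D2 are the numbers of sides on each die, and the
    result is indexed by the sum, from 0 up to D1 + D2.
    """
    rng = np.random.default_rng()
    # integers() excludes the upper bound, so add 1 to include the top face.
    rolls1 = rng.integers(1, D1 + 1, size=trials)
    rolls2 = rng.integers(1, D2 + 1, size=trials)
    # Count how often each sum occurred; sums 0 and 1 can never occur.
    counts = np.bincount(rolls1 + rolls2, minlength=D1 + D2 + 1)
    return counts / trials
```

Because the rolls are random, the returned fractions vary from run to run, but they always sum to 1.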

If you attended lecture, include the last three lines in the introductory comment:

"""
Name:  YOUR_NAME
Email: YOUR_EMAIL
Resources:  RESOURCES USED
I attended lecture today.
Row:  YOUR_ROW
Seat:  YOUR_SEAT
"""
If you did not attend lecture, do not include the above lines.

Classwork 13: Due midnight, Thursday, 17 March.   Available during Lecture 13 on Gradescope, this classwork focuses on gradient descent. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 14: Due 4pm, Monday, 21 March.   Available during Lecture 14 on Gradescope, this classwork reviews the first half of the course, following the topics in DS100 Fall 19 Midterm. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 15: Due 4pm, Thursday, 24 March.   Available during Lecture 15 on Gradescope, this classwork focuses on regularization. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 16: Due 4pm, Monday, 28 March.   Available during Lecture 16 on Gradescope, this classwork focuses on computing loss functions. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 17: Due 4pm, Thursday, 31 March.   Available during Lecture 17 on Gradescope, this classwork focuses on nominal data. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 18: Due 4pm, Monday, 4 April.   Available during Lecture 18 on Gradescope, this classwork focuses on linear separability and classification. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 19: Due 4pm, Thursday, 7 April.   Available during Lecture 19 on Gradescope, this classwork focuses on linear algebra. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 20: Due 4pm, Monday, 11 April.   Available during Lecture 20 on Gradescope, this classwork focuses on intrinsic dimensionality. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 21: Due 4pm, Thursday, 14 April.   Available during Lecture 21 on Gradescope, this classwork focuses on distance metrics. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Using Google Maps API, we generated the amount of time it would take to travel between the following landmarks:

by driving, transit, and walking (files: nyc_landmarks_driving.csv, nyc_landmarks_transit.csv, nyc_landmarks_walking.csv ).

Of the three, which best estimates the (aerial) distance?
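
For comparison with the travel times, the aerial distance between two landmarks can be approximated by the great-circle distance. The sketch below uses the haversine formula. The function name and the choice of kilometers are assumptions, and the landmark coordinates would have to be supplied separately since they are not listed here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The resulting distances could then be compared against each travel-time file, for example by computing a correlation with numpy.corrcoef.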

Classwork 22: Due 4pm, Monday, 25 April.   Available during Lecture 22 on Gradescope, this classwork focuses on supervised vs. unsupervised learning. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 23: Due 4pm, Thursday, 28 April.   Available during Lecture 23 on Gradescope, this classwork focuses on clustering via K-means. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 24: Due midnight, Wednesday, 4 May.   Available during Lecture 24 on Gradescope, this classwork asks about your final examination plans (e.g. Do you need a laptop for the coding exam? Would you prefer a left-handed desk for the exam? Do you have accommodations from the Office of Accessibility? Do you need to take the exam(s) early?). If no survey is submitted, the default is that you will take the exams during the regularly assigned times at a right-handed desk and will bring your own laptop to the coding exam.

Classwork 25: Due 4pm, Thursday, 5 May.   Available during Lecture 25 on Gradescope, this classwork focuses on SQL. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 26: Due 4pm, Monday, 9 May.   Available during Lecture 26 on Gradescope, this classwork is a review for the final examinations. To get the most out of this exercise, bring a device with you that can access Gradescope's online assignments.

Classwork 27: Due 4pm, Thursday, 12 May.   Available during Lecture 27 on Gradescope, this classwork is on paper at your assigned seat for the exams (or in the front section, if you have requested to take either exam at an alternate time).




Quizzes

Unless otherwise noted, quizzes focus on the corresponding programming assignment. The quizzes are 30 minutes long and cannot be repeated. They are available for the 24 hours after lecture and assess your programming skill using HackerRank. Access information for each quiz will be available under the Quizzes menu on Blackboard.

Quiz 1: Core Python. Due 4pm, Friday, 11 February.   Link to access HackerRank available at the end of Lecture 4 (posted on Blackboard).
This first coding challenge focuses on reading and processing data from a file using core Python 3.6+ as in Program 1.

Quiz 2: Pandas Basics. Due 4pm, Friday, 18 February.   Link to access HackerRank available at the end of Lecture 6 (posted on Blackboard).
This quiz uses Pandas and focuses on manipulating and creating new columns in DataFrames as in Program 2.

Quiz 3: Aggregating in Pandas. Due 4pm, Friday, 25 February.   Link to access HackerRank available at the end of Lecture 7 (posted on Blackboard).
This quiz focuses on aggregating in Pandas as in Program 3.

Quiz 4: Datetime. Due 4pm, Friday, 4 March.   Link to access HackerRank available at the end of Lecture 9 (posted on Blackboard).
This quiz focuses on aggregating in Pandas as in Program 4.

Quiz 5: Regular Expressions. Due 4pm, Friday, 11 March.   Link to access HackerRank available at the end of Lecture 11 (posted on Blackboard).
This quiz focuses on regular expressions in Python as in Program 5.

Quiz 6: Summary Statistics. Due 4pm, Friday, 18 March.   Link to access HackerRank available at the end of Lecture 13 (posted on Blackboard).
This quiz focuses on computing statistical properties in Python as in Program 6.

Quiz 7: Loss Functions. Due 4pm, Friday, 25 March.   Link to access HackerRank available at the end of Lecture 15 (posted on Blackboard).
This quiz focuses on computing errors with loss functions in Python as in Program 7.

Quiz 8: Imputing Numerical Values & Fitting Models. Due 4pm, Friday, 1 April.   Link to access HackerRank available at the end of Lecture 17 (posted on Blackboard).
This quiz focuses on computing errors with loss functions in Python as in Program 8.

Quiz 9: Feature Engineering & Categorical Encoding. Due 4pm, Friday, 8 April.   Link to access HackerRank available at the end of Lecture 19 (posted on Blackboard).
This quiz focuses on categorical encoding and fitting models in Python as in Program 9.

Quiz 10: Linear Separability & Classifiers. Due 4pm, Friday, 15 April.   Due to the holiday on April 15, the link to access HackerRank will be available at 8am on Thursday, 14 April to allow students who are observing the holiday to be able to complete the quiz on Thursday.
This quiz focuses on linear separability and classifiers from Classwork 18 and Program 10.

Quiz 11: Dimensionality Reduction. Due 4pm, Friday, 29 April.   Link to access HackerRank available at the end of Lecture 23 (posted on Blackboard).
This quiz focuses on intrinsic dimensions and dimensionality reduction techniques in Python as in Program 11.

Quiz 12: Clustering. Due 4pm, Friday, 6 May.   Link to access HackerRank available at the end of Lecture 25 (posted on Blackboard).
This quiz focuses on clustering techniques in Python as in Program 12.

Quiz 13: End-of-Semester Survey. Due 4pm, Friday, 13 May.   Available on Blackboard at the end of Lecture 27.
This quiz is an end-of-semester survey.




Homework

Unless otherwise noted, programs are submitted on the course's Gradescope site and are written in Python. The autograders expect a .py file and do not accept iPython notebooks. Also, to receive full credit, the code should be compatible with Python 3.6 (the default for the Gradescope autograders).

All students registered by Monday, 26 January are sent a registration invitation to the email on record on their Blackboard account. If you did not receive the email or would like to use a different account, write to datasci@hunter.cuny.edu. Include in your email that you did not receive a Gradescope invitation and your preferred email, and we will manually generate an invitation. As a default, we use your name as it appears in Blackboard/CUNYFirst (to update CUNYFirst, see changing your personal information). If you prefer a different name for Gradescope, include it, and we will update the Gradescope registration.

To encourage starting early on programs, bonus points are given for early submission. A point a day, up to a total of 3 bonus points (10% of the program grade), is possible. The points are prorated by hour. For example, if you turn in the program 36 hours early, then the bonus points are: (36 hours/3 days)*3 points = (36 hours/72 hours)*3 points = 1.5 points.
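
The prorated bonus above can be written as a small function. This is only an illustration of the arithmetic; the function name and the clamping of values outside the 0 to 72 hour window are assumptions.

```python
def early_bonus(hours_early, max_bonus=3.0, window_hours=72.0):
    """Bonus points for submitting hours_early hours before the deadline."""
    # Assumption: late submissions earn no bonus and the bonus stops
    # growing once the submission is 3 days (72 hours) early.
    hours = min(max(hours_early, 0.0), window_hours)
    return hours / window_hours * max_bonus

print(early_bonus(36))  # 1.5, matching the worked example
```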

To get full credit for a program, the file must include in the opening comment:

For example, for the student, Thomas Hunter, the opening comment of his first program might be:

"""
Name:  Thomas Hunter
Email: thomas.hunter1870@hunter.cuny.edu
Resources:  Used python.org as a reminder of Python 3 print statements.
"""
and then followed by his Python program.



Program 1: Popular Names. Due noon, Thursday, 10 February.
Learning Objective: to build competency with string and file I/O functionality of core Python.

Program 2: Parking Tickets. Due noon, Thursday, 17 February.
Learning Objective: to refresh students' knowledge of Pandas' functionality to manipulate and create columns from formatted data.

Program 3: Restaurant Rankings. Due noon, Thursday, 24 February.
Learning Objective: students can successfully filter formatted data using standard Pandas operations for selecting and joining data.

Program 4: Restaurant Cleaning. Due noon, Thursday, 3 March.
Learning Objective: to use regular expressions (pattern matching) with simple patterns to filter data from files.

Program 5: Regex Logs. Due noon, Thursday, 10 March.
Learning Objective: to use regular expressions to parse log data.

Program 6: Housing Units. Due noon, Thursday, 17 March.
Learning Objective: to reinforce Pandas skills by aggregating and cleaning data for use in map visualization, and by using summary statistics methods in Pandas.

Program 7: Housing Model. Due noon, Thursday, 24 March.
Learning Objective: to enhance statistical skills and understanding by computing linear regression and loss functions.

Program 8: Yellow Taxi Data. Due noon, Thursday, 31 March.
Learning Objective: to give students practice implementing a model from start to finish and to strengthen understanding of model drift.

Program 9: Logistic Taxi. Due noon, Thursday, 7 April.
Learning Objective: to train and validate models, given quantitative and qualitative data, as well as assessing model quality.

Program 10: Classifying Digits. Due noon, Thursday, 14 April.
Learning Objective: to train and validate models, given quantitative and qualitative data, as well as assessing model quality.

Program 11: Digit Dimensions. Due noon, Thursday, 28 April.
Learning Objective: to increase facility with standard linear algebra approaches and strengthen understanding of intrinsic dimensions of data sets via exploration of the classic digits dataset.

Program 12: EMS Stations. Due noon, Thursday, 5 May.
Learning Objective: to enhance data cleaning skills and build understanding of clustering algorithms.
Available Libraries: pandas, datetime, numpy, sklearn, and core Python 3.6+.

Program 13: EMS Queries. Due noon, Thursday, 12 May.
Learning Objective: to reinforce new SQL skills to query and aggregate data.
Available Libraries: pandas, pandasql, and core Python 3.6+.





Project

A final project is optional for this course. Projects should synthesize the skills acquired in the course to analyze and visualize data on a topic of your choosing. It is your chance to demonstrate what you have learned, your creativity, and a project that you are passionate about. The intended audience for your project is your classmates as well as tech recruiters and potential employers.

The grade for the project is a combination of grades earned on the milestones (e.g. deadlines during the semester to keep the projects on track) and the overall submitted program. If you choose not to complete the project, your final exam grade will replace its portion of the overall grade.

Milestones

The project is broken down into smaller pieces that must be submitted by the deadlines below. For details of each milestone, see the links. The project is worth 20% of the final grade. The point breakdown is listed as well as the submission windows and deadlines. All components of the project are submitted via Gradescope unless otherwise noted.

Deadline                      Deliverables                 Points   Submission Window Opens
Monday, 28 February, noon     Opt-In                                14 February
Monday, 7 March, noon         Proposal                     50       1 March
Monday, 4 April, noon         Interim Check-In             25       14 March
Friday, 29 April, midnight    Complete Project & Website   100      5 April
Monday, 9 May, noon           Presentation Slides          25       14 April
                              Total Points:                200




Project Opt-In

Review the following FAQs before filling out the Project Opt-In form (available on Gradescope on 14 February).

Project Proposal

(50 points)

The window for submitting proposals opens 1 March. If you would like feedback and the opportunity to resubmit for a higher grade, submit early in the window. Feel free to re-submit as many times as you like, up until the assignment deadline. The instructing team will work hard to give feedback on your submission as quickly as possible, and we will grade them in the order they were received.

The proposal is split into the following sections:


Project Interim Check-In

(25 points)

Instructions can be found on Gradescope, under "2. Project Interim Check-In".



Final Project & Website Submission

(100 points)

Submission Instructions:

For example, for the student, Thomas Hunter, the opening comment of his project might be:


"""
Name:       Thomas Hunter
Email:      thomas.hunter1870@hunter.cuny.edu
Resources:  Used python.org as a reminder of Python 3 print statements.
Title:      My project
URL:        https://www.myproject.com
"""
and then followed by the rest of the Python scripts.

The Gradescope Autograder will check for the Python file and that it includes the title, the resources, and the URL of your website. After the submission deadline, the code and the website will be graded manually for code quality and data science inference.
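
Before submitting, it may help to check the opening comment yourself. The sketch below is not the actual autograder; it is a hypothetical self-check that assumes the fields appear one per line with the labels used in the example above.

```python
import ast

# Assumption: these labels follow the example opening comment above.
REQUIRED_FIELDS = ('Name:', 'Email:', 'Resources:', 'Title:', 'URL:')

def missing_fields(source):
    """Return the required labels not found in the module's opening comment."""
    doc = ast.get_docstring(ast.parse(source)) or ''
    lines = [line.strip() for line in doc.splitlines()]
    return [field for field in REQUIRED_FIELDS
            if not any(line.startswith(field) for line in lines)]
```

Calling missing_fields on the contents of your Python file returns an empty list when every label is present.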

For manual grading of the project, we are grading for the following:


Presentation Slides

(25 points)

For the last part of the project, include two slides that serve as a graphical overview ("lightning talk" slides) of your project. You should submit to Gradescope, under "4. Project Presentation Slides" a pdf file that contains two slides that summarize your project:

It's completely acceptable to re-use what you wrote on the website and the data visualizations you've submitted for "3. Final Project & Website Submission" here.






Final Examination

The final exam has two parts:

Both parts are required and are comprehensive, covering all the material of the course. For grading details, see the syllabus.

Coding Examination

The coding exam is on Monday, 16 May, 2:45-4pm, in the style of the weekly course quizzes.

Logistics:

Exam Rules:

Preparing: The exam covers the material covered in lecture and classwork, programming assignments and quizzes, as well as the reading. The coding exam follows the same style as the quizzes. To prepare:



Written Examination

The written exam is Monday, 23 May, 1:45-3:45pm.

Logistics:

Exam Rules:

Format and Preparing: