CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022

For questions about the course, write to: datasci AT hunter cuny edu or visit office hours.

Announcements:


Calendar:

 Theme:                       Topics: Coursework: Reading:
Data Science Overview
#1: Monday,
31 January
Syllabus & Class Policies;

Data Science Lifecycle: Question Formulation, Data Acquisition and Cleaning, Exploratory Data Analysis, Prediction and Inference

Python Recap: basics, dictionaries, & keyword parameters
CW1
DS 100: Chapter 1 (Data Science Lifecycle),
Think CS: Chapter 12 (Dictionaries),
python.org: Section 4.7 (Functions)
#2: Thursday,
3 February
Data Scope, Big Data, Accuracy

Python Recap: file I/O and string methods
CW2
DS 100: Chapter 2 (Data Scope),
DS 100: Section 13.1 (String Methods),
Think CS: Chapter 11 (Files)
#3: Monday,
7 February
Theory for Data Design, Sampling Variation, Measurement Error

Data Representation, DataFrames (Pandas), Lambda Expressions & Applying Functions
CW3
DS 100: Chapter 3 (Data Design),
DS 100: Chapter 6 (DataFrames),
python.org: Section 4.7 (Functions)
#4: Thursday,
10 February
Modeling and Estimation

DataFrame Basics, List comprehensions & zips
CW4

Program 1
Quiz 1
DS 100: Chapter 4 (Modeling and Estimation),
DS 100: Chapter 6 (DataFrames),
Think CS: Section 10.23 (List Comprehensions),
Zip Tutorial (RealPython),
Constructing DataFrames (pydata.org)
Representation & Visualization
#5: Monday,
14 February
Data Representation: Structure & Granularity

Manipulating Data in Pandas: More on Subsetting & Aggregating DataFrames

Project Overview
CW5

DS 100: Chapter 6 (DataFrames),
DS 100: Section 8.5 (Table Shape & Granularity)
#6: Thursday,
17 February
Data Quality & Wrangling

Joining & Transforming Data in Pandas
CW6

Program 2
Quiz 2
DS 100: Chapter 6 (DataFrames),
DS 100: Chapter 9 (Data Wrangling)
21 February Presidents' Day: College Closed
#7: Thursday,
24 February
Features, Visualizing Qualitative & Quantitative Data

Regular Expressions
CW7

Program 3
Quiz 3
DS 100: Chapter 10 (Exploratory Data Analysis),
DS 100: Sections 11.1-11.3 (Data Visualization),
DS 100: Sections 13.2-13.3 (Regular Expressions)
#8: Monday,
28 February
Customizing Plots in matplotlib & seaborn

Time-Series Data, GeoJSON Format
CW8


Opt-in for Optional Project
DS 100: Sections 11.1-11.3 (Data Visualization),
Hands On ML (Matplotlib Tools),
GeoJSON Editor
#9: Thursday,
3 March
Visualization Principles

Visualizing GIS Data: Choropleth Maps, Voronoi Diagrams
CW9

Program 4
Quiz 4
DS 100: Chapter 11 (Data Visualization),
Folium documentation
Models & Loss Functions
#10: Monday,
7 March
Linear Models, Predicting Tip Amounts

Statistics Recap: Expectation, Variance, Correlation, & Sampling
CW10

Proposal for Optional Project
DS 100: Chapter 15 (Linear Models),
Seeing Theory (Brown U),
Guessing Correlation Coefficients (UBC),
Computing Correlations (Real Python)
#11: Thursday,
10 March
Probability and Generalization: Distributions, Probability Mass Functions

Sampling & Confidence Intervals
CW11

Program 5
Quiz 5
DS 100: Chapter 16 (Probability & Generalization),
Sampling from a Normally Distributed Population (UBC),
Confidence Intervals (UBC),
Residuals (UBC)
#12: Monday,
14 March
Central Limit Theorem

Fitting Models: Convexity, Least Squares, & Validating

Loss Functions: Mean Squared and Mean Absolute Error
CW12
Central Limit Theorem (UBC),
DS 8: Chapter 15 (Prediction),
DS 100: Sections 4.2-4.3 (Loss Functions)
Multiple Linear Modeling
#13: Thursday,
17 March
Gradient Descent

Multiple Linear Regression
CW13

Program 6
Quiz 6
DS 100: Chapter 17 (Gradient Descent),
Gradient Descent Visualization (Lili Jiang),
DS 100: Chapter 19 (Multiple Linear Regression)
#14: Monday,
21 March
Feature Engineering: Variable Transformations & Categorical Encoding; Bias-Variance Tradeoff

Code Demo: Walmart Sales
CW14
DS 100: Chapter 20 (Feature Engineering),
DS 100: Exam Resources
#15: Thursday,
24 March
Regularization: Ridge Regularization (L2) & Lasso Regularization (L1)

Classwork: Modeling & Regularization
CW15

Program 7
Quiz 7
DS 100: Chapter 20 (Feature Engineering),
DS 100: Chapter 21 (Bias-Variance Tradeoff),
DS 100: Chapter 22 (Regularization)
Classification
#16: Monday,
28 March
Regression on Probabilities; The Logistic Model & Loss Function;

Serializing and Evaluating Models
CW16
DS 100: Chapter 24 (Classification),
Confusion Matrices (sklearn),
Python Object Serialization Docs (Pickling)
#17: Thursday,
31 March
Using Logistic Models: Fitting & Evaluating a Logistic Model;

Cross Validation & Metrics
CW17

Program 8
Quiz 8
DS 100: Chapter 24 (Classification),
DS 100: Chapter 21 (Cross Validation)
#18: Monday,
4 April
One-Versus-Rest (OVR) Classification

Other Approaches: Naive Bayes, Support Vector Machines (SVMs), Decision Trees & Random Forests

CW18

Project: Interim Check-In
Python DS Handbook: Chapter 5 (SVMs),
Karpathy's SVM Demo (Stanford),
Data Camp Tutorial (SVMs),
SVMs (sklearn),
Recognizing Hand-Written Digits (sklearn)
Dimensionality Reduction
#19: Thursday,
7 April
Linear Algebra Recap: Vectors, Matrices, Eigenvectors & Eigenvalues

Principal Components Analysis
CW19

Program 9
Quiz 9
Explained Visually (Eigenvectors and Eigenvalues),
Explained Visually (Principal Components Analysis),
Linear Algebra Review (MIT),
DS 100: Section 26.1 (PCA Dimensions)
#20: Monday,
11 April
PCA as Dimensionality Reduction; Intrinsic Dimensionality (Scree Plots)

Multidimensional Scaling
CW20
DS 100: Chapter 26 (PCA),
Python DS Handbook: Section 5.09 (PCA),
Python DS Handbook: Section 5.10 (Manifold Learning)
#21: Thursday,
14 April
Non-Euclidean Distances;

Non-Linear Dimensionality Reduction: t-SNE, UMAP
CW21

Program 10
Quiz 10
Manifold Learning (sklearn),
Python DS Handbook: Section 5.10 (Manifold Learning)
15-22 April Spring Break: No Classes
Clustering
#22: Monday,
25 April
Supervised vs. Unsupervised Learning;

K-Means Clustering
CW22
Complete Project & Website
Supervised vs. Unsupervised Learning (IBM),
DS 100: Chapter 28 (Clustering),
K-Means gif (wiki),
Python DS Handbook: Section 5.11 (K-Means)
#23: Thursday,
28 April
K-Means: Clustering Complexity, Lloyd's Algorithm (Naive K-Means), MiniBatch K-Means

Gaussian Mixture Models
CW23

Program 11
Quiz 11
DS 100: Chapter 28 (Clustering),
Python DS Handbook: Section 5.11 (K-Means),
Python DS Handbook: Section 5.12 (Gaussian Mixture Models)
#24: Monday,
2 May
Other Clustering Approaches;

Replicability, P-Hacking, A/B testing
CW24

Cluster Analysis (wiki),
DS Chapter 25 (Replicability)
More on Structured Data
#25: Thursday,
5 May
Relational Databases and SQL

Code Demo: SQL in Python: setting up a database, basic SQL
CW25

Program 12
Quiz 12

DS 100: Chapter 7 (Relational Databases & SQL)
#26: Monday,
9 May
Relational Databases: Subsetting, Aggregating, Joining, & Transforming Data
CW26

DS 100: Chapter 7 (Relational Databases & SQL)
Review
#27: Thursday,
12 May
Project Showcase

Semester Review
CW27

Program 13
Quiz 13: End-of-semester Survey

Monday, 16 May, 2:45-4pm Final Exam: Coding
Monday, 23 May, 1:45-3:45pm Final Exam: Written
(This file was last modified on 27 May 2022.)