
## CSci 39542: Introduction to Data Science

Department of Computer Science, Hunter College, City University of New York, Fall 2021

### Announcements:

• The projects are posted (with consent of authors).
• The course is full, and we do not expect additional seats to become available for the fall term. We anticipate that the course will be offered again in the spring.
• Enrolled Students: for the quickest response to questions, use the links on the course Blackboard page:
  • For questions of general interest, use `Help::General Questions`.
  • For questions specific to you (e.g., resending Gradescope invitations or questions about grading), use `Help::Individual Questions`.

### Calendar:

Data Science Overview
#1: Thursday, 26 August
Syllabus & Class Policies
Data Science Lifecycle: Question Formulation, Data Acquisition and Cleaning, Exploratory Data Analysis, Prediction and Inference

Code Demo: Textbook: Predicting Ages from SSN Data

Code Demo: Python Recap: basics & standard packages (pandas, numpy, matplotlib, & seaborn), zips and list comprehensions

P1: Hello, world
P2: Senators' Names
DS 100: Chapter 1 (Data Science Lifecycle)
#2: Monday, 30 August
Guest Speaker: Elise Harris, Coordinator for Tech Internships and External Partnerships: Tech Internships

Exploratory Data Analysis, Generalizing from Data, Data Sampling, Probability Sampling

Classwork: Are Senators older than Representatives?

Code Demo: Python string methods
Q2: Python Recap

P3: Senators' Ages
P4: ELA Proficiency
DS 100: Chapter 2 (Data Scope),
DS 100: Chapter 3 (Data Design),
DS 100: Section 13.1 (Python String Methods)
Rectangular Data
#3: Thursday, 2 September
Data Representation: standard primitive types, rectangular data, data tables in Python (Pandas)

Code Demo: Regular Expressions

Guest Speaker: Provost Valeda Dent: The Rural Village Libraries Research Network project

Classwork: Measuring impact of libraries in NYC communities
Q3: Data Sampling

P5: URL Collection
P6: Regex on Restaurant Inspection Data
DS 100: Chapter 7 (Data Tables in Python)
DS 100: Sections 13.2-3 (Regular Expressions)
6 September Labor Day: College Closed
#4: Thursday, 9 September
Relational Databases and SQL

Code Demo: SQL in Python: setting up a database, basic SQL

Q4: Python Strings & Data Types

P7: Neighborhood Tabulation Areas
P8: Restaurant SQL Queries
DS 100: Chapter 6 (Relational Databases & SQL)
#5: Monday, 13 September
Aggregating Data in SQL and Pandas

Code Demo: Revisiting Python Functions: Applying Functions to Tables
Q5: Coding Quiz

P9: Aggregating Restaurant Data (SQL)
P10: Extracting Districts
DS 100: Chapter 6 (Relational Databases & SQL),
DS 100: Chapter 7 (Data Tables in Python),
python.org: Section 4.7 (More on Defining Functions)
16 September No Class
#6: Monday, 20 September
Joining Data in SQL and Pandas

Classwork: Combining NYC schools data

Code Demo: Lambda Expressions

Project Overview
Q6: Regular Expressions

P11: Joining Restaurant & NTA Data
Project Pre-Proposal Window Opens
P12: MTA Ridership
DS 100: Chapter 6 (Relational Databases & SQL),
DS 100: Chapter 7 (Data Tables in Python),
python.org: Section 4.7 (More on Defining Functions)
Data Visualization
#7: Thursday, 23 September
Plotting Numerical & Categorical Data, Time-Series Data

Code Demo: Customizing Plots in matplotlib & seaborn

Classwork: Plotting MTA Ridership Data

Code Demo: Revisiting Python Functions: Defaults, Keywords, Unpacking Argument Lists
Q7: SQL

P13: Column Summaries
P14: Library Cleaning
DS 100: Chapter 8 (Data Representation),
DS 100: Chapter 9 (Data Quality),
DS 100: Sections 11.1-11.3 (Data Visualization)
Matplotlib Tools (Hands On ML)
#8: Monday, 27 September
Visualizing GIS Data, GeoJSON, Choropleth Maps, Voronoi Diagrams

Code Demo: Interactive Library Maps, School District Choropleth Maps

Classwork: GeoJSON Editor
Q8: Data Frames (Python)

P15: Plotting Challenge
P16: Choropleth Attendance Cleaning
DS 100: Sections 11.1-11.3 (Data Visualization)
Folium documentation,
GeoJSON Editor
#9: Thursday, 30 September
Visualization Principles

Code Demo: Voronoi Diagrams

Classwork: Altair: declarative visualization techniques
Q9: Python Functions

P17: Grouping ELA/Math by Districts
P18: Log Scale
DS 100: Chapter 11 (Data Visualization),
Altair overview
Altair maps (gallery of case studies)
Models & Loss Functions
#10: Monday, 4 October
Probability Distributions

Loss Functions: Mean Squared and Mean Absolute Error

Code Demo: Textbook: Restaurant Tips
Q10: Coding Quiz

P19: Smoothing with Gaussians
Project Pre-Proposal
P20: Loss Functions for Tips
DS 100: Sections 4.2-3 (Loss Functions),
Seeing Theory (Brown U),
Sampling from a Normally Distributed Population (UBC)
#11: Thursday, 7 October
More on Loss Functions: Huber Loss, Properties of Different Loss Functions

Correlation, Linear Regression, Residuals
Q11: Data Visualization

P21: Taxi Cleaning
P22: Dice Simulator
DS 100: Chapter 4 (Modeling Intro),
DS 100: Chapter 15 (Linear Models),
DS 8: Chapter 15 (Prediction),
Guessing Correlation Coefficients (UBC),
Residuals (UBC)
11 October No Class
#12: Thursday, 14 October
Least Squares

Probability Overview: Distributions

Probability Mass Functions, Central Limit Theorem

Classwork: Predicting taxi tips & costs (NYC OpenData Yellow Taxi Data)
Q12: Loss Functions

P23: Correlation Coefficients
P24: Enrollments
DS 8: Chapter 9 (Randomness),
DS 100: Chapter 16 (Probability & Generalization)
Central Limit Theorem (UBC),
DS 100: Chapter 17 (Gradient Descent)
#13: Monday, 18 October
Recap: Confidence Intervals

Q13: Probability & Risk

P25: PMF of Senators' Ages
Project: Title & Proposal
P26: Weekday Entries
DS 100: Chapter 17 (Gradient Descent),
Confidence Intervals (UBC)
Multiple Linear Modeling
#14: Thursday, 21 October
Multiple Linear Regression; More on Gradient Descent
Q14: Linear Models & Gradient Descent

P27: Fitting OLS
P28: CS Courses
DS 100: Chapter 19 (Multiple Linear Regression)
#15: Monday, 25 October
Feature Engineering Overview

Code Demo: Walmart Sales
Q15: Coding Quiz

P29: Predictions with MLMs
Project: Peer Review #1
P30: Computing Ranges
DS 100: Chapter 20 (Feature Engineering)
#16: Thursday, 28 October
Feature Engineering: Variable Transformations; Bias-Variance Tradeoff

Classwork: Friday Attendance

Code Demo: Ice Cream Ratings
Q16: Review

P31: Sampling Distributions
P32: Attendance
DS 100: Chapter 20 (Feature Engineering)
DS 100: Chapter 21 (Bias-Variance Tradeoff)
Classification
#17: Monday, 1 November
Feature Engineering: Categorical Encoding; Overview of Regularization; Regression on Probabilities; The Logistic Model & Loss Function

Classwork: 311 Pothole Dataset
Q17: Feature Engineering

P33: Confidence Intervals
P34: Polynomial Features
DS 100: Chapter 22 (Regularization)
DS 100: Chapter 24 (Classification)
#18: Thursday, 4 November
Using Logistic Models: Approximating the Empirical Probability Distribution; Fitting & Evaluating a Logistic Model

Code Demo: Free Throws
Q18: Logistic Model

P35: Parking Tickets
P36: Multiple Locations
DS 100: Chapter 24 (Classification)
#19: Monday, 8 November
Evaluating Logistic Models; Confusion Matrices

Classwork: Binary Digit Classifier

Support Vector Machines (SVM)
Q19: Logistic Regression

P37: Score Predictor
Project Check-in #1 (Data Collection)
P38: Ticket Prep
DS 100: Chapter 24 (Classification)
Python Data Science Handbook: Chapter 5 (SVMs)
Karpathy's SVM Demo (Stanford)
#20: Thursday, 11 November
More on SVMs; One-Versus-Rest (OVR) Classification; Survey of Classifier Techniques

Classwork: Iris Classification
Q20: Coding Quiz

P39: Binary Digit Classifier
P40: Enrollment by Courses
DataCamp Tutorial (SVMs)
SVMs (sklearn)
SKLearn (Recognizing Hand-Written Digits)
Dimensionality Reduction
#21: Monday, 15 November
Linear Algebra Recap; Principal Components Analysis

Code Demo: Explained Visually: Eigenvectors & Eigenvalues
Q21: Classification

P41: Classifier Misses
Project Check-in #2 (Analysis)
P42: Ticket Predictor
Explained Visually (Eigenvectors and Eigenvalues)
Explained Visually (Principal Components Analysis)
Linear Algebra Review (MIT)
DS 100: Section 26.1 (PCA Dimensions)
#22: Thursday, 18 November
PCA as Dimensionality Reduction; Optimal Number of Components

Multidimensional Scaling (MDS)
Q22: Linear Algebra

P43: Moving
P44: Model Comparison
DS 100: Chapter 26 (PCA)
Python Data Science Handbook: Section 5.9 (PCA)
#23: Monday, 22 November
Non-Linear Dimensionality Reduction: t-SNE, UMAP

Project Update: Presentation Details: Abstract, Website, and Presentation
Q23: PCA

P45: Component Retention
Project Check-in #3 (Visualization)
P46: Digits Components
Manifold Learning (sklearn)
25-26 November Thanksgiving Break: College Closed
Clustering
#24: Monday, 29 November
Recap: Dimensionality Reduction

Code Demo: transit vs. Euclidean distances

K-Means Clustering
Q24: MDS

P47: Voting MDS
Project: Draft Abstract & Website
P48: Transit Distances
DS 100: Chapter 28 (Clustering)
Wiki K-Means (gif)
Python Data Science Handbook: Section 5.11 (K-Means)
#25: Thursday, 2 December
K-Means: Clustering Complexity, Lloyd's Algorithm (Naive K-Means), MiniBatch K-Means
Q25: Coding Quiz

P49: Toy Clusters
Project: Peer Review #2
DS 100: Chapter 28 (Clustering)
Python Data Science Handbook: Section 5.11 (K-Means)
#26: Monday, 6 December
Other Clustering Approaches

Supervised vs. Unsupervised Learning
Q26: K-Means Clustering

P50: 4-Coloring
Project: Abstract
DS 100: Chapter 28 (Clustering)
Wiki Cluster Analysis
Supervised vs. Unsupervised Learning (IBM)
Replicability
#27: Thursday, 9 December
Replicability, P-Hacking, A/B testing

Classwork: Coding Challenges (Core Python Recap)
Q27: Review

Project: Website
Project: Presentation Slides
DS 100: Chapter 25 (Replicability)
Review
#28: Monday, 13 December
Review
Q28: End-of-semester Survey

Monday, 20 December, 1:45-3:45pm
Final Exam