CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Fall 2021

Announcements:


Calendar:

 Theme:                       Topics: Coursework: Reading:
Data Science Overview
#1: Thursday,
26 August
Syllabus & Class Policies;
Data Science Lifecycle: Question Formulation, Data Acquisition and Cleaning, Exploratory Data Analysis, Prediction and Inference

Code Demo: Textbook: Predicting Ages from SSN Data

Code Demo: Python Recap: basics & standard packages (pandas, numpy, matplotlib, & seaborn), zips and list comprehensions
Q1: Academic Integrity

P1: Hello, world
P2: Senators' Names
DS 100: Chapter 1 (Data Science Lifecycle)
#2: Monday,
30 August
Guest Speaker: Elise Harris, Coordinator for Tech Internships and External Partnerships: Tech Internships

Exploratory Data Analysis, Generalizing from Data, Data Sampling, Probability Sampling

Classwork: Are Senators older than Representatives?

Code Demo: Python string methods
Q2: Python Recap

P3: Senators' Ages
P4: ELA Proficiency
DS 100: Chapter 2 (Data Scope),
DS 100: Chapter 3 (Data Design),
DS 100: Section 13.1 (Python String Methods)
Rectangular Data
#3: Thursday,
2 September
Data Representation: standard primitive types, rectangular data, data tables in Python (Pandas)

Code Demo: Regular Expressions

Guest Speaker: Provost Valeda Dent: The Rural Village Libraries Research Network project

Classwork: Measuring impact of libraries in NYC communities
Q3: Data Sampling

P5: URL Collection
P6: Regex on Restaurant Inspection Data
DS 100: Chapter 7 (Data Tables in Python)
DS 100: Sections 13.2-3 (Regular Expressions)
6 September Labor Day: College Closed
#4: Thursday,
9 September
Relational Databases and SQL

Code Demo: SQL in Python: setting up a database, basic SQL

Q4: Python Strings & Data Types

P7: Neighborhood Tabulation Areas
P8: Restaurant SQL Queries
DS 100: Chapter 6 (Relational Databases & SQL)
#5: Monday,
13 September
Aggregating Data in SQL and Pandas:

Code Demo: Revisiting Python Functions: Applying Functions to Tables
Q5: Coding Quiz

P9: Aggregating Restaurant Data (SQL)
P10: Extracting Districts
DS 100: Chapter 6 (Relational Databases & SQL),
DS 100: Chapter 7 (Data Tables in Python),
python.org: Section 4.7 (More on Defining Functions)
16 September No Class
#6: Monday,
20 September
Joining Data in SQL and Pandas

Classwork: Combining NYC schools data

Code Demo: Lambda Expressions

Project Overview
Q6: Regular Expressions

P11: Joining Restaurant & NTA Data
Project Pre-Proposal Window Opens
P12: MTA Ridership
DS 100: Chapter 6 (Relational Databases & SQL),
DS 100: Chapter 7 (Data Tables in Python),
python.org: Section 4.7 (More on Defining Functions)
Data Visualization
#7: Thursday,
23 September
Plotting Numerical & Categorical Data, Time-Series Data

Code Demo: Customizing Plots in matplotlib & seaborn

Classwork: Plotting MTA Ridership Data

Code Demo: Revisiting Python Functions: Defaults, Keywords, Unpacking Argument Lists
Q7: SQL

P13: Column Summaries
P14: Library Cleaning
DS 100: Chapter 8 (Data Representation),
DS 100: Chapter 9 (Data Quality),
DS 100: Sections 11.1-11.3 (Data Visualization)
Matplotlib Tools (Hands On ML)
#8: Monday,
27 September
Visualizing GIS Data, GeoJSON, Choropleth Maps, Voronoi Diagrams

Code Demo: Interactive Library Maps, School District Choropleth Maps

Classwork: GeoJSON Editor
Q8: Data Frames (Python)

P15: Plotting Challenge
Project Pre-Proposal
P16: Choropleth Attendance Cleaning
DS 100: Sections 11.1-11.3 (Data Visualization)
Folium documentation,
GeoJSON Editor
#9: Thursday,
30 September
Visualization Principles

Code Demo: Voronoi Diagrams

Classwork: Altair: declarative visualization techniques
Q9: Python Functions

P17: Grouping ELA/Math by Districts
P18: Altair Challenge
DS 100: Chapter 11 (Data Visualization),
Altair overview
Altair maps (gallery of case studies)
Models & Loss Functions
#10: Monday,
4 October
More on Modeling and Estimation, Introduction to Models & Loss Functions: Absolute and Huber Loss



Code Demo: Textbook: Restaurant Tips
Q10: Coding Quiz

P19: Modeling Restaurant Tips
Project: Title & Proposal
P20: Taxi Cleaning
DS 100: Sections 4.2-3 (Loss Functions)
#11: Thursday,
7 October
Linear Regression, Least Squares

Classwork: Predicting taxi tips & costs (NYC OpenData Yellow Taxi Data)
Q11: Data Visualization

P21: Taxi Tips
Project: Peer Review #1
P22: Dice Simulator
DS 100: Chapter 4 (Modeling Intro),
DS 100: Chapter 15 (Linear Models)
11 October No Class
#12: Thursday,
14 October
Expectation & Variance, Risk, Gradient Descent

Classwork: Simulating Randomness

Code Demo: Gradient Descent
Q12: Probability & Risk

P23: PMF of Senators' Ages
P24: Fitting LMs to Taxi Data
DS 8: Chapter 9 (Randomness),
DS 100: Chapter 16 (Probability & Generalization)
DS 100: Chapter 17 (Gradient Descent)
#13: Monday,
18 October
Stochastic Gradient Descent, Convexity, Fitting Models
Q13: Loss Functions

P25:
Project: Timeline
P26:
DS 100: Chapter 17 (Gradient Descent)
Multiple Linear Modeling
#14: Thursday,
21 October
Multiple Linear Regression
Q14: Linear Regression

P27: MLMs for Taxi Trips
P28:
DS 100: Chapter 19 (Multiple Linear Regression)
#15: Monday,
25 October
Feature Engineering, Bias-Variance Tradeoff

Code Demo: Predicting Ice Cream Ratings
Q15: Coding Quiz

P29:
P30:
DS 100: Chapter 20 (Feature Engineering)
DS 100: Chapter 21 (Bias-Variance Tradeoff)
#16: Thursday,
28 October
Regularization
Q16: Gradient Descent

P31:
P32:
DS 100: Chapter 22 (Regularization)
Classification
#17: Monday,
1 November
Regression on Probabilities; The Logistic Model & Loss Function

Classwork: Using Logistic Regression
Q17:

P33:
Project: Data Collection
P34:
DS 100: Chapter 24 (Classification)
#18: Thursday,
4 November
Using Logistic Models: Approximating the Empirical Probability Distribution; Fitting & Evaluating a Logistic Model
Q18:

P35:
P36:
DS 100: Chapter 24 (Classification)
#19: Monday,
8 November
Logistic Model: Multiclass Classification;
Support Vector Machines
Q19:

P37:
Project: Analysis
P38:
DS 100: Chapter 24 (Classification)
#20: Thursday,
11 November
Survey of Classifier Techniques
Q20: Coding Quiz

P39:
P40:
Dimensionality Reduction
#21: Monday,
15 November
Principal Components Analysis

Classwork: PCA Explained Visually
Q21:

P41:
Project: Visualization
P42:
Explained Visually (Principal Components Analysis)
Python Data Science Handbook: Section 5.9 (PCA)
DS 100: Section 26.1 (PCA Dimensions)
#22: Thursday,
18 November
PCA as Dimensionality Reduction
Multidimensional Scaling
Q22:

P43: Component Retention
P44: Digits Components
#23: Monday,
22 November
Non-Linear Dimensionality Reduction: t-SNE, UMAP

Code Demo: more dimensionality reduction (sklearn)
Q23:

P45: Mystery Point
P46:
Manifold Learning (sklearn)
25-26 November Thanksgiving Break: College Closed
Clustering
#24: Monday,
29 November
Q24:

P47:
Project: Draft Abstract & Website
P48:
#25: Thursday,
2 December
Q25: Coding Quiz

P49:
Project: Peer Review #2
#26: Monday,
6 December
Q26:

P50:
Project: Abstract
Replicability
#27: Thursday,
9 December
Replicability, P-Hacking, A/B testing

Classwork: A/B Testing
Q27:

Project: Website
Project: Video
DS 100: Chapter 25 (Replicability)
Review
#28: Monday,
13 December
Review
Q28: End-of-semester Survey

Monday, 20 December,
1:45-3:45pm
Final Exam
(This file was last modified on 17 September 2021.)