(Created with wordle with text from wiki)

CMP 464-C401/MAT 456-01: Topics Course: Data Science

Spring 2016
Tuesdays & Thursdays: 11am-12:40pm
Prof. Katherine St. John
Email: stjohn AT lehman cuny edu
Office Hours


Useful Links:


Date:          Topics: Handouts: Reading: Quiz Topics: HW/Project:
2 February
First Day Details, Topics Overview, Python 2 vs. 3, Python Refresher: basics; Quick look at matplotlib's line and bar charts; Syllabus, DS Venn diagram,
Gallery: NY density, nearest airport, citibike, precincts, buses vs. subways, transit + census, life spans, ebola, disease, jobs;
Printing (from __future__), Plotting recipes, matplotlib, Textbook's repo
Academic Integrity Policy,
Chapters 1-3
#1: Academic Integrity
4 February
More on matplotlib: histograms and scatterplots; Data as vectors: scaling, dot products; Means & Variance;
Python Refresher: list comprehensions & zip
list comprehension examples, matplotlib, Textbook's repo, summaries sometimes hide the big picture, Anscombe's Quartet Chapters 2,4,5
4 February Last day to drop without "WD" grade
9 February No class: Classes follow a Friday schedule
11 February
Statistics: Basics;
Python Refresher: lists, tuples, & dictionaries
weather.py, lymeScaled.py, lists vs. tuples, basic stats, dictionary examples Chapters 2,5 #2: Python Basics HW #1: Simple graphs with pyplot
16 February
More on Stats: Correlation & Causation, Simpson's Paradox;
Getting Data: CSV Files
book's statistics.py (depends on linear_algebra.py), Simpson's paradox wiki, wage growth paradox, simple csv example & data Chapters 2,6,9 #3: Vectors, Means, and Variances HW #2: Scaling Vector Data
18 February
Probability: Distributions & Central Limit Theorem;
Python Refresher: collections
dsWiki.txt (for group work), normal distribution calculator, rolling dice, Central Limit Theorem Visualized, Matt Nedrich on CLT Chapters 2,6
18 February Last day to drop with "WD" grade
23 February
Bayes Theorem; Naive Bayes: Spam Filter Example;
Python Refresher: regular expressions
regex cheat sheet, book's naive Bayes spam filter, spam dataset Chapters 2,6,13 #4: Python Lists, Dictionaries, & csv HW #3: Binning Data & Measuring Dispersion
25 February
More on Bayes Theorem; Hypothesis & Inference; Applications;
Python Refresher: more on matplotlib & sets
book's naive Bayes spam filter, spam dataset, twoPlots.py, subplots Chapters 2,7,8
1 March
Hypothesis & Inference: Confidence Intervals;
Python Refresher: more on matplotlib
Khan Academy on hypothesis testing, normal distribution calculator Chapters 2,7 #5: Correlation & Bayes Theorem HW #4: Correlations & Distributions
3 March
More on Confidence Intervals, A/B Testing;
Python Refresher: numpy
Khan Academy on confidence intervals, numpy, plotting revisited Chapters 7,25
8 March
Manipulating image files with numpy;
Gradient Descent: Estimating, Choosing Right Step Size
scipy lecture notes on arrays, arrays & images, Matt Nedrich's intro to gradient descent & example, Quinn Liu's gradient descent image, 3d surface example code, mplot3d tutorial, matplotlib colormaps Chapters 8,9,25 #6: Regular Expressions HW #5: Bayes Theorem, Simpson's Paradox, & Regular Expressions
10 March
More on gradient descent; Example: Simple Linear Regression;
Geographical maps in matplotlib: basemap
Matt Nedrich's intro to gradient descent & example, Andrew Ng's linear regression notes;
basemap, basemap introduction
Chapters 2,8,9
15 March
Linear Algebra Refresher: Eigenvalues & Eigenvectors;
Using standard data formats: ESRI's shapefiles, JSON, KML;
More on basemap: using shapefiles;
Eigenvectors & eigenvalues, visually, linear transformations example;
ESRI's shapefiles, shapefile wikipage, json, KML, summary & comparison, gdal conversion tools,
NYC shapefiles, shapefiles in basemap tutorial, shapefiles in basemap
Chapters 9,10 #7: Hypothesis & Inference HW #6: A/B Testing
17 March
Using github;
Working with Data: Exploring and Visualizing;
More on Getting Data: scraping webpages, built-in methods, beautifulSoup;
Python Refresher: command line, args & kwargs
github for beginners, github Hello World, github student pack, github cheat sheet;
Anscombe's Quartet
beautifulSoup, soup documentation, where's beautifulSoup?, Frances Zlotnick's tutorial, DOM tutorial, book's code
Chapters 2,10,25
22 March
Working with Multidimensional Data: Rescaling, Principal Components Analysis;
Not from scratch: scipy, scikit-learn & Visualization
Python Refresher: iterators & generators
PCA, explained visually, Lindsay Smith's computing PCA, Sebastian Raschka's PCA overview and implementing in Python;
scipy, sklearn's PCA, pca on iris dataset, NY Fed's unemployment rates and by major
Chapters 2,10,25 #8: Gradient Descent & numpy HW #7: Gradient Descent & Images
24 March
Machine Learning: Modeling, Overfitting, Feature Extraction & Selection;
Python Refresher: lambdas & functions as arguments
generators in Python, lambdaSortingEx.py Chapters 2,11
29 March
Other plotting packages: D3 (javascript) and bokeh (python);
Distances for Multidimensional Data; k-Nearest Neighbors: Language Example, Curse of Dimensionality,
Python Refresher: exceptions
Data Driven Documents (D3), bokeh (D3 styled graphics in Python), bokeh quickstart, bokehPlottingEx.py, bokehChartEx.py
Chapters 2,11,12 #9: Eigenvectors & eigenvalues HW #8: Mapping Data & Markov Chains

Project: Proposal
31 March
Nearest Neighbors & Voronoi Diagrams;
Clustering: k-means
nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering,
Chapters 12,19
5 April
More on clustering: hierarchical clustering, Multidimensional Scaling (MDS)
k-means example, k-nearest-neighbor versus k-means, scikit-learn clustering, NYC Schools, MS data (for in class)
scikit's MDS, Noel O'Boyle's map example, Zachary Nichols' NYC scaled to commute time and part 2
Chapters 10,19 #10: Using github & beautifulSoup HW #9: Shading Maps & PCA
7 April
Linear Regression, revisited; Multiple Regression
regression recap Chapters 14-15
11 April Last day to drop with "W" grade
12 April
More on Regression: The Bootstrap, Logistic Regression;
Support Vector Machines
logistic regression wiki, Marcel Caraciolo's university entrance example, dummies on iris data set, sklearn logistic regression, 311 Requests (filter for Descriptor = "Pothole"),
bootstrapping wiki, Auckland animation re-sampling from sample vs. samples
Chapter 16 #11: PCA HW #10: Nearest Neighbors

Project: Timeline
14 April
More on SVMs;
Natural Language Processing (NLP)
SVM intro, sklearn ML introduction, sklearn svm, face recognition, sklearn ML intro, sklearn ML advanced
Chapters 16,20
19 April
More on NLP;
Decision Trees
wordle, Google's ngram viewer, Norvig's ngrams
wiki decision trees, sklearn decision trees
Chapters 17,20 #12: Nearest Neighbors & Clustering HW #11: Voronoi Diagrams & Clustering

Project: Data Collection
21 April
Refresher: Trees & Graphs;
Network Analysis
networkx tutorial, Cambridge tutorial, graph review Chapter 21
22-30 April Spring Recess: No Classes
3 May
Recommender Systems
Neural Networks
book's network analysis script, networkx built-in graphs, Knuth miles data, deep learning tutorial (Stanford), neural net wiki Chapters 18, 22 #13: Regression & NLP HW #12: MDS & Regression

Project: Analysis
Project: Visualization & Draft Slide
5 May
MapReduce & PageRank
PageRank as applied lin. alg. (SIAM Review 2006) Chapter 23
10 May
Crash Course in SQL
Khan Academy on SQL, sqlitebrowser, sqlite, SQL lab Chapter 24 Complete Project
12 May
Not from scratch: IPython (Jupyter), pandas, and seaborn
Thomas Wiecki's modern guide to data science, OpenTechSchool IPython tutorial,
pandas cookbook, cheat sheet,
seaborn, elevator data
Chapter 25
17 May
Project Presentations Project Sneak Preview Slide
19-20 May Reading Days (no class)
24 May
Optional Review (meets in Gillet 137)
26 May
Final Examination (required)

(Last updated: 20 May 2016)