(Created with wordle with text from wiki)

CMP 464-C401/MAT 456-01: Topics Course: Data Science

Spring 2016
Tuesdays & Thursdays: 11am-12:40pm
Prof. Katherine St. John
Email: stjohn AT lehman cuny edu
Office Hours

Outline:

Date:          Topics: Handouts: Reading: Quiz Topics: HW/Project:
#1
2 February
First Day Details, Topics Overview, Python 2 vs. 3, Python Refresher: basics; Quick look at matplotlib's line and bar charts;
Syllabus, DS Venn diagram,
Gallery: NY density, nearest airport, citibike, precincts, buses vs. subways, transit + census, life spans, Ebola, disease, jobs;
Printing (from __future__), Plotting recipes, matplotlib, Textbook's repo
Academic Integrity Policy,
Chapters 1-3
#1: Academic Integrity
#2
4 February
More on matplotlib: histograms and scatterplots; Data as vectors: scaling, dot products; Means & Variance;
Python Refresher: list comprehensions & zip
list comprehension examples, matplotlib, Textbook's repo, summaries sometimes hide the big picture, Anscombe's Quartet
Chapters 2,4,5
4 February Last day to drop without "WD" grade
9 February No class: Classes follow a Friday schedule
#3
11 February
Statistics: Basics;
Python Refresher: lists, tuples, & dictionaries
weather.py, lymeScaled.py, lists vs. tuples, basic stats, dictionary examples
Chapters 2,5
#2: Python Basics
HW #1: Simple graphs with pyplot
#4
16 February
More on Stats: Correlation & Causation, Simpson's Paradox;
Getting Data: CSV Files
book's statistics.py (depends on linear_algebra.py), Simpson's paradox wiki, wage growth paradox, simple csv example & data
Chapters 2,6,9
#3: Vectors, Means, and Variances
HW #2: Scaling Vector Data
#5
18 February
Probability: Distributions & Central Limit Theorem;
Python Refresher: collections
dsWiki.txt (for group work), normal distribution calculator, rolling dice, Central Limit Theorem Visualized, Matt Nedrich on CLT
Chapters 2,6
18 February Last day to drop with "WD" grade
#6
23 February
Bayes' Theorem; Naive Bayes: Spam Filter Example;
Python Refresher: regular expressions
regex cheat sheet, book's naive Bayes spam filter, spam dataset
Chapters 2,6,13
#4: Python Lists, Dictionaries, & csv
HW #3: Binning Data & Measuring Dispersion
#7
25 February
More on Bayes' Theorem; Hypothesis & Inference; Applications;
Python Refresher: more on matplotlib & sets
book's naive Bayes spam filter, spam dataset, twoPlots.py, subplots
Chapters 2,7,8
#8
1 March
Hypothesis & Inference: Confidence Intervals;
Python Refresher: more on matplotlib
Khan Academy on hypothesis testing, normal distribution calculator
Chapters 2,7
#5: Correlation & Bayes' Theorem
HW #4: Correlations & Distributions
#9
3 March
More on Confidence Intervals, A/B Testing;
Python Refresher: numpy
Khan Academy on confidence intervals, numpy, plotting revisited
Chapters 7,25
#10
8 March
Manipulating image files with numpy;
Gradient Descent: Estimating, Choosing Right Step Size
scipy lecture notes on arrays, arrays & images, Matt Nedrich's intro to gradient descent & example, Quinn Liu's gradient descent image, 3d surface example code, mplot3d tutorial, matplotlib colormaps
Chapters 8,9,25
#6: Regular Expressions
HW #5: Bayes' Theorem, Simpson's Paradox, & Regular Expressions
#11
10 March
More on gradient descent; Example: Simple Linear Regression;
Geographical maps in matplotlib: basemap
Matt Nedrich's intro to gradient descent & example, Andrew Ng's linear regression notes;
basemap, basemap introduction
Chapters 2,8,9
#12
15 March
Linear Algebra Refresher: Eigenvalues & Eigenvectors;
Using standard data formats: ESRI's shapefiles, JSON, KML;
More on basemap: using shapefiles;
Eigenvectors & eigenvalues, visually, linear transformations example;
ESRI's shapefiles, shapefile wikipage, json, KML, summary & comparison, gdal conversion tools,
NYC shapefiles, shapefiles in basemap tutorial, shapefiles in basemap
Chapters 9,10
#7: Hypothesis & Inference
HW #6: A/B Testing
#13
17 March
Using github;
Working with Data: Exploring and Visualizing;
More on Getting Data: scraping webpages, built-in methods, beautifulSoup;
Python Refresher: command line, args & kwargs
github for beginners, github Hello World, github student pack, github cheat sheet;
Anscombe's Quartet
beautifulSoup, soup documentation, where's beautifulSoup?, Frances Zlotnick's tutorial, DOM tutorial, book's code
Chapters 2,10,25
#14
22 March
Working with Multidimensional Data: Rescaling, Principal Components Analysis;
Not from scratch: scipy, scikit-learn & Visualization
Python Refresher: iterators & generators
PCA, explained visually, Lindsay Smith's computing PCA, Sebastian Raschka's PCA overview and implementing in Python;
scipy, sklearn's PCA, pca on iris dataset, NY Fed's unemployment rates and by major
Chapters 2,10,25
#8: Gradient Descent & numpy
HW #7: Gradient Descent & Images
#15
24 March
Machine Learning: Modeling, Overfitting, Feature Extraction & Selection;
Python Refresher: lambdas & functions as arguments
generators in Python, lambdaSortingEx.py
Chapters 2,11
#16
29 March
Other plotting packages: D3 (JavaScript) and bokeh (Python);
Distances for Multidimensional Data; k-Nearest Neighbors: Language Example, Curse of Dimensionality,
Python Refresher: exceptions
Data Driven Documents (D3), bokeh (D3 styled graphics in Python), bokeh quickstart, bokehPlottingEx.py, bokehChartEx.py
Chapters 2,11,12
#9: Eigenvectors & eigenvalues
HW #8: Mapping Data & Markov Chains

Project: Proposal
#17
31 March
Nearest Neighbors & Voronoi Diagrams;
Clustering: k-means
nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering,
Chapters 12,19
#18
5 April
More on clustering: hierarchical clustering, Multidimensional Scaling (MDS)
k-means example, k-nearest-neighbor versus k-means, scikit-learn clustering, NYC Schools, MS data (for in class)
scikit's MDS, Noel O'Boyle's map example, Zachary Nichols' NYC scaled to commute time and part 2
Chapters 10,19
#10: Using github & beautifulSoup
HW #9: Shading Maps & PCA
#19
7 April
Linear Regression, revisited; Multiple Regression
regression recap
Chapters 14-15
11 April Last day to drop with "W" grade
#20
12 April
More on Regression: The Bootstrap, Logistic Regression;
Support Vector Machines
logistic regression wiki, Marcel Caraciolo's university entrance example, dummies on iris data set, sklearn logistic regression, 311 Requests (filter for Descriptor = "Pothole"),
bootstrapping wiki, Auckland animation re-sampling from sample vs. samples
Chapter 16
#11: PCA
HW #10: Nearest Neighbors

Project: Timeline
#21
14 April
More on SVMs;
Natural Language Processing (NLP)
SVM intro, sklearn ML introduction, sklearn svm, face recognition, sklearn ML intro, sklearn ML advanced
Chapters 16,20
#22
19 April
More on NLP;
Decision Trees
wordle, Google's ngram viewer, Norvig's ngrams
wiki decision trees, sklearn decision trees
Chapters 17,20
#12: Nearest Neighbors & Clustering
HW #11: Voronoi Diagrams & Clustering

Project: Data Collection
#23
21 April
Refresher: Trees & Graphs;
Network Analysis
networkx tutorial, Cambridge tutorial, graph review
Chapter 21
22-30 April Spring Recess: No Classes
#24
3 May
Recommender Systems
Neural Networks
book's network analysis script, networkx built-in graphs, Knuth miles data, deep learning tutorial (Stanford), neural net wiki
Chapters 18,22
#13: Regression & NLP
HW #12: MDS & Regression

Project: Analysis
Project: Visualization & Draft Slide
#25
5 May
MapReduce & PageRank
PageRank as applied lin. alg. (SIAM Review 2006)
Chapter 23
#26
10 May
Crash Course in SQL
Khan Academy on SQL, sqlitebrowser, sqlite, SQL lab
Chapter 24
Complete Project
#27
12 May
Not from scratch: iPython (jupyter), pandas, and seaborn
Thomas Wiecki's modern guide to data science, OpenTechSchool iPython tutorial,
pandas cookbook, cheat sheet,
seaborn, elevator data
Chapter 25
#28
17 May
Project Presentations
Project Sneak Preview Slide
19-20 May Reading Days (no class)
Tuesday
24 May
11am-1pm
Optional Review (meets in Gillet 137)
Thursday
26 May
11am-1pm
Final Examination (required)



(Last updated: 20 May 2016)