Program 10:
CSci 39542: Introduction to Data Science
Department of Computer Science
Hunter College, City University of New York
Spring 2022

Program Description

Program 10: Classifying Digits. Due noon, Thursday, 14 April.
Learning Objective: to enhance model building and comparison skills, using standard packages.
Available Libraries: pandas, numpy, pickle, sklearn, and core Python 3.6+.
Data Sources:
MNIST dataset of hand-written digits, available in sklearn digits dataset.
Sample Datasets: sklearn digits dataset.

This program uses the canonical MNIST dataset of hand-written digits discussed in Lecture #18 and available in sklearn digits dataset:

The dataset has 1797 scans of hand-written digits. Each entry has the digit represented (target) as well as the 64 values representing the gray scale for the 8 x 8 image. The first 5 entries are:

The gray scales for the first 5 entries, each flattened to a one-dimensional array:

[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3. 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
[ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.  3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16. 16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]
[ 0.  0.  0.  4. 15. 12.  0.  0.  0.  0.  3. 16. 15. 14.  0.  0.  0.  0.  8. 13.  8. 16.  0.  0.  0.  0.  1.  6. 15. 11.  0.  0.  0.  1.  8. 13. 15.  1.  0.  0.  0.  9. 16. 16.  5.  0.  0.  0.  0.  3. 13. 16. 16. 11.  5.  0.  0.  0.  0.  3. 11. 16.  9.  0.]
[ 0.  0.  7. 15. 13.  1.  0.  0.  0.  8. 13.  6. 15.  4.  0.  0.  0.  2.  1. 13. 13.  0.  0.  0.  0.  0.  2. 15. 11.  1.  0.  0.  0.  0.  0.  1. 12. 12.  1.  0.  0.  0.  0.  0.  1. 10.  8.  0.  0.  0.  8.  4.  5. 14.  9.  0.  0.  0.  7. 13. 13.  9.  0.  0.]
[ 0.  0.  0.  1. 11.  0.  0.  0.  0.  0.  0.  7.  8.  0.  0.  0.  0.  0.  1. 13.  6.  2.  2.  0.  0.  0.  7. 15.  0.  9.  8.  0.  0.  5. 16. 10.  0. 16.  6.  0.  0.  4. 15. 16. 13. 16.  1.  0.  0.  0.  0.  3. 15. 10.  0.  0.  0.  0.  0.  2. 16.  4.  0.  0.]]
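Each row above is an 8 x 8 image flattened row by row. As a minimal sketch (the variable names flat and grid are illustrative, not part of the assignment), a flattened entry can be reshaped back into its grid to see the digit's outline:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# The fifth entry (index 4) is the last row printed above.
flat = digits.data[4]          # 64 gray-scale values in one dimension
grid = flat.reshape(8, 8)      # back to 8 rows of 8 values

print(f'label: {digits.target[4]}')
print(grid)
```

The reshaped grid holds the same values as digits.images[4], which stores the entry in its original 8 x 8 form.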

Our goal is to predict what number is represented by a vector in the data set. For example, the last line contains a handwritten number '4'. Each entry in the dataset is labeled by the number represented in its gray scale image. The labels range from 0 to 9. We will first build binary classifiers for the data when restricted to entries that are labeled 0 or 1, and then classify more diverse subsets.
To start, we will focus on entries that represent 0's and 1's. The first 10 from the dataset are displayed below:

Restricting to just 0's and 1's allows us to build binary classifiers: those distinguishing between two classes. This program employs some of the canonical techniques implemented in scikit-learn: logistic regression, naive Bayes, support vector machines, and random forests. We will then extend our classifications to larger sets. The function specifications are below:

For the examples, we first load in the digits dataset from sklearn:

#Using the digits data set from sklearn:
from sklearn import datasets
digits = datasets.load_digits()
print(digits.target)
print(type(digits.target), type(digits.data))
As we saw in lecture, the data set is labeled with the digit represented, and both the labels and the data are numpy arrays:
[0 1 2 ... 8 9 8]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Let's flatten the entries, using numpy's reshape function:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
print(f'The targets for the first 5 entries: {digits.target[:5]}')
print(data[:5])
The labels of the first five elements in our dataset and their flattened representation:
The targets for the first 5 entries: [0 1 2 3 4]
[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.
   3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16.
  16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.
   0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]
 [ 0.  0.  0.  4. 15. 12.  0.  0.  0.  0.  3. 16. 15. 14.  0.  0.  0.  0.
   8. 13.  8. 16.  0.  0.  0.  0.  1.  6. 15. 11.  0.  0.  0.  1.  8. 13.
  15.  1.  0.  0.  0.  9. 16. 16.  5.  0.  0.  0.  0.  3. 13. 16. 16. 11.
   5.  0.  0.  0.  0.  3. 11. 16.  9.  0.]
 [ 0.  0.  7. 15. 13.  1.  0.  0.  0.  8. 13.  6. 15.  4.  0.  0.  0.  2.
   1. 13. 13.  0.  0.  0.  0.  0.  2. 15. 11.  1.  0.  0.  0.  0.  0.  1.
  12. 12.  1.  0.  0.  0.  0.  0.  1. 10.  8.  0.  0.  0.  8.  4.  5. 14.
   9.  0.  0.  0.  7. 13. 13.  9.  0.  0.]
 [ 0.  0.  0.  1. 11.  0.  0.  0.  0.  0.  0.  7.  8.  0.  0.  0.  0.  0.
   1. 13.  6.  2.  2.  0.  0.  0.  7. 15.  0.  9.  8.  0.  0.  5. 16. 10.
   0. 16.  6.  0.  0.  4. 15. 16. 13. 16.  1.  0.  0.  0.  0.  3. 15. 10.
   0.  0.  0.  0.  0.  2. 16.  4.  0.  0.]]
We can restrict the dataset to just binary digits:
binaryDigits = [(d,t) for (d,t) in zip(data, digits.target) if t <= 1]
bd,bt = zip(*binaryDigits)
print(f'The targets for the first 5 binary entries: {bt[:5]}')
which will print out the labels for the first 5 binary entries:
The targets for the first 5 binary entries: (0, 1, 0, 1, 0)
Let's do the same with our function:
bin_dig, bin_tar = select_data(data, digits.target)
print(f'The targets for the first 5 binary entries: {bin_tar[:5]}')
which will print out the labels for the first 5 binary entries:
The targets for the first 5 binary entries: (0, 1, 0, 1, 0)
We can also select for other sets of labels:
#Selecting on 6's and 7's:
dig67, tar67 = select_data(data, digits.target, labels=[6, 7])
print(f"The targets for the first 5 6's & 7's entries: {tar67[:5]}")
#Selecting on evens:
dig_even, tar_even = select_data(data, digits.target, labels=[0, 2, 4, 6, 8])
print(f"The targets for the first 5 even entries: {tar_even[:5]}")
which will print:
The targets for the first 5 6's & 7's entries: (6, 7, 6, 7, 6)
The targets for the first 5 even entries: (0, 2, 4, 6, 8)
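The selection step above can be sketched with a boolean mask. This is only one possible approach, written under stated assumptions: the default labels=(0, 1) is inferred from the call without a labels argument, and this version returns numpy arrays, whereas the sample output above shows tuples, so the required return type may differ.

```python
import numpy as np
from sklearn import datasets

def select_data(data, target, labels=(0, 1)):
    """Sketch: keep only the entries whose label appears in `labels`.

    Assumes data and target are numpy arrays of equal length.
    """
    keep = np.isin(target, labels)      # one True/False flag per entry
    return data[keep], target[keep]

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

bin_dig, bin_tar = select_data(data, digits.target)
dig67, tar67 = select_data(data, digits.target, labels=[6, 7])
print(bin_tar[:5], tar67[:5])
```

A mask keeps the original order of the entries, so the first five selected labels agree with the zip-based version shown earlier.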

Using our functions to restrict the data and targets datasets to 0's and 1's, we can split the data and fit and test the various models:

x_train, x_test, y_train, y_test = split_data(bin_dig, bin_tar, test_size=0.5)
log_pkl = fit_model(x_train, y_train)
y_predict = predict_model(log_pkl, x_train)
log_cmatrix = score_model(log_pkl, x_test, y_test)
print(f'prediction: {y_predict}\nconfusion matrix:\n {log_cmatrix}')
will print:
prediction: [1 1 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1]
confusion matrix:
[[89  0]
 [ 1 90]]
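One way to read the matrix above, assuming it follows the sklearn.metrics.confusion_matrix convention (rows are true labels, columns are predicted labels): the diagonal counts correct predictions, and the single off-diagonal entry is a true 1 predicted as 0. The accuracy follows directly:

```python
import numpy as np

# Confusion matrix reported above for logistic regression.
cm = np.array([[89, 0],
               [1, 90]])

correct = np.trace(cm)      # sum of the diagonal: predictions matching the label
total = cm.sum()            # every entry in the test set
accuracy = correct / total

print(f'{correct} of {total} correct, accuracy = {accuracy:.4f}')
# prints: 179 of 180 correct, accuracy = 0.9944
```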
The logistic regression model does extremely well, making only one wrong prediction. Let's see how each of the other models does with the same training and testing subsets:
for m in ['nbayes','svm','rforest']:
    log_pkl = fit_model(x_train,y_train, model_type=m)
    log_cmatrix = score_model(log_pkl,x_test,y_test)
    print(f'The confusion matrix for {m} is:\n {log_cmatrix}')
will print:
The confusion matrix for nbayes is:
 [[88  1]
 [ 1 90]]
The confusion matrix for svm is:
 [[89  0]
 [ 0 91]]
The confusion matrix for rforest is:
 [[89  0]
 [ 0 91]]
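The calls used above could be implemented along the following lines. This is a hedged sketch, not the assignment's reference solution: the choice of estimators (GaussianNB, SVC), their settings, the default arguments, and returning the model as pickled bytes are all assumptions suggested by the names log_pkl and model_type and by the list of available libraries.

```python
import pickle
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Assumed mapping from model_type strings to estimator factories.
MODELS = {
    'logreg': lambda: LogisticRegression(max_iter=5000),
    'nbayes': lambda: GaussianNB(),
    'svm': lambda: SVC(),
    'rforest': lambda: RandomForestClassifier(random_state=0),
}

def split_data(data, target, test_size=0.25, random_state=21):
    """Thin wrapper around train_test_split (default values are assumptions)."""
    return train_test_split(data, target, test_size=test_size,
                            random_state=random_state)

def fit_model(x_train, y_train, model_type='logreg'):
    """Fit the requested model and return it as pickled bytes."""
    model = MODELS[model_type]()
    model.fit(x_train, y_train)
    return pickle.dumps(model)

def predict_model(mod_pkl, xes):
    """Unpickle the model and predict a label for each row of xes."""
    return pickle.loads(mod_pkl).predict(xes)

def score_model(mod_pkl, xes, yes):
    """Unpickle the model and return its confusion matrix on (xes, yes)."""
    return confusion_matrix(yes, predict_model(mod_pkl, xes))

# Usage on the 0/1 subset of the digits data.
digits = datasets.load_digits()
mask = digits.target <= 1
x_train, x_test, y_train, y_test = split_data(
    digits.data[mask], digits.target[mask], test_size=0.5)

for m in MODELS:
    mod_pkl = fit_model(x_train, y_train, model_type=m)
    print(m, score_model(mod_pkl, x_test, y_test).tolist())
```

Because the split and the estimator settings here are guesses, the matrices this prints need not match the ones shown above.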

All of these models also do very well. Lowering the training set to just 10% of the available input data, we can see which model does best on the dataset of entries labeled 6 or 7 (in case of ties, return the first one that has that value).

best_mod, best_score = compare_models(dig67,tar67,test_size=0.9,random_state=22)
print(f"The best model for the 6 and 7's dataset is {best_mod} with score {best_score}.")
will print:
The best model for the 6 and 7's dataset is logreg with score 1.0.
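A comparison like the one above can be sketched as a loop over candidate models that keeps the first one with the highest score. The sketch below assumes the score is test-set accuracy, that the candidates are tried in the order logreg, nbayes, svm, rforest, and that the default arguments are as shown; none of these are confirmed by the assignment text.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def compare_models(data, target, test_size=0.25, random_state=21):
    """Sketch: fit each candidate on one split; return (name, accuracy) of the best."""
    x_train, x_test, y_train, y_test = train_test_split(
        data, target, test_size=test_size, random_state=random_state)
    candidates = [
        ('logreg', LogisticRegression(max_iter=5000)),
        ('nbayes', GaussianNB()),
        ('svm', SVC()),
        ('rforest', RandomForestClassifier(random_state=random_state)),
    ]
    best_name, best_score = None, -1.0
    for name, model in candidates:
        score = model.fit(x_train, y_train).score(x_test, y_test)
        if score > best_score:      # strict '>' keeps the earlier model on ties
            best_name, best_score = name, score
    return best_name, best_score

digits = datasets.load_digits()
best_mod, best_score = compare_models(digits.data, digits.target,
                                      test_size=0.9, random_state=22)
print(f'Best model: {best_mod} with score {best_score}.')
```

The strict comparison is what implements the tie rule: a later model replaces the current best only if it scores strictly higher.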
While we have focused on binary datasets, the classifiers from sklearn can also be trained for multiclass datasets. Let's try the classifiers on all digits and a training set of 10% of the data:
best_mod, best_score = compare_models(data, digits.target, test_size=0.9, random_state=22)
print(f"The best model for the full dataset is {best_mod} with score {best_score}.")
The SVM classifier did the best with:
The best model for the full dataset is svm with score 0.9375772558714462.
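As a quick arithmetic check, the reported score is consistent with plain accuracy on the held-out 90%. The sketch below assumes the split rounds the test set size up, as sklearn's train_test_split does for a fractional test_size; the count of 1517 correct predictions is inferred from the score, not stated in the text.

```python
import math

n_samples = 1797                        # entries in the digits dataset
n_test = math.ceil(0.9 * n_samples)     # size of the test set for test_size=0.9
n_correct = 1517                        # inferred: the count that matches the score

print(n_test, n_correct / n_test)
```

Under that reading, the SVM misclassified 101 of the 1618 test entries after training on only 179.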