4 Using_Sklearn

This page comes from a Jupyter Notebook, located at https://github.com/amygoldlist/BusinessAnalytics/tree/main/AI_management

Using Sklearn

In [1]:
##We should start by loading any packages we will need, as well as any data.

## We can import a whole package, but I'm only going to import the parts we need:

## sklearn things we need:
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets

# numpy and friends
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

## We don't need all of this, but it's good to get anything I need in the very first block.
In [2]:
## Let's start with the handwritten numbers that we've seen a lot of:

digits = datasets.load_digits()

X = digits['data']   # this is the data with each 8x8 image "flattened" into a length-64 vector.
Y = digits['target'] # these are the labels (0-9). 

## Let's look at one number:
## I picked this index arbitrarily; you can set n to any whole number under 1000


n = 653

plt.imshow(digits['images'][n], cmap='Greys_r')
plt.title('This is a %d' % digits['target'][n])

print("The X values are a length-64 vector of pixel intensities:", X[n])
print("and the y is the value of the number", Y[n])
The X values are a length-64 vector of pixel intensities: [ 0.  0.  4. 10. 16. 16.  7.  0.  0.  3. 16. 13. 11. 16.  2.  0.  0.  1.
  3.  0. 10.  9.  0.  0.  0.  0.  5.  8. 14. 15. 13.  0.  0.  0. 15. 16.
 14. 12.  8.  0.  0.  0.  3. 12.  7.  0.  0.  0.  0.  0.  0. 15.  4.  0.
  0.  0.  0.  0.  3. 14.  1.  0.  0.  0.]
and the y is the value of the number 7
[Image: the 8x8 digit plotted in greyscale, titled "This is a 7"]
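To make the "flattened" idea concrete, here is a small sketch that checks a row of X really is the matching 8x8 image unrolled row by row. The names `images`, `restored`, and `same` are mine, not part of the notebook.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits['data']
images = digits['images']

n = 653  # same index as in the cell above

# Each row of X holds the 64 pixel values of one 8x8 image.
print(X.shape)       # (number of images, 64)
print(images.shape)  # (number of images, 8, 8)

# Reshaping a row back to 8x8 should recover the original image exactly.
restored = X[n].reshape(8, 8)
same = np.array_equal(restored, images[n])
print("Reshaped row matches the image:", same)
```

If this prints True for your n, you can move between the vector form (for the model) and the image form (for plotting) whenever you like.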
In [5]:
## The first step is to break the data into the training set, and the test set.  
## Luckily, sklearn has you covered:

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2)

## test_size=0.2 tells train_test_split that we want to reserve 20% of the data for the test (or validation) set.  Try it!  
## Now redo it with a test set at 30%
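Here is a small sketch of how test_size behaves, using made-up toy data instead of the digits so the numbers are easy to check by eye. The names `toy_X`, `toy_Y`, and the `tX_`/`tY_` variables are hypothetical, and `random_state=0` is just an arbitrary seed I chose so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)  # 10 made-up samples with 2 features each
toy_Y = np.arange(10)                 # 10 made-up labels

# random_state fixes the shuffle, so the split is the same on every run.
tX_train, tX_test, tY_train, tY_test = train_test_split(
    toy_X, toy_Y, test_size=0.3, random_state=0
)

print(len(tX_train), "training rows")
print(len(tX_test), "test rows")
```

Leaving out random_state (as in the cell above) gives a different random split each time you run the cell, which is why your error rates may not match mine exactly.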

Assignment Questions

  1. How big is the full data set? How many numbers are in the training set? How many are held back in the test set? You should have 30% in your test set.
  2. What is the L1 distance between X[15] and X[25]? (Manhattan)
  3. What is the L2 distance between X[15] and X[25]? (Euclidean)
  4. What is the L3 distance between X[15] and X[25]?
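If the Lp distances are new to you, here is a minimal sketch on two made-up vectors (not taken from the digits data). The helper `lp_distance` is my own illustration, not an sklearn function; it computes the sum of |a_i - b_i|^p and then takes the p-th root.

```python
import numpy as np

def lp_distance(a, b, p):
    """Minkowski (Lp) distance: (sum of |a_i - b_i|**p) ** (1/p)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

u = [0, 0]  # made-up vectors, chosen so the arithmetic is easy
v = [3, 4]

print("L1:", lp_distance(u, v, 1))  # 3 + 4
print("L2:", lp_distance(u, v, 2))  # square root of 9 + 16
print("L3:", lp_distance(u, v, 3))  # cube root of 27 + 64
```

The same function works on any two vectors of equal length, including rows of X.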
In [1]:
## Solutions
In [1]:
## Let's do it!
from sklearn.neighbors import KNeighborsClassifier

## Run this to learn about the function:
?KNeighborsClassifier

## FYI: Minkowski distance means the Lp norm, so we can set p = 2 for Euclidean distance
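As a quick check on the defaults, this sketch asks a freshly created classifier for its own settings. `get_params()` is a standard method on sklearn estimators; the variable name `params` is mine.

```python
from sklearn.neighbors import KNeighborsClassifier

# help(KNeighborsClassifier) shows the same documentation as ?KNeighborsClassifier
# and also works outside Jupyter.

# get_params() returns the settings a model was created with, as a dictionary.
params = KNeighborsClassifier().get_params()
for name in ("n_neighbors", "metric", "p"):
    print(name, "=", params[name])
```

So if you don't pass anything, you get the Minkowski metric with p = 2, which is plain Euclidean distance.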
In [7]:
## For now, I'm going to set K = 15, and use the Euclidean or L2 norm.  

#Pick a name for your model -> I'm going to use a random word: cupcake

#step 1: let python know what algorithm I'm using:
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
#step 2: Train the model on ONLY THE TRAINING DATA (fit!)
cupcake.fit(X_train, Y_train)


## Now we have a trained model!  But is it a good model?
## I'm going to look at the errors in BOTH my training and test sets.  
## We do this by predicting what Y will be for each set, then comparing the predictions to the actual labels

Y_pred_train = cupcake.predict(X_train) 
Y_pred = cupcake.predict(X_test)

##record errors: the fraction of predictions that don't match the true labels
error_train = np.mean(Y_train != Y_pred_train)
error_test = np.mean(Y_test != Y_pred)

##Look it over:
print("The training error was: ", error_train)
print("The testing error was: ", error_test)
The training error was:  0.016701461377870562
The testing error was:  0.025
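The error rate can also be read off the model's built-in `score()` method, which returns accuracy (the fraction predicted correctly). Here is a self-contained sketch showing the two approaches agree. The names `model`, `error_by_hand`, and `error_from_score` are mine, and `random_state=0` is an arbitrary seed, so the exact numbers will differ from the output above.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits['data'], digits['target'], test_size=0.2, random_state=0  # arbitrary seed
)

model = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
model.fit(X_train, Y_train)

# Error rate by hand: the fraction of predictions that differ from the labels.
error_by_hand = np.mean(model.predict(X_test) != Y_test)

# score() returns accuracy, so the error rate is 1 - accuracy.
error_from_score = 1 - model.score(X_test, Y_test)

print("By hand:   ", error_by_hand)
print("From score:", error_from_score)
```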

Your turn:

For this week, we are going to turn a blind eye to the fact that we are experimenting on the test data. Try out different models! Look at different K values and different p values to see if you can find a combination that works better than the others. Can you automate this process using some loops or other tools?

Explain your best model, and hand it in. Expect that random chance (for example, which rows land in the test set) may affect different models differently.
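One possible way to automate the search is a pair of nested loops, sketched below. The K and p values listed are just examples I picked, `random_state=0` is an arbitrary seed, and the names `results`, `best_K`, `best_p`, and `best_error` are mine. Treat this as a starting point, not the answer.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits['data'], digits['target'], test_size=0.3, random_state=0  # arbitrary seed
)

results = []  # one (K, p, test error) tuple per model
for K in (1, 3, 5, 15):   # example K values; try your own
    for p in (1, 2):      # p = 1 is Manhattan, p = 2 is Euclidean
        model = neighbors.KNeighborsClassifier(n_neighbors=K, p=p)
        model.fit(X_train, Y_train)
        test_error = 1 - model.score(X_test, Y_test)
        results.append((K, p, test_error))
        print("K =", K, " p =", p, " test error =", round(test_error, 4))

# Pick the combination with the lowest test error.
best_K, best_p, best_error = min(results, key=lambda r: r[2])
print("Lowest test error: K =", best_K, ", p =", best_p)
```

Try changing random_state and rerunning: if the "best" model changes from seed to seed, the difference between those models is probably down to chance.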

In [ ]:
 

License

Business Analytics Copyright © by Amy Goldlist is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.