6 Cross Validation
Cross-Validation: Training more efficiently.
We will be using a technique called cross-validation. Basically:
- Shuffle the training set and chop it into [latex]k[/latex] parts.
- Train on [latex]k-1[/latex] parts and test on the remaining one – we call that part the validation set. The validation score is the one we are interested in.
- Repeat – set aside a different part and repeat.
- We will end up with [latex]k[/latex] different validation scores. We average those to estimate how well the model generalizes.
- Repeat the whole process for each candidate model, and pick the model with the best average score.
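The steps above can be sketched by hand with scikit-learn's `KFold` splitter; this is essentially what `cross_val_score` (used below) does for us. The choice of a depth-6 tree on the digits data mirrors the example later in this section:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn import tree, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']

## Shuffle and chop the data into k = 5 parts:
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

scores = []
for train_idx, val_idx in kf.split(X):
    ## Train on 4 parts, score on the 5th (the validation set):
    model = tree.DecisionTreeClassifier(max_depth=6, random_state=2023)
    model.fit(X[train_idx], Y[train_idx])
    scores.append(model.score(X[val_idx], Y[val_idx]))

print("Validation scores:", scores)
print("Average:", np.mean(scores))
```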
In [1]:
## Load up our libraries:
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn import neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import datasets
## and our data:
digits = datasets.load_digits()
X = digits['data']
Y = digits['target']
In [2]:
## and split it right away:
## Fixing the random seed for demonstration purposes:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)
In [3]:
## learn about cross-validation
?cross_val_score
In [8]:
## and start right away:
## Here we have a tree model of depth 6:
tree_model = tree.DecisionTreeClassifier(max_depth=6, random_state=2023)
## Train on 2/3 of the data, testing on the remaining third.
## Do this 3 times (cv=3):
CV_score = cross_val_score(tree_model, X_train, Y_train, cv=3)
print("We have 3 scores: ", CV_score)
## CV_score is an array, so we can use things like .mean() to find the average
print("Averaging them gives us an accuracy score of:", CV_score.mean())
We have 3 scores:  [0.80793319 0.76409186 0.78914405]
Averaging them gives us an accuracy score of: 0.7870563674321502
In [5]:
## when we actually make the model we get:
tree_1 = tree.DecisionTreeClassifier(max_depth = 6, random_state=2023)
tree_1 = tree_1.fit(X_train, Y_train)
Y_pred = tree_1.predict(X_test)
Y_pred_train = tree_1.predict(X_train)
print ("Training Accuracy is ", accuracy_score(Y_train,Y_pred_train))
print ("Testing Accuracy is ", accuracy_score(Y_test,Y_pred))
Training Accuracy is  0.8691718858733473
Testing Accuracy is  0.7944444444444444
Note that while there is still a gap between our training score and our test score, the testing accuracy is close to the average of the CV scores above.
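In practice, "repeat the whole process on a new model" usually means looping over candidate hyperparameters and comparing their average CV scores. A minimal sketch (the candidate depths here are arbitrary choices for illustration):

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import tree, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)

## Compare several tree depths by their average 5-fold CV score:
for depth in (2, 4, 6, 8):
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=2023)
    scores = cross_val_score(model, X_train, Y_train, cv=5)
    print("depth =", depth, " average CV score =", scores.mean())
```

Note that the test set is never touched while comparing depths; it stays in reserve for a final check of the chosen model.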
In [6]:
## Let's try on a better model:
## Remember cupcake from earlier?
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
CV_score = cross_val_score(cupcake, X_train, Y_train, cv=5)
print("We have 5 scores: ", CV_score)
print("Averaging them gives us an accuracy score of:", CV_score.mean())
We have 5 scores:  [0.97222222 0.96875    0.97212544 0.97909408 0.97560976]
Averaging them gives us an accuracy score of: 0.9735602981029811
In [7]:
## and the old fashioned way:
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
cupcake.fit(X_train,Y_train)
Y_pred_train = cupcake.predict(X_train)
Y_pred = cupcake.predict(X_test)
print ("Training Accuracy is ", accuracy_score(Y_train,Y_pred_train))
print ("Testing Accuracy is ", accuracy_score(Y_test,Y_pred))
Training Accuracy is  0.9805149617258176
Testing Accuracy is  0.9833333333333333
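scikit-learn also automates the "compare models by CV score" loop with `GridSearchCV`, which cross-validates every parameter combination and refits the best one. A sketch applied to the kNN model above (the candidate values of `n_neighbors` are arbitrary choices for illustration):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import neighbors, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)

## Try several values of n_neighbors, scoring each by 5-fold CV:
search = GridSearchCV(neighbors.KNeighborsClassifier(p=2),
                      {'n_neighbors': [1, 5, 15, 25]}, cv=5)
search.fit(X_train, Y_train)

print("Best parameters:", search.best_params_)
print("Best average CV score:", search.best_score_)
## Only now do we look at the held-out test set:
print("Test accuracy:", search.score(X_test, Y_test))
```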