6 Cross Validation
Cross-Validation: Training more efficiently.
We will be using a technique called cross-validation. Basically:
- Shuffle the training set and chop it into [latex]k[/latex] parts.
- Train on [latex]k-1[/latex] parts and test on the remaining one – we call that part the validation set. The validation score is the one we are interested in.
- Repeat – set aside a different part and repeat.
- We will end up with [latex]k[/latex] different validation scores. We average those to estimate how well the model generalizes.
- Repeat the whole process for each candidate model, and pick the model with the best average score.
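The steps above can be sketched by hand with scikit-learn's `KFold` splitter; this is essentially what `cross_val_score` (used below) does for us. The choice of a depth-6 tree on the digits data mirrors the example later in this section:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn import tree, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']

## Shuffle and chop the data into k = 5 parts:
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

scores = []
for train_idx, val_idx in kf.split(X):
    ## Train on 4 parts, score on the 5th (the validation set):
    model = tree.DecisionTreeClassifier(max_depth=6, random_state=2023)
    model.fit(X[train_idx], Y[train_idx])
    scores.append(model.score(X[val_idx], Y[val_idx]))

print("Validation scores:", scores)
print("Average:", np.mean(scores))
```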
In [1]:
## Load up our libraries:
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn import neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import datasets
## and our data:
digits = datasets.load_digits()
X = digits['data']
Y = digits['target']
In [2]:
## and split it right away:
## Fixing the random seed for demonstration purposes:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)
In [3]:
## learn about cross-validation
?cross_val_score
In [8]:
## and start right away:
## Here we have a tree model of depth 6:
tree_model = tree.DecisionTreeClassifier(max_depth=6, random_state=2023)
## Train on 2/3 of the data, testing on the remaining third.
## Do this 3 times (cv=3):
CV_score = cross_val_score(tree_model, X_train, Y_train, cv=3)
print("We have 3 scores: ", CV_score)
## CV_score is an array, so we can use things like .mean() to find the average
print("Averaging them gives us an accuracy score of:", CV_score.mean())
We have 3 scores:  [0.80793319 0.76409186 0.78914405]
Averaging them gives us an accuracy score of: 0.7870563674321502
In [5]:
## when we actually make the model we get:
tree_1 = tree.DecisionTreeClassifier(max_depth = 6, random_state=2023)
tree_1 = tree_1.fit(X_train, Y_train)
Y_pred = tree_1.predict(X_test)
Y_pred_train = tree_1.predict(X_train)
print ("Training Accuracy is ", accuracy_score(Y_train,Y_pred_train))
print ("Testing Accuracy is ", accuracy_score(Y_test,Y_pred))
Training Accuracy is  0.8691718858733473
Testing Accuracy is  0.7944444444444444
Note that while there is still a gap between our training score and our test score, the testing accuracy is close to the average of the CV scores above.
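In practice, "repeat the whole process on a new model" usually means looping over candidate hyperparameters and comparing their average CV scores. A minimal sketch (the candidate depths here are arbitrary choices for illustration):

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import tree, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)

## Compare several tree depths by their average 5-fold CV score:
for depth in (2, 4, 6, 8):
    model = tree.DecisionTreeClassifier(max_depth=depth, random_state=2023)
    scores = cross_val_score(model, X_train, Y_train, cv=5)
    print("depth =", depth, " average CV score =", scores.mean())
```

Note that the test set is never touched while comparing depths; it stays in reserve for a final check of the chosen model.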
In [6]:
## Let's try on a better model:
## Remember cupcake from earlier?
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
CV_score = cross_val_score(cupcake, X_train, Y_train, cv=5)
print("We have 5 scores: ", CV_score)
print("Averaging them gives us an accuracy score of:", CV_score.mean())
We have 5 scores:  [0.97222222 0.96875    0.97212544 0.97909408 0.97560976]
Averaging them gives us an accuracy score of: 0.9735602981029811
In [7]:
## and the old fashioned way:
cupcake = neighbors.KNeighborsClassifier(n_neighbors=15, p=2)
cupcake.fit(X_train,Y_train)
Y_pred_train = cupcake.predict(X_train)
Y_pred = cupcake.predict(X_test)
print ("Training Accuracy is ", accuracy_score(Y_train,Y_pred_train))
print ("Testing Accuracy is ", accuracy_score(Y_test,Y_pred))
Training Accuracy is  0.9805149617258176
Testing Accuracy is  0.9833333333333333
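scikit-learn also automates the "compare models by CV score" loop with `GridSearchCV`, which cross-validates every parameter combination and refits the best one. A sketch applied to the kNN model above (the candidate values of `n_neighbors` are arbitrary choices for illustration):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import neighbors, datasets

digits = datasets.load_digits()
X, Y = digits['data'], digits['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2023)

## Try several values of n_neighbors, scoring each by 5-fold CV:
search = GridSearchCV(neighbors.KNeighborsClassifier(p=2),
                      {'n_neighbors': [1, 5, 15, 25]}, cv=5)
search.fit(X_train, Y_train)

print("Best parameters:", search.best_params_)
print("Best average CV score:", search.best_score_)
## Only now do we look at the held-out test set:
print("Test accuracy:", search.score(X_test, Y_test))
```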