{"id":419,"date":"2024-02-29T14:29:00","date_gmt":"2024-02-29T19:29:00","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/chapter\/cross-validation\/"},"modified":"2024-02-29T14:31:36","modified_gmt":"2024-02-29T19:31:36","slug":"cross-validation","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/chapter\/cross-validation\/","title":{"raw":"Cross Validation","rendered":"Cross Validation"},"content":{"raw":"<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\r\n<h1 id=\"Cross-Validation:-Training-more-efficiently.\">Cross Validation: Training more efficiently.<a class=\"anchor-link\" href=\"-Training-more-efficiently.\">\u00b6<\/a><\/h1>\r\nWe will be using a technique called cross-validation. Basically:\r\n<ol>\r\n \t<li>Shuffle the training set and chop the training set into [latex]k[\/latex] parts.<\/li>\r\n \t<li>Train on k-1 parts, and test on the last one - we call that the validation set. That's the score we are interested in.<\/li>\r\n \t<li>Repeat - set aside a different part and repeat.<\/li>\r\n \t<li>We will end up with [latex]k[\/latex] different models. 
Each one gives a validation score, and we average those scores to judge the model.<\/li>\r\n \t<li>Repeat the whole process on a new model, then compare the averages to find the best one.<\/li>\r\n<\/ol>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[1]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## Load up our libraries:<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.model_selection<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">cross_val_score<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">tree<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">neighbors<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.metrics<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">accuracy_score<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.model_selection<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">train_test_split<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">datasets<\/span>\r\n\r\n<span class=\"c1\">## and our data:<\/span>\r\n<span class=\"n\">digits<\/span> <span class=\"o\">=<\/span> <span class=\"n\">datasets<\/span><span class=\"o\">.<\/span><span class=\"n\">load_digits<\/span><span class=\"p\">()<\/span>\r\n\r\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">digits<\/span><span class=\"p\">[<\/span><span 
class=\"s1\">'data'<\/span><span class=\"p\">]<\/span>   \r\n<span class=\"n\">Y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">digits<\/span><span class=\"p\">[<\/span><span class=\"s1\">'target'<\/span><span class=\"p\">]<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[2]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">##and split it right away:<\/span>\r\n<span class=\"c1\">## Fixing the random seed for demonstration purposes:<\/span>\r\n\r\n<span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_test<\/span> <span class=\"o\">=<\/span> <span class=\"n\">train_test_split<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span><span class=\"n\">Y<\/span><span class=\"p\">,<\/span> <span class=\"n\">test_size<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.2<\/span><span class=\"p\">,<\/span><span class=\"n\">random_state<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">2023<\/span><span class=\"p\">)<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt 
In\u00a0[3]">
jp-InputArea-prompt\">In\u00a0[3]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## learn about cross-validation<\/span>\r\n<span class=\"o\">?<\/span>cross_val_score\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[8]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## and start right away:<\/span>\r\n\r\n<span class=\"c1\">## Here we have a tree model with a maximum depth of 6:<\/span>\r\n<span class=\"n\">tree_model<\/span><span class=\"o\">=<\/span><span class=\"n\">tree<\/span><span class=\"o\">.<\/span><span class=\"n\">DecisionTreeClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">max_depth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">6<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span class=\"mi\">2023<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"c1\">## Train on 2\/3 of the data, testing on the remaining third.<\/span>\r\n<span class=\"c1\">## Do this 3 times (cv = 3), once per part:<\/span>\r\n<span class=\"n\">CV_score<\/span> <span class=\"o\">=<\/span><span class=\"n\">cross_val_score<\/span><span class=\"p\">(<\/span><span class=\"n\">tree_model<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">cv<\/span> <span class=\"o\">=<\/span><span class=\"mi\">3<\/span><span 
class=\"p\">)<\/span>\r\n \r\n\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"We have 5 scores: \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"c1\">## CV_Score is an array, so we can use things like .mean() to find the average<\/span>\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Averaging them gives us an accuracy score of:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell-outputWrapper\">\r\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\r\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\r\n<div class=\"jp-OutputArea-child\">\r\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\r\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\r\n<pre>We have 5 scores:  [0.80793319 0.76409186 0.78914405]\r\nAveraging them gives us an accuracy score of: 0.7870563674321502\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[5]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## when we actually make the model we get:<\/span>\r\n<span class=\"n\">tree_1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree<\/span><span class=\"o\">.<\/span><span class=\"n\">DecisionTreeClassifier<\/span><span 
class=\"p\">(<\/span><span class=\"n\">max_depth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">6<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span class=\"mi\">2023<\/span><span class=\"p\">)<\/span> \r\n<span class=\"n\">tree_1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">Y_pred<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\r\n<span class=\"n\">Y_pred_train<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Training Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred_train<\/span><span class=\"p\">))<\/span>\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Testing Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred<\/span><span class=\"p\">))<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell-outputWrapper\">\r\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\r\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\r\n<div 
class=\"jp-OutputArea-child\">\r\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\r\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\r\n<pre>Training Accuracy is  0.8691718858733473\r\nTesting Accuracy is  0.7944444444444444\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\"><\/div>\r\n<div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\r\n\r\nNote that while there is still a difference between are training score and test score, the testing accuracy is similar to the average of the CV score above\r\n\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[6]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## Let's try on a better model:<\/span>\r\n<span class=\"c1\">## Remember cupcake from earlier?<\/span>\r\n\r\n<span class=\"n\">cupcake<\/span> <span class=\"o\">=<\/span><span class=\"n\">neighbors<\/span><span class=\"o\">.<\/span><span class=\"n\">KNeighborsClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">n_neighbors<\/span><span class=\"o\">=<\/span><span class=\"mi\">15<\/span><span class=\"p\">,<\/span> <span class=\"n\">p<\/span> <span class=\"o\">=<\/span><span class=\"mi\">2<\/span> <span 
class=\"p\">)<\/span>\r\n<span class=\"n\">CV_score<\/span> <span class=\"o\">=<\/span><span class=\"n\">cross_val_score<\/span><span class=\"p\">(<\/span><span class=\"n\">cupcake<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">cv<\/span> <span class=\"o\">=<\/span><span class=\"mi\">5<\/span><span class=\"p\">)<\/span>\r\n \r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"We have 5 scores: \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"p\">)<\/span>\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Averaging them gives us an accuracy score of:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell-outputWrapper\">\r\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\r\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\r\n<div class=\"jp-OutputArea-child\">\r\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\r\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\r\n<pre>We have 5 scores:  [0.97222222 0.96875    0.97212544 0.97909408 0.97560976]\r\nAveraging them gives us an accuracy score of: 0.9735602981029811\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\r\n<div class=\"jp-Cell-inputWrapper\">\r\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\r\n<div class=\"jp-InputArea jp-Cell-inputArea\">\r\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[7]:<\/div>\r\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\r\n<div 
class=\"CodeMirror cm-s-jupyter\">\r\n<div class=\"highlight hl-ipython3\">\r\n<pre><span class=\"c1\">## and the old fashioned way:<\/span>\r\n\r\n<span class=\"n\">cupcake<\/span> <span class=\"o\">=<\/span><span class=\"n\">neighbors<\/span><span class=\"o\">.<\/span><span class=\"n\">KNeighborsClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">n_neighbors<\/span><span class=\"o\">=<\/span><span class=\"mi\">15<\/span><span class=\"p\">,<\/span> <span class=\"n\">p<\/span> <span class=\"o\">=<\/span><span class=\"mi\">2<\/span> <span class=\"p\">)<\/span>\r\n<span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">Y_pred_train<\/span> <span class=\"o\">=<\/span> <span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">)<\/span> \r\n<span class=\"n\">Y_pred<\/span> <span class=\"o\">=<\/span> <span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Training Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred_train<\/span><span class=\"p\">))<\/span>\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Testing Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred<\/span><span 
class=\"p\">))<\/span>\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<div class=\"jp-Cell-outputWrapper\">\r\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\r\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\r\n<div class=\"jp-OutputArea-child\">\r\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\r\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\r\n<pre>Training Accuracy is  0.9805149617258176\r\nTesting Accuracy is  0.9833333333333333\r\n<\/pre>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>","rendered":"<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h1 id=\"Cross-Validation:-Training-more-efficiently.\">Cross Validation: Training more efficiently.<a class=\"anchor-link\" href=\"-Training-more-efficiently.\">\u00b6<\/a><\/h1>\n<p>We will be using a technique called cross-validation. Basically:<\/p>\n<ol>\n<li>Shuffle the training set and chop the training set into [latex]k[\/latex] parts.<\/li>\n<li>Train on k-1 parts, and test on the last one &#8211; we call that the validation set. That&#8217;s the score we are interested in.<\/li>\n<li>Repeat &#8211; set aside a different part and repeat.<\/li>\n<li>We will end up with [latex]k[\/latex] different models. 
Each one gives a validation score, and we average those scores to judge the model.<\/li>\n<li>Repeat the whole process on a new model, then compare the averages to find the best one.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[1]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## Load up our libraries:<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.model_selection<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">cross_val_score<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">tree<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">neighbors<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.metrics<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">accuracy_score<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn.model_selection<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">train_test_split<\/span>\r\n<span class=\"kn\">from<\/span> <span class=\"nn\">sklearn<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">datasets<\/span>\r\n\r\n<span class=\"c1\">## and our data:<\/span>\r\n<span class=\"n\">digits<\/span> <span class=\"o\">=<\/span> <span class=\"n\">datasets<\/span><span class=\"o\">.<\/span><span class=\"n\">load_digits<\/span><span class=\"p\">()<\/span>\r\n\r\n<span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">digits<\/span><span class=\"p\">[<\/span><span class=\"s1\">'data'<\/span><span class=\"p\">]<\/span> 
  \r\n<span class=\"n\">Y<\/span> <span class=\"o\">=<\/span> <span class=\"n\">digits<\/span><span class=\"p\">[<\/span><span class=\"s1\">'target'<\/span><span class=\"p\">]<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[2]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">##and split it right away:<\/span>\r\n<span class=\"c1\">## Fixing the random seed for demonstration purposes:<\/span>\r\n\r\n<span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_test<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_test<\/span> <span class=\"o\">=<\/span> <span class=\"n\">train_test_split<\/span><span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"p\">,<\/span><span class=\"n\">Y<\/span><span class=\"p\">,<\/span> <span class=\"n\">test_size<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.2<\/span><span class=\"p\">,<\/span><span class=\"n\">random_state<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">2023<\/span><span class=\"p\">)<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[3]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" 
data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## learn about cross-validation<\/span>\r\n<span class=\"o\">?<\/span>cross_val_score\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[8]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## and start right away:<\/span>\r\n\r\n<span class=\"c1\">##Here we have a tree model of depth 3:<\/span>\r\n<span class=\"n\">tree_model<\/span><span class=\"o\">=<\/span><span class=\"n\">tree<\/span><span class=\"o\">.<\/span><span class=\"n\">DecisionTreeClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">max_depth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">6<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span class=\"mi\">2023<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"c1\">## Train on 4\/5 of the data, testing on the 5th.<\/span>\r\n<span class=\"c1\">## Do this 5 times,<\/span>\r\n<span class=\"n\">CV_score<\/span> <span class=\"o\">=<\/span><span class=\"n\">cross_val_score<\/span><span class=\"p\">(<\/span><span class=\"n\">tree_model<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">cv<\/span> <span class=\"o\">=<\/span><span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\r\n \r\n\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"We have 5 scores: \"<\/span><span 
class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"c1\">## CV_Score is an array, so we can use things like .mean() to find the average<\/span>\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Averaging them gives us an accuracy score of:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre>We have 5 scores:  [0.80793319 0.76409186 0.78914405]\r\nAveraging them gives us an accuracy score of: 0.7870563674321502\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[5]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## when we actually make the model we get:<\/span>\r\n<span class=\"n\">tree_1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree<\/span><span class=\"o\">.<\/span><span class=\"n\">DecisionTreeClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">max_depth<\/span> <span class=\"o\">=<\/span> <span class=\"mi\">6<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\">=<\/span><span 
class=\"mi\">2023<\/span><span class=\"p\">)<\/span> \r\n<span class=\"n\">tree_1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">Y_pred<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\r\n<span class=\"n\">Y_pred_train<\/span> <span class=\"o\">=<\/span> <span class=\"n\">tree_1<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Training Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred_train<\/span><span class=\"p\">))<\/span>\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Testing Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred<\/span><span class=\"p\">))<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre>Training Accuracy is  0.8691718858733473\r\nTesting Accuracy is 
 0.7944444444444444\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\"><\/div>\n<div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p>Note that while there is still a gap between our training score (about 0.87) and our test score (about 0.79), the testing accuracy is close to the average CV score above (about 0.79).<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[6]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## Let's try it on a better model:<\/span>\r\n<span class=\"c1\">## Remember cupcake from earlier?<\/span>\r\n\r\n<span class=\"n\">cupcake<\/span> <span class=\"o\">=<\/span><span class=\"n\">neighbors<\/span><span class=\"o\">.<\/span><span class=\"n\">KNeighborsClassifier<\/span><span class=\"p\">(<\/span><span class=\"n\">n_neighbors<\/span><span class=\"o\">=<\/span><span class=\"mi\">15<\/span><span class=\"p\">,<\/span> <span class=\"n\">p<\/span> <span class=\"o\">=<\/span><span class=\"mi\">2<\/span> <span class=\"p\">)<\/span>\r\n<span class=\"n\">CV_score<\/span> <span class=\"o\">=<\/span><span class=\"n\">cross_val_score<\/span><span class=\"p\">(<\/span><span class=\"n\">cupcake<\/span><span class=\"p\">,<\/span> <span class=\"n\">X_train<\/span><span class=\"p\">,<\/span> <span class=\"n\">Y_train<\/span><span 
class=\"p\">,<\/span> <span class=\"n\">cv<\/span> <span class=\"o\">=<\/span><span class=\"mi\">5<\/span><span class=\"p\">)<\/span>\r\n \r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"We have 5 scores: \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"p\">)<\/span>\r\n<span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"Averaging them gives us an accuracy score of:\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">CV_score<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span><span class=\"p\">())<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre>We have 5 scores:  [0.97222222 0.96875    0.97212544 0.97909408 0.97560976]\r\nAveraging them gives us an accuracy score of: 0.9735602981029811\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\"><\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[7]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"CodeMirror cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\">\n<pre><span class=\"c1\">## and the old fashioned way:<\/span>\r\n\r\n<span class=\"n\">cupcake<\/span> <span class=\"o\">=<\/span><span class=\"n\">neighbors<\/span><span class=\"o\">.<\/span><span class=\"n\">KNeighborsClassifier<\/span><span class=\"p\">(<\/span><span 
class=\"n\">n_neighbors<\/span><span class=\"o\">=<\/span><span class=\"mi\">15<\/span><span class=\"p\">,<\/span> <span class=\"n\">p<\/span> <span class=\"o\">=<\/span><span class=\"mi\">2<\/span> <span class=\"p\">)<\/span>\r\n<span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">fit<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"n\">Y_pred_train<\/span> <span class=\"o\">=<\/span> <span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_train<\/span><span class=\"p\">)<\/span> \r\n<span class=\"n\">Y_pred<\/span> <span class=\"o\">=<\/span> <span class=\"n\">cupcake<\/span><span class=\"o\">.<\/span><span class=\"n\">predict<\/span><span class=\"p\">(<\/span><span class=\"n\">X_test<\/span><span class=\"p\">)<\/span>\r\n\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Training Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_train<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred_train<\/span><span class=\"p\">))<\/span>\r\n<span class=\"nb\">print<\/span> <span class=\"p\">(<\/span><span class=\"s2\">\"Testing Accuracy is \"<\/span><span class=\"p\">,<\/span> <span class=\"n\">accuracy_score<\/span><span class=\"p\">(<\/span><span class=\"n\">Y_test<\/span><span class=\"p\">,<\/span><span class=\"n\">Y_pred<\/span><span class=\"p\">))<\/span>\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\"><\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText 
jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre>Training Accuracy is  0.9805149617258176\r\nTesting Accuracy is  0.9833333333333333\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":883,"menu_order":11,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-419","chapter","type-chapter","status-publish","hentry"],"part":64,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/419","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/users\/883"}],"version-history":[{"count":3,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/419\/revisions"}],"predecessor-version":[{"id":422,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/419\/revisions\/422"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/parts\/64"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/419\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/media?parent=419"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapter-type?post=419"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/contributor?post=419"},{"taxonomy":"license","embeddable":true
,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/license?post=419"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}