{"id":336,"date":"2024-01-11T12:32:30","date_gmt":"2024-01-11T17:32:30","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/?post_type=chapter&#038;p=336"},"modified":"2024-01-22T14:32:36","modified_gmt":"2024-01-22T19:32:36","slug":"decision-trees-and-split-points","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/chapter\/decision-trees-and-split-points\/","title":{"raw":"Decision Trees and split points","rendered":"Decision Trees and split points"},"content":{"raw":"<h2>Making a decision on split points:<\/h2>\r\nWe are going to look at a very small example of a decision tree, and then expand a bit.\r\n\r\nHere is our 3 item, two variable problem.\u00a0 We are trying to predict whether a customer purchased a new financial product, based solely on their income:\r\n<table class=\"aligncenter\" style=\"border-collapse: collapse;width: 50%;height: 72px\" border=\"0\">\r\n<tbody>\r\n<tr style=\"height: 18px\">\r\n<td style=\"width: 9.55403%;height: 18px\"><strong>ID<\/strong><\/td>\r\n<td style=\"width: 8.32127%;height: 18px\"><strong>Income<\/strong><\/td>\r\n<td style=\"width: 11.2763%;height: 18px\"><strong>Purchase<\/strong><\/td>\r\n<\/tr>\r\n<tr style=\"height: 18px\">\r\n<td style=\"width: 9.55403%;height: 18px\">1<\/td>\r\n<td style=\"width: 8.32127%;height: 18px\">$12,000<\/td>\r\n<td style=\"width: 11.2763%;height: 18px\">0 (NO)<\/td>\r\n<\/tr>\r\n<tr style=\"height: 18px\">\r\n<td style=\"width: 9.55403%;height: 18px\">2<\/td>\r\n<td style=\"width: 8.32127%;height: 18px\">$15,000<\/td>\r\n<td style=\"width: 11.2763%;height: 18px\">0 (NO)<\/td>\r\n<\/tr>\r\n<tr style=\"height: 18px\">\r\n<td style=\"width: 9.55403%;height: 18px\">3<\/td>\r\n<td style=\"width: 8.32127%;height: 18px\">$25,000<\/td>\r\n<td style=\"width: 11.2763%;height: 18px\">1 (YES)<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nOur Tree will look like this:\r\n\r\n<img class=\"size-medium wp-image-337 aligncenter\" 
src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-300x179.png\" alt=\"A simple tree diagram\" width=\"300\" height=\"179\" \/>\r\n\r\nWhat we have to do is calculate the split point: what should be our [latex]??[\/latex] in the diagram.\r\n\r\nWhat we do is sort all of our data in order by income.\u00a0 Our candidate split points are halfway between each consecutive pair.\r\n\r\nHere we have:\r\n<ul>\r\n \t<li>[latex]\\$13,500 = \\frac{\\$12,000 + \\$15,000}{2}[\/latex]<\/li>\r\n \t<li>[latex]\\$20,000 = \\frac{\\$15,000 + \\$25,000}{2}[\/latex]<\/li>\r\n<\/ul>\r\nFor each of these, we will count the number of 0s and 1s in each box, and try to pick the \"best\" one:\r\n\r\n[caption id=\"attachment_338\" align=\"aligncenter\" width=\"300\"]<img class=\"size-medium wp-image-338\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png\" alt=\"Tree with a split point of $13,500 \" width=\"300\" height=\"179\" \/> There are now two boxes.[\/caption]\r\n\r\nNow we calculate the probabilities of each outcome:\r\n<ul>\r\n \t<li>[latex]P(0 | \\leq \\$13,500) = 1[\/latex]<\/li>\r\n \t<li>[latex]P(1 | \\leq \\$13,500) = 0[\/latex]<\/li>\r\n \t<li>[latex]P(0 | &gt; \\$13,500) = 0.5[\/latex]<\/li>\r\n \t<li>[latex]P(1 | &gt; \\$13,500) = 0.5[\/latex]<\/li>\r\n<\/ul>\r\nWe will analyze this properly soon, but let's redo this analysis for our second candidate split point:\r\n\r\n&nbsp;\r\n\r\n[caption id=\"attachment_340\" align=\"aligncenter\" width=\"300\"]<img class=\"size-medium wp-image-340\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-300x179.png\" alt=\"A tree at split point of $20k\" width=\"300\" height=\"179\" \/> This is called a Pure Split Point[\/caption]\r\n\r\nThis is called a \"Pure Split Point\" - that is, each box has either all 0 or all 1.\u00a0 Our 
probabilities are:\r\n<ul>\r\n \t<li>[latex]P(0 | \\leq \\$20,000) = 1[\/latex]<\/li>\r\n \t<li>[latex]P(1 | \\leq \\$20,000) = 0[\/latex]<\/li>\r\n \t<li>[latex]P(0 | &gt;\\$20,000) = 0[\/latex]<\/li>\r\n \t<li>[latex]P(1 | &gt; \\$20,000) = 1[\/latex]<\/li>\r\n<\/ul>\r\nThis seems like a better choice!\u00a0 But let's quantify this:\r\n\r\nThere are three main methods of choosing the best split point, and unsurprisingly they all involve statistical measures of dispersion:\r\n<ul>\r\n \t<li>Gini (named after a person, not an acronym!)<\/li>\r\n \t<li>Entropy<\/li>\r\n \t<li>[latex]\\chi^2[\/latex] measure of independence<\/li>\r\n<\/ul>\r\n&nbsp;\r\n<h2>Gini<\/h2>\r\nIn the Gini method, we find the Gini index for each split point, and then compare.\u00a0 The Gini index is a type of weighted average of the probabilities that we calculated above.\u00a0 Here, I'm going to use the word \"boxes\" to describe each of the end points: being less than or greater than the split point.\u00a0 In our simple example we have two boxes.\u00a0 I will use classes to describe the outcomes that are possible:\u00a0 0 or 1 in our case.\u00a0 It's easier to look at classification that uses only 2 categories or classes, but we can extend to more classes.\u00a0 We will also use [latex]N[\/latex] as the total number of data points.\u00a0 Here, [latex]N = 3[\/latex].\r\n\r\n&nbsp;\r\n\r\nWe calculate this via the equation:\r\n\r\n\\[ Gini_{split} =\\Sigma_{boxes} \\left[ \\frac{\\text{# of points in the box}}{N} \\cdot Gini_{box}\\right] \\]\r\n\r\nwhere\r\n\r\n\\[Gini_{box} =1 - \\Sigma_{classes}\u00a0 \\left(\u00a0 \\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right)^2\\]\r\n\r\nThis is easier to calculate than the math makes it look!\r\n\r\nRecall the chart:\r\n\r\n[caption id=\"attachment_338\" align=\"alignnone\" width=\"300\"]<img class=\"size-medium wp-image-338\" 
src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png\" alt=\"Tree with a split point of $13,500\" width=\"300\" height=\"179\" \/> There are now two boxes.[\/caption]\r\n\r\nWe will calculate the Gini index for each box:\r\n\r\n[latex]\r\n\\begin{align*}\r\nGini_{&lt;13,500} &amp;= 1 - \\left[ \\left(\\frac{ \\text{# of 0 &lt; 13,500}} {\\text{# &lt; 13,500}} \\right)^2 +\\left(\\frac {\\text{# of 1&lt; 13,500}} {\\text{# &lt; 13,500}}\\right)^2 \\right]\\\\\r\n\r\n&amp;=1- \\left[ \\left( \\frac{1}{1}\\right)^2+ \\left(\\frac{0}{1}\\right)^2\u00a0 \\right]\\\\\r\n\r\n&amp; = 1 - \\left[ 1 + 0\u00a0 \\right]\\\\\r\n\r\n&amp; = 0\\end{align*}\r\n\r\n[\/latex]\r\n\r\n&nbsp;\r\n\r\nand:\r\n[latex]\r\n\\begin{align*}\r\nGini_{&gt;13,500} &amp;= 1 - \\left[ \\left(\\frac{ \\text{# of 0 &gt; 13,500}} {\\text{# &gt; 13,500}} \\right)^2 +\\left(\\frac {\\text{# of 1&gt; 13,500}} {\\text{# &gt; 13,500}}\\right)^2 \\right]\\\\\r\n\r\n&amp;=1- \\left[ \\left( \\frac{1}{2}\\right)^2+ \\left(\\frac{1}{2}\\right)^2\u00a0 \\right]\\\\\r\n\r\n&amp; = 1 - \\left[ \\frac{1}{4}+\\frac{1}{4}\u00a0 \\right]\\\\\r\n\r\n&amp; =\\frac{1}{2}\r\n\r\n\\end{align*}\r\n\r\n[\/latex]\r\n\r\n0 is the lowest a Gini index can be; we call such a box pure.\u00a0 With two classes, 0.5 is the worst case: the box is split evenly between the classes.\r\n\r\nThe Gini index for the split is given by:\r\n\r\n\\[ Gini_{13,500} = \\left( \\frac{\\text{# &lt; 13,500}}{N}\\right) \\cdot Gini_{&lt; 13,500}+ \\left( \\frac{\\text{# &gt;13,500}}{N}\\right) \\cdot Gini_{&gt;13,500}\\]\r\n\r\nOr:\r\n\r\n\\[Gini_{13,500} = \\frac{1}{3}\\cdot 0 + \\frac{2}{3}\\cdot \\frac{1}{2} = \\frac{1}{3}\\]\r\n\r\nNow we have the fun of repeating for our second split point, $20,000.\u00a0 This will be easier, as we do have a pure split:\r\n\r\n\\[Gini_{&lt;20,000} = 1 -\\left[\\left( \\frac{2}{2}\\right)^2 +\\left( \\frac{0}{2}\\right)^2 \\right] = 1-1=0\\]\r\n\r\nand\r\n\r\n\\[Gini_{&gt;20,000} = 1 -\\left[\\left( 
\\frac{0}{1}\\right)^2 +\\left( \\frac{1}{1}\\right)^2 \\right] = 1-1=0\\]\r\n\r\nWhich gives us:\r\n\r\n\\[Gini_{20,000} = \\frac{2}{3}\\cdot 0 + \\frac{1}{3}\\cdot 0 = 0\\]\r\n\r\nHence, when we minimize the Gini index, we choose $20,000 as a split point, and our tree now looks like this:\r\n\r\n<img class=\"size-medium wp-image-366 aligncenter\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-300x179.png\" alt=\"\" width=\"300\" height=\"179\" \/>\r\n\r\nOf course, doing this by hand is time consuming!\u00a0 With more than three points, we really should be using a computer.\r\n<h2>Entropy<\/h2>\r\nCalculating these by hand is not a fun task, but we will briefly look at a different measure, entropy, which we are again trying to minimize.\r\n\r\n\\[Entropy_{box} = -\\Sigma_{classes} \\left( \\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right) \\cdot \\ln \\left(\\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right)\\]\r\n\r\nAs with the Gini index, a pure box has an entropy of 0 (taking [latex]0 \\cdot \\ln 0 = 0[\/latex]), and the boxes are combined using the same weighted average.\r\n<h2>Decisions\u00a0 \/\u00a0 hyperparameters<\/h2>\r\n<ul>\r\n \t<li>Should we look at all features, or just choose one at random?<\/li>\r\n \t<li>How many classes at each split point?<\/li>\r\n \t<li>How deep do we want the tree?<\/li>\r\n<\/ul>","rendered":"<h2>Making a decision on split points:<\/h2>\n<p>We are going to look at a very small example of a decision tree, and then expand a bit.<\/p>\n<p>Here is our 3 item, two variable problem.\u00a0 We are trying to predict whether a customer purchased a new financial product, based solely on their income:<\/p>\n<table class=\"aligncenter\" style=\"border-collapse: collapse;width: 50%;height: 72px\">\n<tbody>\n<tr style=\"height: 18px\">\n<td style=\"width: 9.55403%;height: 18px\"><strong>ID<\/strong><\/td>\n<td style=\"width: 8.32127%;height: 18px\"><strong>Income<\/strong><\/td>\n<td style=\"width: 11.2763%;height: 18px\"><strong>Purchase<\/strong><\/td>\n<\/tr>\n<tr style=\"height: 18px\">\n<td style=\"width: 
9.55403%;height: 18px\">1<\/td>\n<td style=\"width: 8.32127%;height: 18px\">$12,000<\/td>\n<td style=\"width: 11.2763%;height: 18px\">0 (NO)<\/td>\n<\/tr>\n<tr style=\"height: 18px\">\n<td style=\"width: 9.55403%;height: 18px\">2<\/td>\n<td style=\"width: 8.32127%;height: 18px\">$15,000<\/td>\n<td style=\"width: 11.2763%;height: 18px\">0 (NO)<\/td>\n<\/tr>\n<tr style=\"height: 18px\">\n<td style=\"width: 9.55403%;height: 18px\">3<\/td>\n<td style=\"width: 8.32127%;height: 18px\">$25,000<\/td>\n<td style=\"width: 11.2763%;height: 18px\">1 (YES)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Our Tree will look like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-337 aligncenter\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-300x179.png\" alt=\"A simple tree diagram\" width=\"300\" height=\"179\" srcset=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-300x179.png 300w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-65x39.png 65w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-225x135.png 225w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree-350x209.png 350w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/empty_tree.png 580w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>What we have to do is calculate the split point: what should be our [latex]??[\/latex] in the diagram.<\/p>\n<p>What we do is sort all of our data in order by income.\u00a0 Our candidate split points are halfway between each consecutive pair.<\/p>\n<p>Here we have:<\/p>\n<ul>\n<li>[latex]\\$13,500 = \\frac{\\$12,000 + \\$15,000}{2}[\/latex]<\/li>\n<li>[latex]\\$20,000 = \\frac{\\$15,000 + 
\\$25,000}{2}[\/latex]<\/li>\n<\/ul>\n<p>For each of these, we will count the number of 0s and 1s in each box, and try to pick the &#8220;best&#8221; one:<\/p>\n<figure id=\"attachment_338\" aria-describedby=\"caption-attachment-338\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-338\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png\" alt=\"Tree with a split point of $13,500\" width=\"300\" height=\"179\" srcset=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png 300w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-65x39.png 65w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-225x135.png 225w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-350x209.png 350w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1.png 580w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><figcaption id=\"caption-attachment-338\" class=\"wp-caption-text\">There are now two boxes.<\/figcaption><\/figure>\n<p>Now we calculate the probabilities of each outcome:<\/p>\n<ul>\n<li>[latex]P(0 | \\leq \\$13,500) = 1[\/latex]<\/li>\n<li>[latex]P(1 | \\leq \\$13,500) = 0[\/latex]<\/li>\n<li>[latex]P(0 | > \\$13,500) = 0.5[\/latex]<\/li>\n<li>[latex]P(1 | > \\$13,500) = 0.5[\/latex]<\/li>\n<\/ul>\n<p>We will analyze this properly soon, but let&#8217;s redo this analysis for our second candidate split point:<\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_340\" aria-describedby=\"caption-attachment-340\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium 
wp-image-340\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-300x179.png\" alt=\"A tree at split point of $20k\" width=\"300\" height=\"179\" srcset=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-300x179.png 300w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-65x39.png 65w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-225x135.png 225w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split-350x209.png 350w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_pure_split.png 580w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><figcaption id=\"caption-attachment-340\" class=\"wp-caption-text\">This is called a Pure Split Point<\/figcaption><\/figure>\n<p>This is called a &#8220;Pure Split Point&#8221; &#8211; that is, each box has either all 0 or all 1.\u00a0 Our probabilities are:<\/p>\n<ul>\n<li>[latex]P(0 | \\leq \\$20,000) = 1[\/latex]<\/li>\n<li>[latex]P(1 | \\leq \\$20,000) = 0[\/latex]<\/li>\n<li>[latex]P(0 | >\\$20,000) = 0[\/latex]<\/li>\n<li>[latex]P(1 | > \\$20,000) = 1[\/latex]<\/li>\n<\/ul>\n<p>This seems like a better choice!\u00a0 But let&#8217;s quantify this:<\/p>\n<p>There are three main methods of choosing the best split point, and unsurprisingly they all involve statistical measures of dispersion:<\/p>\n<ul>\n<li>Gini (named after a person, not an acronym!)<\/li>\n<li>Entropy<\/li>\n<li>[latex]\\chi^2[\/latex] measure of independence<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2>Gini<\/h2>\n<p>In the Gini method, we find the Gini index for each split point, and then compare.\u00a0 The Gini index is a type of weighted average of the probabilities that we calculated above.\u00a0 
Here, I&#8217;m going to use the word &#8220;boxes&#8221; to describe each of the end points: being less than or greater than the split point.\u00a0 In our simple example we have two boxes.\u00a0 I will use classes to describe the outcomes that are possible:\u00a0 0 or 1 in our case.\u00a0 It&#8217;s easier to look at classification that uses only 2 categories or classes, but we can extend to more classes.\u00a0 We will also use [latex]N[\/latex] as the total number of data points.\u00a0 Here, [latex]N = 3[\/latex].<\/p>\n<p>&nbsp;<\/p>\n<p>We calculate this via the equation:<\/p>\n<p>\\[ Gini_{split} =\\Sigma_{boxes} \\left[ \\frac{\\text{# of points in the box}}{N} \\cdot Gini_{box}\\right] \\]<\/p>\n<p>where<\/p>\n<p>\\[Gini_{box} =1 - \\Sigma_{classes}\u00a0 \\left(\u00a0 \\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right)^2\\]<\/p>\n<p>This is easier to calculate than the math makes it look!<\/p>\n<p>Recall the chart:<\/p>\n<figure id=\"attachment_338\" aria-describedby=\"caption-attachment-338\" style=\"width: 300px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-338\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png\" alt=\"Tree with a split point of $13,500\" width=\"300\" height=\"179\" srcset=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-300x179.png 300w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-65x39.png 65w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-225x135.png 225w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1-350x209.png 350w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_split_1.png 
580w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><figcaption id=\"caption-attachment-338\" class=\"wp-caption-text\">There are now two boxes.<\/figcaption><\/figure>\n<p>We will calculate the Gini index for each box:<\/p>\n<p>[latex]\\begin{align*}  Gini_{<13,500} &= 1 - \\left[ \\left(\\frac{ \\text{# of 0 < 13,500}} {\\text{# < 13,500}} \\right)^2 +\\left(\\frac {\\text{# of 1< 13,500}} {\\text{# < 13,500}}\\right)^2 \\right]\\\\    &=1- \\left[ \\left( \\frac{1}{1}\\right)^2+ \\left(\\frac{0}{1}\\right)^2\u00a0 \\right]\\\\    & = 1 - \\left[ 1 + 0\u00a0 \\right]\\\\    & = 0\\end{align*}[\/latex]\n\n&nbsp;\n\nand:\n[latex]\\begin{align*}  Gini_{>13,500} &= 1 - \\left[ \\left(\\frac{ \\text{# of 0 > 13,500}} {\\text{# > 13,500}} \\right)^2 +\\left(\\frac {\\text{# of 1> 13,500}} {\\text{# > 13,500}}\\right)^2 \\right]\\\\    &=1- \\left[ \\left( \\frac{1}{2}\\right)^2+ \\left(\\frac{1}{2}\\right)^2\u00a0 \\right]\\\\    & = 1 - \\left[ \\frac{1}{4}+\\frac{1}{4}\u00a0 \\right]\\\\    & =\\frac{1}{2}    \\end{align*}[\/latex]<\/p>\n<p>0 is the lowest a Gini index can be; we call such a box pure.\u00a0 With two classes, 0.5 is the worst case: the box is split evenly between the classes.<\/p>\n<p>The Gini index for the split is given by:<\/p>\n<p>\\[ Gini_{13,500} = \\left( \\frac{\\text{# &lt; 13,500}}{N}\\right) \\cdot Gini_{&lt; 13,500}+ \\left( \\frac{\\text{# &gt;13,500}}{N}\\right) \\cdot Gini_{&gt;13,500}\\]<\/p>\n<p>Or:<\/p>\n<p>\\[Gini_{13,500} = \\frac{1}{3}\\cdot 0 + \\frac{2}{3}\\cdot \\frac{1}{2} = \\frac{1}{3}\\]<\/p>\n<p>Now we have the fun of repeating for our second split point, $20,000.\u00a0 This will be easier, as we do have a pure split:<\/p>\n<p>\\[Gini_{&lt;20,000} = 1 -\\left[\\left( \\frac{2}{2}\\right)^2 +\\left( \\frac{0}{2}\\right)^2 \\right] = 1-1=0\\]<\/p>\n<p>and<\/p>\n<p>\\[Gini_{&gt;20,000} = 1 -\\left[\\left( \\frac{0}{1}\\right)^2 +\\left( \\frac{1}{1}\\right)^2 \\right] = 1-1=0\\]<\/p>\n<p>Which gives us:<\/p>\n<p>\\[Gini_{20,000} = \\frac{2}{3}\\cdot 0 + \\frac{1}{3}\\cdot 0 = 
0\\]<\/p>\n<p>Hence, when we minimize the Gini index, we choose $20,000 as a split point, and our tree now looks like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-366 aligncenter\" src=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-300x179.png\" alt=\"\" width=\"300\" height=\"179\" srcset=\"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-300x179.png 300w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-65x39.png 65w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-225x135.png 225w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end-350x209.png 350w, https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-content\/uploads\/sites\/1653\/2024\/01\/tree_end.png 580w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>Of course, doing this by hand is time consuming!\u00a0 With more than three points, we really should be using a computer.<\/p>\n<h2>Entropy<\/h2>\n<p>Calculating these by hand is not a fun task, but we will briefly look at a different measure, entropy, which we are again trying to minimize.<\/p>\n<p>\\[Entropy_{box} = -\\Sigma_{classes} \\left( \\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right) \\cdot \\ln \\left(\\frac{\\text{number of 0 or 1 in box}}{\\text{# of points in the box}}\\right)\\]<\/p>\n<p>As with the Gini index, a pure box has an entropy of 0 (taking [latex]0 \\cdot \\ln 0 = 0[\/latex]), and the boxes are combined using the same weighted average.<\/p>\n<h2>Decisions\u00a0 \/\u00a0 hyperparameters<\/h2>\n<ul>\n<li>Should we look at all features, or just choose one at random?<\/li>\n<li>How many classes at each split point?<\/li>\n<li>How deep do we want the 
tree?<\/li>\n<\/ul>\n","protected":false},"author":883,"menu_order":3,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-336","chapter","type-chapter","status-publish","hentry"],"part":64,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/336","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/users\/883"}],"version-history":[{"count":25,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/336\/revisions"}],"predecessor-version":[{"id":396,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/336\/revisions\/396"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/parts\/64"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapters\/336\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/media?parent=336"}],"wp:term":[{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/pressbooks\/v2\/chapter-type?post=336"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/contributor?post=336"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/businessanalytics\/wp-json\/wp\/v2\/license?post=336"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}