"

5 Decision Trees and split points

Making a decision on split points:

We are going to look at a very small example of a decision tree, and then expand a bit.

Here is our three-point, two-variable problem.  We are trying to predict whether a customer purchased a new financial product, based solely on their income:

ID   Income    Purchase
1    $12,000   0 (NO)
2    $15,000   0 (NO)
3    $25,000   1 (YES)

Our Tree will look like this:

A simple tree diagram

What we have to do is calculate the split point: what value should replace the ?? in the diagram?

What we do is sort all of our data in order by income.  Our candidate split points are halfway between each consecutive pair.

Here we have:

  • \( \$13{,}500 = \frac{\$12{,}000 + \$15{,}000}{2} \)
  • \( \$20{,}000 = \frac{\$15{,}000 + \$25{,}000}{2} \)
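As a quick illustration (a small Python sketch of my own, not from the text), the midpoints can be computed like this:

    # Candidate split points: sort the incomes, then take the midpoint of
    # each consecutive pair of values.
    incomes = [12_000, 15_000, 25_000]   # the three customers' incomes

    sorted_incomes = sorted(incomes)
    candidates = [(a + b) / 2 for a, b in zip(sorted_incomes, sorted_incomes[1:])]

    print(candidates)   # [13500.0, 20000.0]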

For each of these, we will count the number of 0s and 1s in each box, and try to pick the “best” split:

Tree with a split point of $13,500
There are now two boxes.

Now we calculate the probabilities of each outcome:

  • P(0 | < $13,500) = 1
  • P(1 | < $13,500) = 0
  • P(0 | > $13,500) = 0.5
  • P(1 | > $13,500) = 0.5
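Here is a small Python sketch of this counting step (the helper name box_probabilities is my own; I assume the left box takes incomes at or below the threshold):

    # Split the three customers into two boxes around a candidate threshold
    # and report the class proportions in each box.
    incomes   = [12_000, 15_000, 25_000]
    purchases = [0, 0, 1]

    def box_probabilities(threshold):
        below = [y for x, y in zip(incomes, purchases) if x <= threshold]
        above = [y for x, y in zip(incomes, purchases) if x > threshold]
        prob = lambda box, cls: box.count(cls) / len(box) if box else 0.0
        return {
            "P(0 | below)": prob(below, 0),
            "P(1 | below)": prob(below, 1),
            "P(0 | above)": prob(above, 0),
            "P(1 | above)": prob(above, 1),
        }

    print(box_probabilities(13_500))
    # {'P(0 | below)': 1.0, 'P(1 | below)': 0.0, 'P(0 | above)': 0.5, 'P(1 | above)': 0.5}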

We will analyze this properly soon, but let’s redo this analysis for our second candidate split point:

 

A tree at split point of $20k
This is called a Pure Split Point

This is called a “Pure Split Point” – that is, each box contains only 0s or only 1s.  Our probabilities are:

  • P(0 | < $20,000) = 1
  • P(1 | < $20,000) = 0
  • P(0 | > $20,000) = 0
  • P(1 | > $20,000) = 1
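Reusing the box_probabilities sketch from above (a hypothetical helper, not part of the text) with a threshold of $20,000 confirms the pure split:

    print(box_probabilities(20_000))
    # {'P(0 | below)': 1.0, 'P(1 | below)': 0.0, 'P(0 | above)': 0.0, 'P(1 | above)': 1.0}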

This seems like a better choice!  But let’s quantify this:

There are three main methods of choosing the best split point, and unsurprisingly they all involve statistical measures of dispersion:

  • Gini (named after a person, not an acronym!)
  • Entropy
  • χ² measure of independence

 

GINI

In the Gini method, we find the Gini index for each split point, and then compare.  The Gini index is a type of weighted average of the probabilities that we calculated above.  Here, I’m going to use the word “boxes” to describe each of the end points: being less than or greater than the split point.  In our simple example we have two boxes.  I will use “classes” to describe the outcomes that are possible: 0 or 1 in our case.  It’s easier to look at classification that uses only two categories or classes, but we can extend to more classes.  We will also use N for the total number of data points.  Here it is 3!

 

We calculate this via the equation:

\[
\text{GINI}_{\text{split}} = \sum_{\text{boxes}} \left[ \frac{\#\text{ of points in the box}}{N} \cdot \text{Gini}_{\text{box}} \right]
\]

where

\[
\text{Gini}_{\text{box}} = 1 - \sum_{\text{classes}} \left( \frac{\text{number of 0 or 1 in the box}}{\#\text{ of points in the box}} \right)^2
\]

This is easier to calculate than the notation makes it look!
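To make the two formulas concrete, here is a minimal Python sketch, assuming one numeric feature and binary classes (the function names and the use of ≤/> for the two boxes are my own choices, not the author’s):

    def gini_box(labels):
        # 1 minus the sum of squared class proportions within one box.
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def gini_split(incomes, purchases, threshold):
        # Weighted average of the box Gini values for one candidate split.
        below = [y for x, y in zip(incomes, purchases) if x <= threshold]
        above = [y for x, y in zip(incomes, purchases) if x > threshold]
        n = len(purchases)
        return (len(below) / n) * gini_box(below) + (len(above) / n) * gini_box(above)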

Recall the chart:

Tree with a split point of $13,500
There are now two boxes:

We will calculate the Gini number for each part:

\[
\text{Gini}_{<13{,}500} = 1 - \left[ \left( \frac{\#\text{ of 0 } < 13{,}500}{\# < 13{,}500} \right)^2 + \left( \frac{\#\text{ of 1 } < 13{,}500}{\# < 13{,}500} \right)^2 \right] = 1 - \left[ \left(\tfrac{1}{1}\right)^2 + \left(\tfrac{0}{1}\right)^2 \right] = 1 - [1 + 0] = 0
\]

and:

\[
\text{Gini}_{>13{,}500} = 1 - \left[ \left( \frac{\#\text{ of 0 } > 13{,}500}{\# > 13{,}500} \right)^2 + \left( \frac{\#\text{ of 1 } > 13{,}500}{\# > 13{,}500} \right)^2 \right] = 1 - \left[ \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 \right] = 1 - \left[ \tfrac{1}{4} + \tfrac{1}{4} \right] = \tfrac{1}{2}
\]

The lowest a Gini index can be is 0, which we call pure; 0.5 is the worst case scenario for two classes.

The Index is given by:

\[
\text{Gini}_{13{,}500} = \left( \frac{\# < 13{,}500}{\text{Total}} \right) \text{Gini}_{<13{,}500} + \left( \frac{\# > 13{,}500}{\text{Total}} \right) \text{Gini}_{>13{,}500}
\]

Or:

\[
\text{Gini}_{13{,}500} = \frac{1}{3} \cdot 0 + \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}
\]

Now we have the fun of repeating this for our second split point, $20,000.  This will be easier, as we have a pure split:

\[
\text{Gini}_{<20{,}000} = 1 - \left[ \left(\tfrac{2}{2}\right)^2 + \left(\tfrac{0}{2}\right)^2 \right] = 1 - 1 = 0
\]

and

\[
\text{Gini}_{>20{,}000} = 1 - \left[ \left(\tfrac{0}{1}\right)^2 + \left(\tfrac{1}{1}\right)^2 \right] = 1 - 1 = 0
\]

Which gives us:

\[
\text{Gini}_{20{,}000} = \frac{2}{3} \cdot 0 + \frac{1}{3} \cdot 0 = 0
\]
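As a check, the gini_split sketch from earlier (my own helper, not the author’s code) reproduces both hand calculations and picks the candidate with the smallest Gini index:

    incomes   = [12_000, 15_000, 25_000]
    purchases = [0, 0, 1]

    for threshold in (13_500, 20_000):
        print(threshold, gini_split(incomes, purchases, threshold))
    # 13500 0.333...  (i.e. 1/3)
    # 20000 0.0

    best = min((13_500, 20_000), key=lambda t: gini_split(incomes, purchases, t))
    print(best)   # 20000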

Hence, when we minimize the Gini index, we choose $20,000 as our split point; it replaces the ?? in our tree diagram.

Of course, doing this by hand is time consuming!  If we had more than three points, we would really want to use a computer.

Entropy

Calculating these by hand is not a fun task, but let’s briefly look at a different measure, entropy, which we are again trying to minimize.

\[
\text{Entropy}_{\text{box}} = - \sum_{\text{classes}} \frac{\#\text{ in class}}{N_{\text{box}}} \ln\left( \frac{\#\text{ in class}}{N_{\text{box}}} \right)
\]

where \(N_{\text{box}}\) is the number of points in the box.
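A minimal Python sketch of this box entropy (my own illustration, using the natural log; classes not present in a box simply contribute nothing):

    import math

    def entropy_box(labels):
        # Minus the sum over classes of p * ln(p), where p is the class
        # proportion within the box; pure boxes come out to 0.
        n = len(labels)
        total = 0.0
        for c in set(labels):
            p = labels.count(c) / n
            total -= p * math.log(p)
        return total

    print(entropy_box([0, 0]))   # 0.0 (pure box)
    print(entropy_box([0, 1]))   # 0.693... = ln(2), the worst case for two classes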


 

Decisions  /  hyperparameters

  • Should we look at all features, or just choose one at random? (See the sketch after this list.)
  • How many classes at each split point?
  • How deep do we want the tree?
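These decisions map onto tuning knobs in most tree libraries.  As a rough sketch (my own example, with illustrative values), here is how they appear as hyperparameters of scikit-learn’s DecisionTreeClassifier:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        criterion="gini",     # or "entropy": the split-quality measure
        max_features=None,    # consider all features, or a random subset at each split
        max_depth=3,          # how deep the tree is allowed to grow
        min_samples_split=2,  # smallest box we are willing to split further
    )

    # Fit on our tiny three-customer example (income as the only feature).
    X = [[12_000], [15_000], [25_000]]
    y = [0, 0, 1]
    clf.fit(X, y)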

License


Business Analytics Copyright © by Amy Goldlist is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
