5 Decision Trees and split points
Making a decision on split points:
We are going to look at a very small example of a decision tree, and then expand a bit.
Here is our three-item, two-variable problem. We are trying to predict whether a customer purchased a new financial product, based solely on their income:
ID | Income | Purchase |
1 | $12,000 | 0 (NO) |
2 | $15,000 | 0 (NO) |
3 | $25,000 | 1 (YES) |
Our tree will have a single decision node that compares income to a threshold, with a NO leaf on one side and a YES leaf on the other.
What we have to do is calculate the split point: what income threshold should we use?
What we do is sort all of our data in order by income. Our candidate split points are halfway between each consecutive pair.
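The midpoint rule above is easy to sketch in code. This is a minimal Python illustration (not from the notes); the function name `candidate_splits` is my own.

```python
incomes = [12_000, 15_000, 25_000]

def candidate_splits(values):
    """Midpoints between each consecutive pair of sorted values."""
    s = sorted(values)
    return [(a + b) / 2 for a, b in zip(s, s[1:])]

print(candidate_splits(incomes))  # [13500.0, 20000.0]
```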
Here we have two candidates: $13,500 (halfway between $12,000 and $15,000) and $20,000 (halfway between $15,000 and $25,000).
For each of these, we count the number of 0s and 1s in each branch and try to pick the "best" one.
Splitting at $13,500: the left box holds item 1 (one 0); the right box holds items 2 and 3 (one 0 and one 1).
Now we calculate the probabilities of each outcome: in the left box, P(0) = 1 and P(1) = 0; in the right box, P(0) = 1/2 and P(1) = 1/2.
We will analyze this properly soon, but let's redo this analysis for our second candidate split point, $20,000.
Splitting at $20,000: the left box holds items 1 and 2 (two 0s); the right box holds item 3 (one 1).
This is called a "pure split point" – that is, each box has either all 0s or all 1s. Our probabilities are P(0) = 1 in the left box and P(1) = 1 in the right box.
This seems like a better choice! But let’s quantify this:
There are three main methods of choosing the best split point, and unsurprisingly they all involve statistical measures of dispersion:
- Gini (named after a person, not an acronym!)
- Entropy
- Chi-square (a statistical measure of independence)
Gini
In the Gini method, we find the Gini index for each split point, and then compare. The Gini index is a type of weighted average of the probabilities that we calculated above. Here, I'm going to use the word "boxes" to describe each of the end points: being less than or greater than the split point. In our simple example we have two boxes. I will use "classes" to describe the outcomes that are possible: 0 or 1 in our case. It's easier to look at classification that uses only 2 categories or classes, but we can extend to more classes. We will also use the fraction of items landing in each box as the weights.
We calculate this via the equation:
G_box = 1 − Σ_k (p_k)²
where p_k is the proportion of items in that box belonging to class k, and the sum runs over all classes.
This is easier to calculate than the formula makes it look!
Recall the chart for the $13,500 split: the left box holds one 0, and the right box holds one 0 and one 1.
We will calculate the Gini number for each part:
Left box: 1 − 1² = 0. Right box: 1 − ((1/2)² + (1/2)²) = 1 − 1/2 = 0.5.
0 is the lowest a Gini number can be (we call such a box pure), and 0.5 is the worst-case scenario for two classes.
The index for the split is given by the weighted average of the box values, weighted by the fraction of items in each box:
G_split = (1/3)(0) + (2/3)(0.5)
Or:
G_split = 1/3 ≈ 0.333
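The per-box Gini numbers and the weighted index for the $13,500 split can be checked with a short Python sketch (my own code, not from the notes):

```python
def gini(labels):
    """Gini number of one box: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Split at $13,500: left box holds one 0, right box holds one 0 and one 1.
left, right = [0], [0, 1]
n = len(left) + len(right)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
print(gini(left), gini(right), weighted)  # 0.0, 0.5, then roughly 0.333
```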
Now we have the fun of repeating for our second split point, $20,000. This will be easier, as we do have a pure split:
G_left = 1 − 1² = 0
and
G_right = 1 − 1² = 0
Which gives us:
G_split = (2/3)(0) + (1/3)(0) = 0
Hence, when we minimize the Gini index, we choose $20,000 as the split point, and our tree now splits on Income < $20,000: NO on the left, YES on the right.
Of course, doing this by hand is time consuming! If we had more than three points, we would really want a computer.
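Here is what letting the computer do it might look like: a minimal brute-force search over every midpoint split, picking the one with the lowest weighted Gini. This is my own sketch (the name `best_split` is an assumption, not from the notes), not a production implementation.

```python
def best_split(xs, ys):
    """Try every midpoint split of xs and return (threshold, weighted Gini)
    for the split with the lowest weighted Gini over labels ys."""
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    pairs = sorted(zip(xs, ys))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        t = (a + b) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (t, score)
    return best

print(best_split([12_000, 15_000, 25_000], [0, 0, 1]))  # (20000.0, 0.0)
```

On our three-point example it recovers the $20,000 pure split with a weighted Gini of 0, matching the hand calculation.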
Entropy
Calculating these by hand is not a fun task, but let's briefly look at a different measure, entropy, which we are again trying to minimize. The entropy of a box is H = −Σ_k p_k log₂(p_k), and the entropy of a split is the same kind of weighted average over boxes that we used for Gini.
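The per-box entropy can be sketched in a few lines of Python (my own code, assuming the two-class setup from our example):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of one box: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    ps = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in ps)

print(entropy([0, 1]))  # 1.0, the worst case for two classes
print(entropy([0, 0]))  # a pure box has entropy 0
```

Note that, like Gini, entropy is 0 for a pure box and largest for a 50/50 box, so minimizing it again favors the $20,000 split.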
Decisions / hyperparameters
- Should we look at all features, or just choose one randomly?
- How many branches at each split point?
- How deep do we want the tree?