5 Decision Trees and split points
Making a decision on split points:
We are going to look at a very small example of a decision tree, and then expand a bit.
Here is our three-item, two-variable problem. We are trying to predict whether a customer purchased a new financial product, based solely on their income:
ID | Income  | Purchase
---|---------|---------
1  | $12,000 | 0 (NO)
2  | $15,000 | 0 (NO)
3  | $25,000 | 1 (YES)
Our tree will look like this:

    Is Income < s ?
      yes -> left box
      no  -> right box

What we have to do is calculate the split point: what value should $s$ be in the diagram?
What we do is sort all of our data by income. Our candidate split points are halfway between each consecutive pair. Here we have:

- (12,000 + 15,000) / 2 = $13,500
- (15,000 + 25,000) / 2 = $20,000
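To check these with code, here is a minimal sketch in Python (the variable names are my own):

```python
incomes = sorted([12_000, 15_000, 25_000])

# Candidate split points: the midpoint between each consecutive pair.
candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]
print(candidates)  # [13500.0, 20000.0]
```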
For each of these, we will count the number of 0s and 1s in each box, and try to pick the "best" one. For the first candidate, $13,500:

- Income < $13,500: one point (ID 1), with one 0 and no 1s
- Income ≥ $13,500: two points (IDs 2 and 3), with one 0 and one 1

Now we calculate the probabilities of each outcome:

- Left box: P(0) = 1, P(1) = 0
- Right box: P(0) = 1/2, P(1) = 1/2
We will analyze this properly soon, but let's redo this analysis for our second candidate split point, $20,000:

- Income < $20,000: two points (IDs 1 and 2), with two 0s and no 1s
- Income ≥ $20,000: one point (ID 3), with no 0s and one 1

This is called a "Pure Split Point": each box has either all 0s or all 1s. Our probabilities are:

- Left box: P(0) = 1, P(1) = 0
- Right box: P(0) = 0, P(1) = 1
This seems like a better choice! But let’s quantify this:
There are three main methods of choosing the best split point, and unsurprisingly they all involve statistical measures of dispersion:
- Gini (named after a person, not an acronym!)
- Entropy
- Chi-square measure of independence
GINI
In the Gini method, we find the Gini index for each split point and then compare. The Gini index is a type of weighted average of the probabilities that we calculated above. Here, I'm going to use the word "boxes" to describe each of the end points: being less than or greater than the split point. In our simple example we have two boxes. I will use "classes" to describe the outcomes that are possible: 0 or 1 in our case. It's easier to look at classification that uses only 2 categories or classes, but we can extend to more classes. We will also use $n$ as the total number of data points. Here it is 3!
For a single box, we calculate this via the equation:

$G_{\text{box}} = 1 - \sum_{k=1}^{K} p_k^2$

where $p_k$ is the fraction of points in the box that belong to class $k$, and $K$ is the number of classes (2 for us). This is easier to calculate than the notation suggests!
Recall the counts for the $13,500 split: the left box (Income < $13,500) holds one 0, and the right box (Income ≥ $13,500) holds one 0 and one 1.
We will calculate the Gini number for each part:

$G_{\text{left}} = 1 - 1^2 - 0^2 = 0$

and:

$G_{\text{right}} = 1 - (1/2)^2 - (1/2)^2 = 0.5$

0 is the lowest a Gini value can be (we call such a box pure), and 0.5 is the worst case scenario for two classes. The Index is the size-weighted average over the boxes:

$\text{Gini} = \frac{n_{\text{left}}}{n} G_{\text{left}} + \frac{n_{\text{right}}}{n} G_{\text{right}} = \frac{1}{3}(0) + \frac{2}{3}(0.5)$

Or:

$\text{Gini} = \frac{1}{3} \approx 0.333$
Now we have the fun of repeating this for our second split point, $20,000. This will be easier, as we have a pure split:

$G_{\text{left}} = 1 - 1^2 - 0^2 = 0$

and

$G_{\text{right}} = 1 - 0^2 - 1^2 = 0$

Which gives us:

$\text{Gini} = \frac{2}{3}(0) + \frac{1}{3}(0) = 0$
Hence, when we minimize the Gini Index, we choose $20,000 as our split point, and our algorithm now looks like this:

    Is Income < $20,000 ?
      yes -> predict 0 (NO)
      no  -> predict 1 (YES)
Of course, doing this by hand is time consuming! If we had more than three points, we would really want a computer to do it.
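Here is a minimal sketch of that computation in Python (the function and variable names are my own); it reproduces the 0.333 and 0 we found by hand:

```python
from collections import Counter

incomes = [12_000, 15_000, 25_000]
purchases = [0, 0, 1]

def gini(labels):
    """Gini value of one box: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(split, xs, ys):
    """Size-weighted average of the Gini values of the two boxes."""
    left = [y for x, y in zip(xs, ys) if x < split]
    right = [y for x, y in zip(xs, ys) if x >= split]
    return len(left) / len(ys) * gini(left) + len(right) / len(ys) * gini(right)

# Candidate split points: midpoints between consecutive sorted incomes.
sorted_incomes = sorted(incomes)
candidates = [(a + b) / 2 for a, b in zip(sorted_incomes, sorted_incomes[1:])]

for s in candidates:
    print(f"split at ${s:,.0f}: Gini = {weighted_gini(s, incomes, purchases):.3f}")
# split at $13,500: Gini = 0.333
# split at $20,000: Gini = 0.000
```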
Entropy
Calculating these by hand is not a fun task, but let's briefly look at a different measure, entropy, which we are again trying to minimize. The entropy of a box is

$H_{\text{box}} = -\sum_{k=1}^{K} p_k \log_2 p_k$

(taking $0 \log_2 0 = 0$), and we weight the boxes by size exactly as we did for Gini. A pure box has entropy 0, and a 50/50 box has entropy 1. For our example, the $13,500 split gives $\frac{1}{3}(0) + \frac{2}{3}(1) \approx 0.667$, while the pure split at $20,000 gives 0, so entropy agrees with Gini here.
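As a sketch, the box measure from the script above could be swapped for entropy like this (again, the names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of one box: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy([0]))     # 0.0 (pure box)
print(entropy([0, 1]))  # 1.0 (50/50 box)
```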
Decisions / hyperparameters
- Should we look at all features at each split, or just a random subset? (See the sketch after this list.)
- How many branches should each split point have?
- How deep do we want the tree to grow?
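In practice, these decisions are usually exposed as hyperparameters by a library. For instance, scikit-learn's DecisionTreeClassifier covers all three; the parameter values below are just illustrative, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

incomes = [[12_000], [15_000], [25_000]]  # scikit-learn expects a 2-D feature array
purchases = [0, 0, 1]

tree = DecisionTreeClassifier(
    criterion="gini",   # the dispersion measure; "entropy" is the other common choice
    max_features=None,  # None = consider all features; "sqrt" samples a random subset
    max_depth=3,        # how deep the tree may grow
    random_state=0,
)
tree.fit(incomes, purchases)
print(tree.predict([[18_000]]))  # [0], since 18,000 falls below the $20,000 split
```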