Here, we will discuss two decision tree algorithms, mainly including ID3 and C4.5 algorithm. Firstly, let's define decision tree learning.
At the begining
Decision Tree Learing is the construction of a decision tree from class-labeled training tuples.
A decision tree is a flow-chart like strcuture, where each internal node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf node holds a class label. The topmost node in a tree is the root node.
--- From wiki
ID3 algorithm
Next, we now can take a look at the ID3 algorithm. ID3(Iterative Dichotomiser 3) ,which based on Ockham's Razor, is an algorithm invented by Ross Quinlan.
Ockham's Razor is a problem-solving principle attributed to William of Ockham, and was later used in many aspects of science research. To conclude this theory in short, simple theories are preferable to more complex ones becuase they are more testable. When we mean testable, it means how well our theories are performing on other new datasets.
And the intuition behind this algorithm is this Ockham's Razor, so that we always prefer the smaller decision trees over the larger ones. Also, in Information Theory, the less expected information the dataset has, the larger Information Gain is.
ID3 algorithm's core idea is to use Information Gain to determine whether a label should be choose, and choose the one that maximize the Information Gain after splitting the label.
Now, let's take a look at the example that follows.
We can tell that there is a total of 14 examples, which contains 9 positive examples and 5 negative ones. And we can calculate the entropy of current information
Suppose we look at the label outlook
to classify the output,
Then we can see that the dataset is divided into three parts, and each branch's Information Entropy is
And the Information Entropy after splitting can be calculated as
Then the final Information Gain after splitting with the label T is
Before splitting at each non-leaf node of the decision tree, we should first calculate the Information Gain each label may bring, and then choose the label that can maximize the Information Gain. As the larger Information Gain there is, the greater it can classify the data sample. This top-down greedy criterion is the core idea of ID3.
Summary of ID3
- Calculate the entropy of every attribute using the data set S
- Split the set S into subsets using the attribute for which the resulting entropy is minimum (or, information gain is maximum)
- Make a decision tree node containing thta attribute
- Recurse on subsets using remaining attributes.