In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions.
In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy.
Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation. In other words, there is a trade-off between a model's interpretability and its predictive accuracy.
The Basics of Decision Trees
Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.
Regression Trees
Prediction via Stratification of the Feature Space
We now discuss the process of building a regression tree. Roughly speaking, there are two steps.
1. We divide the predictor space (that is, the set of possible values for X1, X2, ..., Xp) into J distinct and non-overlapping regions, R1, R2, ..., RJ.
2. For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.
We now elaborate on Step 1 above. How do we construct the regions R1, ..., RJ? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes R1, ..., RJ that minimize the RSS, given by

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.
Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree, where all observations belong to a single region, and then successively splits the predictor space; it is greedy because at each step the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree at some future step.
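To make the greedy step concrete, here is a minimal sketch in Python (not the book's code) of finding the single split that most reduces the RSS. The helper name `best_split`, the toy data, and the choice of midpoints between sorted predictor values as candidate cutpoints are all illustrative assumptions.

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its mean."""
    return float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    """Greedy step of recursive binary splitting: find the predictor j and
    cutpoint s minimizing RSS(left region) + RSS(right region)."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive sorted values,
        # so both resulting regions are guaranteed to be non-empty.
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            total = rss(y[left]) + rss(y[~left])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

# Toy data with an obvious break in the first predictor at x = 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 5.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # expect feature 0 and a cutpoint near 0.5
```

Recursive binary splitting then repeats this search within each of the two resulting regions until a stopping criterion is reached.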
Tree Pruning
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex.
A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation at the cost of a little bias. This is another instance of trading a small increase in bias for a reduction in variance.
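As a concrete illustration, the sketch below uses scikit-learn rather than the book's R labs: grow a large regression tree, then select the cost-complexity pruning penalty (`ccp_alpha`) by cross-validation. The simulated data and the grid of candidate penalties are assumptions made for the example.

```python
# A sketch of cost-complexity pruning: grow a large tree, then use
# cross-validation to choose how aggressively to prune it back.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Candidate penalties come from the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X, y)
print("chosen alpha:", search.best_params_["ccp_alpha"])
print("number of terminal nodes:", search.best_estimator_.get_n_leaves())
```

Larger values of the penalty prune more of the tree, giving fewer terminal nodes and hence fewer regions.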
Classification Trees
Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

$$E = 1 - \max_k \left( \hat{p}_{mk} \right).$$
Here $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that are from the kth class. However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable. The Gini index is defined by

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right),$$
a measure of total variance across the K classes.
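Both measures are easy to compute from a region's class proportions. The short sketch below (the helper `node_impurities` and the toy label vectors are illustrative) evaluates each for a couple of example regions.

```python
import numpy as np

def node_impurities(labels):
    """Return (classification error, Gini index) for one region's training labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # p_hat_mk for each class k
    error = 1.0 - p.max()            # E = 1 - max_k p_hat_mk
    gini = np.sum(p * (1.0 - p))     # G = sum_k p_hat_mk * (1 - p_hat_mk)
    return error, gini

# Two example regions: one fairly pure, one evenly mixed.
print(node_impurities(np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])))
print(node_impurities(np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])))
```

The Gini index takes on a small value when a region is nearly pure, which is why it (or the closely related entropy) is preferred for evaluating candidate splits while growing the tree.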
Advantages and Disadvantages of Trees
Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
Trees can easily handle qualitative predictors without the need to create dummy variables.
Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.
Additionally, trees can be very non-robust: a small change in the data, such as a few anomalous observations, can cause a large change in the final estimated tree.
However, by aggregating many decision trees, using methods like bagging , random forests , and boosting , the predictive performance of trees can be substantially improved. We introduce these concepts in the next section.
Bagging, Random Forests, Boosting
Bagging, random forests, and boosting use trees as building blocks to construct more powerful prediction models.
Bagging
The bootstrap, introduced in Chapter 5, is an extremely powerful idea. It is used in many situations in which it is hard or even impossible to directly compute the standard deviation of a quantity of interest. We see here that the bootstrap can be used in a completely different context, in order to improve statistical learning methods such as decision trees.
The decision trees discussed in Section 8.1 suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance, if the ratio of n to p is moderately large.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.
Recall that given a set of n independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$.
In other words, averaging a set of observations reduces variance. Hence a natural way to reduce the variance of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
That is, we could calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using B separate training sets, and average them in order to obtain a single low-variance statistical learning model.
Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$
This is called bagging.
While bagging can improve predictions for many regression methods, it is particularly useful for decision trees. To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions.
These trees are grown deep and are not pruned, so each individual tree has high variance but low bias. Averaging these B trees reduces the variance. Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure.
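As a rough illustration of the procedure, the sketch below bags regression trees by hand, using scikit-learn only for the individual trees; the simulated data, B = 100, and the choice to evaluate on the training set (just to show the mechanics) are assumptions of this example.

```python
# Bagging regression trees: fit each deep, unpruned tree to a bootstrap
# sample of the training data, then average the B predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
B = 100

predictions = np.zeros((B, X.shape[0]))
for b in range(B):
    # Draw a bootstrap sample of the n training observations (with replacement).
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    tree = DecisionTreeRegressor()      # grown deep, not pruned
    tree.fit(X[idx], y[idx])
    predictions[b] = tree.predict(X)

f_bag = predictions.mean(axis=0)        # average of the B trees
print("MSE of bagged trees:", np.mean((y - f_bag) ** 2))
```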
Random Forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.
In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. This may sound crazy, but it has a clever rationale. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.
Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
The main difference between bagging and random forests is the choice of predictor subset size m. For instance, if a random forest is built using m = p, then this amounts simply to bagging.
Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors.
As with bagging, random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down.
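Here is a hedged sketch of the same idea using scikit-learn's RandomForestRegressor (the book's own labs use R's randomForest package): the `max_features` argument plays the role of m, and setting it to use all p predictors recovers bagging. The simulated data and B = 500 are choices made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=500,     # B trees
                               max_features="sqrt",  # m ~ sqrt(p) at each split
                               random_state=0)
bagged = RandomForestRegressor(n_estimators=500,
                               max_features=1.0,     # m = p: plain bagging
                               random_state=0)

for name, model in [("random forest", forest), ("bagging", bagged)]:
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, "CV MSE:", -score.mean())
```

When many predictors are strong and correlated, the decorrelated forest (small m) will typically show the larger improvement over bagging.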
Boosting
Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification; both are general-purpose procedures rather than tree-specific ones. Here we restrict our discussion of boosting to the context of decision trees.
Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.
Notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm.
By fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals. In general, statistical learning approaches that learn slowly tend to perform well. Note that in boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.
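To make the slow-learning recipe concrete, here is a sketch that boosts regression trees by hand: start from a fit of zero, repeatedly fit a small tree to the current residuals, and add a shrunken version of it to the model. scikit-learn is used only for the individual trees, and the particular values of B, lam (λ), and d below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

B, lam, d = 1000, 0.01, 1              # number of trees, shrinkage, splits per tree
f_hat = np.zeros_like(y, dtype=float)  # start from f_hat(x) = 0
residuals = y.astype(float).copy()     # so the initial residuals are just y
trees = []

for b in range(B):
    # A tree with d splits has d + 1 terminal nodes; d = 1 gives a stump.
    # It is fit to the current residuals, not to the outcome y.
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)
    tree.fit(X, residuals)
    update = lam * tree.predict(X)
    f_hat += update                    # add a shrunken version of the new tree
    residuals -= update                # update the residuals
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - f_hat) ** 2))
```

The final boosted model is the sum of the shrunken trees, which is why each individual tree can afford to be so small.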
We have just described the process of boosting regression trees. Boosting classification trees proceeds in a similar but slightly more complex way, and the details are omitted here.
Boosting has three tuning parameters:
1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small λ can require using a very large value of B in order to achieve good performance.
3. The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally, d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
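In practice one would usually rely on a library implementation rather than the hand-rolled loop above. As a sketch (not the book's code, whose labs use R's gbm package), scikit-learn's GradientBoostingRegressor exposes the three tuning parameters roughly as n_estimators (B), learning_rate (λ), and max_depth, which coincides with d when d = 1 since depth-one trees are stumps.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01,    # lambda: shrinkage
                              max_depth=1,           # d = 1: stumps, additive model
                              random_state=0),
    param_grid={"n_estimators": [500, 1000, 2000]},  # B, chosen by cross-validation
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("selected B:", search.best_params_["n_estimators"])
```

Consistent with the discussion above, a smaller learning_rate generally calls for a larger number of trees to reach the same level of performance.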