Ch01. Ensemble Learning: Hype or Hallelujah?

This chapter covers

  • Defining and framing the ensemble learning problem
  • Motivating the need for ensembles in different applications
  • Understanding how ensembles handle fit vs. complexity
  • Implementing our first ensemble with ensemble diversity and model aggregation

“During the nearly 3 years of the Netflix competition, there were two main factors which improved the overall accuracy: the quality of the individual algorithms and the ensemble idea.
…the ensemble idea was part of the competition from the beginning and evolved over time. In the beginning, we used different models with different parametrization and a linear blending.
…[Eventually] the linear blend was replaced by a nonlinear one...”

1.1 Ensemble Methods: The Wisdom of the Crowds

The diagnostic procedure followed by Dr. Randy Forrest every time he gets a new case is to get opinions from his residents. His residents offer their diagnoses: either that the patient has cancer or has no cancer. Dr. Forrest then selects the majority answer as the final diagnosis put forth by his team.

Figure 1.1 Dr. Forrest embodies a diagnostic ensemble

Why? Because he knows that his residents are pretty smart, and it is unlikely that a large number of pretty smart residents will all make the same mistake. Here, Dr. Forrest relies on the power of model aggregation, or model averaging: he knows that the average answer is most likely going to be a good one.

The secrets to his success, and indeed the success of ensemble methods as well, are:

  • ensemble diversity, so that he has a variety of opinions to choose from, and
  • model aggregation, so that he can combine them into a single final opinion.

“If you ask a large enough group of diverse and independent people to make a prediction or estimate a probability, the average of those answers will cancel out errors in individual estimation. Each person's guess, you might say, has two components: information and errors. Subtract the errors, and you're left with the information.”

An ensemble method is a machine-learning algorithm that aims to improve predictive performance on a task by aggregating the predictions of multiple estimators or models. In this manner, an ensemble method learns a meta-estimator.

The key to success with ensemble methods is ensemble diversity. Informally, ensemble diversity refers to the fact that individual ensemble components, or machine-learning models, are different from each other.

Training such ensembles of diverse individual models is a key challenge in ensemble learning, and different approaches achieve this in different ways.
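
As a toy illustration of this error-canceling effect (this simulation is not part of the chapter's code; the numbers are made up), we can average many independent, noisy estimates of the same quantity and compare the error of the average with the typical individual error:

import numpy as np


rng = np.random.default_rng(0)

true_value = 10.0      # the quantity every "resident" is trying to estimate
n_estimators = 25      # size of the crowd
# each estimate = information (the true value) + an independent error
estimates = true_value + rng.normal(0.0, 2.0, size=n_estimators)

avg_individual_error = np.abs(estimates - true_value).mean()
crowd_error = np.abs(estimates.mean() - true_value)

print(f'average individual error: {avg_individual_error:.3f}')
print(f'error of the averaged estimate: {crowd_error:.3f}')

The averaged estimate typically lands much closer to the true value than an individual estimate does, which is exactly the behavior ensembles exploit.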

1.2 Why You Should Care About Ensemble Learning

One palpable success of ensemble methods is their domination of data science competitions (alongside deep learning), where they have been generally successful on different types of machine-learning tasks and application areas.

Indeed, the most popular way to tackle data science competitions these days is to combine feature engineering with ensemble methods.

Structured data is generally highly organized in tables, relational databases, and other formats most of us are familiar with, and it is the type of data on which ensemble methods have proven to be very successful.

Unstructured data, in contrast, does not always have a tabular structure. Image, audio, video, waveform, and text data are typically unstructured, and these are the domains where deep learning approaches -- including automated feature generation -- have been applied with great success.

That said, ensemble methods can be combined with deep learning for unstructured problems as well.

Beyond competitions, ensemble methods drive data science in several areas including financial and business analytics, medicine and healthcare, cybersecurity, education, manufacturing, recommendation systems, entertainment and many more.

Figure 1.2 Which machine learning algorithm should I use for my data set?

1.3 Fit vs. Complexity in Individual Models

Machine learning tasks are typically:

  • supervised learning tasks, with a data set of labeled examples, where data has been annotated.
  • unsupervised learning tasks, with a data set of unlabeled examples, where the data lacks annotations.

Let’s say that we’re looking at the Boston Housing data set, which describes the median value of owner-occupied homes in 506 U.S. census tracts in the Boston area. The machine-learning task is to learn a regression model to predict the median home value in a census tract using different variables.

import pandas as pd
from sklearn.datasets import load_boston


# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this snippet requires an older scikit-learn version.
boston = load_boston()

df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
df['price'] = boston['target']
df.head()

Figure 1.3 Each row of this table is a training example, characterized by 13 features and a label (price).

Standardize the features and the labels to be zero mean and unit standard deviation.

from sklearn.preprocessing import StandardScaler


X, y = load_boston(return_X_y=True)

X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(-1, 1))

1.3.1 Regression with Decision Trees

A decision tree is made up of decision nodes and leaf nodes. Each decision node tests the current example for a specific condition and funnels it down the right or the left branch based on the answer.

Figure 1.4 Decision trees partition the feature space into axis-parallel rectangles.
  • A decision tree of depth 1 is called a decision stump and is the simplest possible tree.
  • A shallow decision tree (say, depth 2 or 3) will have a small number of decision nodes and leaf nodes and is a simple model. Consequently, it is only able to represent simple functions (the short sketch after this list prints one such shallow tree).
  • A deeper decision tree will have many more decision nodes and leaf nodes and is a more complex model. A deeper decision tree, thus, will be able to represent richer and more complex functions.
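
The following is a small illustrative sketch (not part of the chapter's experiment) that fits a depth-2 tree on the standardized Boston data loaded above and prints its decision nodes and leaves with scikit-learn's export_text:

from sklearn.tree import DecisionTreeRegressor, export_text


# fit a shallow (depth-2) tree on the standardized features and labels from above
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
shallow_tree.fit(X, y)

# each indented line is a decision node testing one feature; "value" lines are leaf nodes
print(export_text(shallow_tree, feature_names=list(boston['feature_names'])))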

Fit vs. Complexity in Decision Trees

The pseudo-code for our experiment is shown below:

for run = 1:5,
  (Xtrn, ytrn), (Xtst, ytst) = split data (X), labels (y) into training & test subsets randomly

  for d = 1:10,
    tree[d] = train a decision tree of depth d on the training subset (Xtrn, ytrn)
    train_scores[run, d] = compute R² score of tree[d] on the training set (Xtrn, ytrn)
    test_scores[run, d] = compute R² score of tree[d] on the test set (Xtst, ytst)

mean_train_score = average train_scores across runs
mean_test_score = average test_scores across runs
from sklearn.model_selection import ShuffleSplit


# set up 5 different random splits of the data into train and test sets
subsets = ShuffleSplit(n_splits=5, test_size=0.33, random_state=42)

For reference, the full signature is sklearn.model_selection.ShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None).

TIP: During modeling, we often have to split the data into a training and a test set. How big should these sets be? If the fraction of the data that makes up the training set is too small, the model will not have enough data to learn from. If the fraction of the data that makes up the test set is too small, there will be higher variation in our estimates of how well the model generalizes to future data. A good rule of thumb (known as the Pareto principle) is to start with an 80%-20% train-test split, as in the snippet below.
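
For instance, assuming the standardized X and y from above, an 80%-20% split with scikit-learn looks like this (this snippet only illustrates the tip; the chapter's experiments use ShuffleSplit instead):

from sklearn.model_selection import train_test_split


# hold out 20% of the examples as a test set, following the 80-20 rule of thumb
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)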

from sklearn.tree import DecisionTreeRegressor


model = DecisionTreeRegressor()
from sklearn.model_selection import validation_curve


# for each split, train decision trees of depths from 1 to 10 
# and then evaluate on the test set
trn_scores, tst_scores = validation_curve(model,
                                          X,
                                          y,
                                          param_name='max_depth',
                                          param_range=range(1, 11),
                                          cv=subsets,
                                          scoring='r2')

Coefficient of Determination

  • The coefficient of determination (R²) is a measure of regression performance.
  • R² is the proportion of variance in the true labels that is predictable from the features.
  • R² depends on two quantities: (1) the total variance in the true labels, or total sum of squares (TSS), and (2) the mean squared error, or residual sum of squares (RSS), between the true and predicted labels. We have R² = 1 − RSS/TSS (the short snippet after this list checks this formula).
  • A perfect model has zero prediction error, or RSS = 0, and its corresponding R² = 1.
  • Really good models have R² values close to 1.
  • A really bad model has high prediction error and high RSS. This means that for really bad models, R² can even be negative.
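
As a quick, illustrative check (not part of the chapter's experiment; the labels are made up), computing 1 − RSS/TSS by hand matches scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score


# a tiny made-up example of true and predicted labels
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

rss = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares
tss = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - rss / tss

print(r2_manual, r2_score(y_true_demo, y_pred_demo))    # the two values agree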
import numpy as np


mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
import matplotlib.cm as cm
import matplotlib.colors as mcolors


def get_color(colormap='viridis', n_colors=2, bounds=(0, 1)):
    # sample n_colors evenly spaced colors from the colormap and convert them to hex strings
    cmap = cm.get_cmap(colormap)
    colors_rgb = cmap(np.linspace(bounds[0], bounds[1], num=n_colors))
    colors_hex = [mcolors.rgb2hex(c) for c in colors_rgb]

    return colors_hex

col = get_color(colormap='RdBu')
%matplotlib inline
import matplotlib.pyplot as plt


fig = plt.figure()
plt.plot(range(1, 11),
         mean_train_score,
         linewidth=3,
         color=col[0],
         marker='o',
         markersize=8)
plt.plot(range(1, 11),
         mean_test_score,
         linewidth=3,
         color=col[1],
         marker='s',
         markersize=8)
plt.legend(['training score', 'test score'],
           loc='lower center',
           ncol=2,
           fontsize=12)
plt.xlabel('Decision Tree Complexity, max_depth', fontsize=16)
plt.ylabel('$R^2$ coefficient', fontsize=16)
plt.xticks(range(1, 11))
plt.title('Decision Tree Regression', fontsize=16)
fig.tight_layout()
Figure 1.5 Comparing decision trees of different depths on the Boston Housing regression data set using R² as the evaluation metric.

As decision trees become deeper, they become more complex and achieve lower training errors. However, their ability to generalize to future data (estimated by the test scores) does not keep improving. This is a rather counterintuitive result: the model with the best fit on the training set is not necessarily the best model for predictions when deployed in the real world.

1.3.2 Regression with support vector machines

SVM training tries to find a model that minimizes an objective function made up of two components: a regularization term and a loss term.

The regularization term measures the flatness of the model: the more it is minimized, the more linear and less complex the learned model is.

The loss term measures the fit to the training data through a loss function (typically, mean squared error): the more it is minimized, the better the fit to the training data. The regularization parameter C trades off between these two competing objectives:

  • a small value of C means the model will focus more on regularization and simplicity and less on training error, which causes the model to have higher training error and underfit;
  • a large value of C means the model will focus more on training error and learn more complex models, which causes the model to have lower training error and possibly overfit.

CAUTION: SVMs identify “support vectors”, a smaller working set of training examples that the model depends on. Counting the number of support vectors is not an effective measure of model complexity, because small values of C restrict the model more, forcing it to use more support vectors in the final model.
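
The following is a small illustrative check of this effect (the synthetic data and parameter values here are my own choices, not the chapter's): a heavily regularized SVR (small C) ends up relying on more support vectors than a weakly regularized one.

import numpy as np
from sklearn.svm import SVR


rng = np.random.default_rng(1)
X_demo = np.sort(rng.uniform(-3.0, 3.0, size=(80, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.normal(size=80)

for C in (0.01, 100.0):
    svr = SVR(C=C, kernel='rbf', gamma=0.5).fit(X_demo, y_demo)
    # support_ holds the indices of the training examples kept as support vectors
    print(f'C = {C:g}: {len(svr.support_)} support vectors')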

Fit vs. Complexity in Support Vector Machines

SVMs aim to minimize an objective function of the form

objective function = complexity(model) + C * loss(model, data).

As C increases, the loss term becomes more dominant, forcing the SVM to minimize the loss and improve the fit. As it does so however, for larger values of C, the complexity term is increasingly ignored and the model becomes more complex.

This behavior is visualized below for a simple 1-d regression problem where the (synthetic) data is generated from the true function y = sin(x) / x:

n_syn = 100
X_syn = np.linspace(-10.0, 10.0, n_syn).reshape(-1, 1)
y_true = np.sin(X_syn) / X_syn
y_true = y_true.ravel()
y_syn = y_true + 0.125 * np.random.normal(0.0, 1.0, y_true.shape)
# y_syn and y_true both have shape (100,)
# Add one very noisy point to illustrate (exaggeratedly), 
# the impact of overfitting
y_syn[-1] = -0.5  
from sklearn.svm import SVR
from sklearn.metrics import r2_score

fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))

for k, C in enumerate(10.0**np.arange(-3, 3)):
    # Find the correct axis row and column and
    # plot the noisy data and the true function
    i, j = np.divmod(k, 3)
    ax[i, j].scatter(X_syn[:, 0], y_syn, edgecolors='k', alpha=0.5)
    ax[i, j].plot(X_syn[:, 0],
                  y_true,
                  linewidth=1,
                  linestyle='--',
                  label='true')

    # Learn an SVM model for this value of C
    model = SVR(C=C, kernel='rbf', gamma=0.75)
    model.fit(X_syn, y_syn)
    y_pred = model.predict(X_syn)

    # Plot the learned SVM model for this value of C
    ax[i, j].plot(X_syn[:, 0],
                  y_pred,
                  linewidth=3,
                  linestyle='-',
                  label='learned')

    # Finish up the plots
    trn_score = r2_score(y_syn, y_pred)
    ax[i, j].set_title(
        f'C=$10^{{{int(np.log10(C))}}}$, trn score = {trn_score:3.2f}')

    # Put legend on one plot
    if k == 0:
        handles, labels = ax[i, j].get_legend_handles_labels()
        ax[i, j].legend(handles, labels, loc='upper left', fontsize=12)

fig.tight_layout()
Figure 1.6 Small values of C result in more linear (flatter) models, while large values of C result in more nonlinear and curvy models.

As C increases, the model moves from underfit to "good" fit. However, as C keeps increasing, the fit ultimately plateaus, though the model continues to become more nonlinear and complex. This increasing complexity makes it start deviating from the true underlying function and leads to overfitting, which ultimately hurts generalization on future data points.

Now we return to the Boston Housing data set and repeat the same experiment as we did with decision trees.

Perform 5 runs of the following:

  • Use the same subsets from the previous experiment with decision trees
  • Fit (train) SVRs with different C values (10⁻², 10⁻¹, ..., 10³, 10⁴) on the training set
  • Evaluate each of the SVRs on both the training set (to get the training score) and the test set (to get the test score) using R² as the scoring metric
model = SVR(degree=3)  # note: with the default RBF kernel, the degree parameter is ignored
trn_scores, tst_scores = validation_curve(model,
                                          X,
                                          y.ravel(),
                                          param_name='C',
                                          param_range=np.logspace(-2, 4, 7),
                                          cv=subsets,
                                          scoring='r2')

mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
fig = plt.figure()
plt.semilogx(np.logspace(-2, 4, 7),
             mean_train_score,
             linewidth=3,
             color=col[0],
             marker='o',
             markersize=8)
plt.semilogx(np.logspace(-2, 4, 7),
             mean_test_score,
             linewidth=3,
             color=col[1],
             marker='s',
             markersize=8)
plt.legend(['training score', 'test score'],
           loc='lower center',
           ncol=2,
           fontsize=12)
plt.xlabel('Regularization Parameter, C', fontsize=16)
plt.ylabel('$R^2$ coefficient', fontsize=16)
plt.title('Support Vector Regression', fontsize=16)
fig.tight_layout()
Figure 1.7 Comparing SVM regressors of different complexities on the Boston Housing data set using R² as the evaluation metric.

In fact, every machine-learning algorithm exhibits this behavior:

  • overly simple models tend to not fit the training data properly, and tend to generalize poorly on future data; a model that is performing poorly on training and test data is underfitting;
  • overly complex models can achieve very low training errors but tend to generalize poorly on future data too; a model that is performing very well on training data, but poorly on test data is overfitting;
  • the best models trade off between complexity and fit, sacrificing a little of each during training so that they can generalize most effectively when deployed.

THE BIAS-VARIANCE TRADEOFF
What we have informally seen above as the fit vs. complexity tradeoff is more formally known as the bias-variance tradeoff.
The bias (error) of a model is the error arising from the impact of modeling assumptions (such as a preference for simpler models). The variance of a model is the error arising from sensitivity to small variations in the data set.
Highly complex models (low bias) will overfit the data and be more sensitive to noise (high variance), while simpler models (high bias) will underfit the data and be less sensitive to noise (low variance). This trade-off is inherent in every machine-learning algorithm. Ensemble methods seek to overcome this issue by combining several low-bias models to reduce their variance or combining several low-variance models to reduce their bias.
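
One way to see this tradeoff numerically (an illustrative simulation of my own, not from the chapter) is to refit a very shallow and a very deep decision tree on many noisy resamples of the same underlying function and measure how much their average prediction is biased versus how much their predictions vary across resamples:

import numpy as np
from sklearn.tree import DecisionTreeRegressor


rng = np.random.default_rng(0)
X_grid = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)   # fixed points for comparing predictions
f_true = np.sin(X_grid).ravel()                       # the true underlying function

def predictions_over_resamples(max_depth, n_repeats=50):
    preds = np.zeros((n_repeats, len(X_grid)))
    for r in range(n_repeats):
        # a fresh, noisy sample of the same underlying function
        X_sample = rng.uniform(-3.0, 3.0, size=(100, 1))
        y_sample = np.sin(X_sample).ravel() + 0.3 * rng.normal(size=100)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_sample, y_sample)
        preds[r] = tree.predict(X_grid)
    return preds

for depth in (1, 12):
    preds = predictions_over_resamples(depth)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # squared bias of the average prediction
    variance = preds.var(axis=0).mean()                    # variance across resamples
    print(f'max_depth={depth}: squared bias ~ {bias_sq:.3f}, variance ~ {variance:.3f}')

The stump (max_depth=1) shows high bias and low variance, while the very deep tree shows the opposite, which is the tradeoff described above.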

Figure 1.8 Predictions using model averaging, illustrated.

1.4 Our First Ensemble

Recall from the allegory of Dr. Forrest that an effective ensemble performs model aggregation on a set of diverse component models. Here:

  1. We train a set of diverse base estimators (also known as base learners) by using different base learning algorithms on the same data set. That is, we count on the significant variations in how the different learning algorithms work to produce a diverse set of base estimators.
  2. For a regression problem (such as the Boston Housing data), the predictions of individual base estimators are continuous. We can aggregate the results into one final ensemble prediction by simple averaging of the individual predictions.
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor


# initialize hyperparameters of each individual base estimator
estimators = {
    'krr': KernelRidge(kernel='rbf', gamma=0.25),
    'svr': SVR(gamma=0.5),
    'dtr': DecisionTreeRegressor(max_depth=8),
    'knn': KNeighborsRegressor(n_neighbors=3),
    'gpr': GaussianProcessRegressor(alpha=1e-1),
    'mlp': MLPRegressor(alpha=25, max_iter=1000)
}

# create a training/test split; the exact split fraction and seed here are
# illustrative choices (any reasonable split of the data works for this example)
from sklearn.model_selection import train_test_split
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y.ravel(), test_size=0.25, random_state=42)

for name, estimator in estimators.items():
    # train each individual base estimator on the same training set
    estimator.fit(Xtrn, ytrn)
import numpy as np


n_estimators, n_samples = len(estimators), Xtst.shape[0]
y_individual = np.zeros((n_samples, n_estimators))
# initialize individual predictions
for i, (model, estimator) in enumerate(estimators.items()):
    # individual predictions using the base estimators
    y_individual[:, i] = estimator.predict(Xtst)
# aggregate (average) individual predictions
y_final = np.mean(y_individual, axis=1)
from itertools import combinations
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt


models = list(estimators.keys())
combo_mean = np.zeros((len(estimators), ))
combo_std = np.zeros((len(estimators), ))

fig = plt.figure(figsize=(10, 10))

for n_ensemble in range(len(estimators)):
    # Get all possible combinations of models of length n_ensemble
    combos = combinations(estimators, n_ensemble + 1)

    # Get the average of individual predictions for each combination
    averaged_predictions = [
        np.mean(np.array([y_individual[:, models.index(e)] for e in list(c)]),
                axis=0) for c in combos
    ]
    averaged_r2 = [r2_score(ytst, ypred) for ypred in averaged_predictions]

    n_combos = len(averaged_r2)

    plt.scatter(np.full((n_combos, ), n_ensemble + 1),
                averaged_r2,
                color='steelblue',
                alpha=0.5)
    combo_mean[n_ensemble] = np.mean(averaged_r2)
    combo_std[n_ensemble] = np.std(averaged_r2)

    if n_ensemble == 0:
        for r, name in zip(averaged_r2, estimators):
            plt.text(1.05, r, name, fontsize=8, verticalalignment='center')

plt.xlabel('Number of Models Ensembled', fontsize=14)
plt.ylabel('Coefficient of Determination, $R^2$', fontsize=14)
fig.tight_layout()
Figure 1.9 Prediction performance vs. ensemble size.
fig = plt.figure()
plt.fill_between(np.arange(1,
                           len(estimators) + 1),
                 combo_mean - combo_std,
                 combo_mean + combo_std,
                 color='orange',
                 alpha=0.25,
                 linewidth=0)
plt.plot(np.arange(1,
                   len(estimators) + 1),
         combo_mean,
         marker='o',
         color='orange',
         markersize=8,
         markeredgecolor='k',
         linewidth=3)
plt.xlabel('Number of Models Ensembled', fontsize=14)
plt.ylabel('Coefficient of Determination, $R^2$', fontsize=14)
fig.tight_layout()
Figure 1.10 The mean performance of the ensemble combinations increases, showing that bigger ensembles perform better.

As ensemble size increases, the variance of the ensemble decreases! This is a consequence of model aggregation or averaging. We know that averaging “smooths out the rough edges”. In the case of our ensemble, averaging individual predictions smooths out mistakes made by individual base estimators, replacing them instead with the wisdom of the ensemble: from many, one. The overall ensemble is more robust to mistakes, and unsurprisingly, generalizes better than any single base estimator.

1.5 Summary

  • Ensemble learning aims to improve predictive performance by combining multiple models into a meta-estimator. The component models of an ensemble are called base estimators or base learners.
  • Ensemble methods leverage the power of “the wisdom of crowds”, which relies on the principle that the collective opinion of a group is more effective than any single individual in the group.
  • Ensemble methods are widely used in several application areas including financial and business analytics, medicine and healthcare, cybersecurity, education, manufacturing, recommendation systems, entertainment and many more.
  • Most machine-learning algorithms contend with a fit vs. complexity (also called bias-variance) tradeoff, which affects their ability to generalize well to future data. Ensemble methods use multiple models to overcome this tradeoff.
  • An effective ensemble requires two key ingredients: (1) ensemble diversity and (2) model aggregation for the final predictions.