决策树基本概念及算法优缺点

1. 什么是决策树

分类决策树模型是一种描述对实例进行分类的树形结构. 决策树由结点和有向边组成. 结点有两种类型: 内部结点和叶节点. 内部节点表示一个特征或属性, 叶节点表示一个类.
决策树(Decision Tree),又称为判定树, 是一种以树结构(包括二叉树和多叉树)形式表达的预测分析模型.

通过把实例从根节点排列到某个叶子节点来分类实例
叶子节点为实例所属的分类
树上每个节点说明了对实例的某个属性的测试, 节点的每个后继分支对应于该属性的一个可能值

2.决策树结构

决策树结构.png

3.决策树种类

分类树--对离散变量做决策树

回归树--对连续变量做决策树

4.决策树算法(贪心算法)

有监督的学习
非参数学习算法
自顶向下递归方式构造决策树
在每一步选择中都采取在当前状态下最好/优的选择

决策树学习的算法通常是一个递归地选择最优特征, 并根据该特征对训练数据进行分割, 使得各个子数据集有一个最好的分类的过程.
在决策树算法中,ID3基于信息增益作为属性选择的度量, C4.5基于信息增益作为属性选择的度量, CART基于基尼指数作为属性选择的度量

5.决策树学习过程

特征选择
决策树生成: 递归结构, 对应于模型的局部最优
决策树剪枝: 缩小树结构规模, 缓解过拟合, 对应于模型的全局选择

6.决策树优缺点

优点:
(1)速度快: 计算量相对较小, 且容易转化成分类规则. 只要沿着树根向下一直走到叶, 沿途的分裂条件就能够唯一确定一条分类的谓词.
(2)准确性高: 挖掘出来的分类规则准确性高, 便于理解, 决策树可以清晰的显示哪些字段比较重要, 即可以生成可以理解的规则.
(3)可以处理连续和种类字段
(4)不需要任何领域知识和参数假设
(5)适合高维数据
缺点:
(1)对于各类别样本数量不一致的数据, 信息增益偏向于那些更多数值的特征
(2)容易过拟合
(3)忽略属性之间的相关性

5.2 决策树数学知识

1.信息论:

若一事假有k种结果, 对应概率为 $P_i$ , 则此事件发生后所得到的信息量I为:
$\begin{split} I &= -(p_1*log_2(p_1) + p_2*log_2(p_2) + ... + p_k*log_2(p_k)) \\ &= - \sum_{i=1}^k{p_ilog_2p_i} \end{split}$

2.熵:

给定包含关于某个目标概念的正反样例的样例集S, 那么S相对这个布尔型分类的熵为:
$Entropy(S) \equiv -(p_{\oplus} * log_2p_{\oplus}) - (p_{\ominus} * log_2p_{\ominus})$
其中 $p_{\oplus}$ 代表正样例, $p_{\ominus}$ 代表反样例

3.条件熵:

假设随机变量(X,Y), 其联合分布概率为P(X=xi,Y=yi)=Pij, i=1,2,...,n;j=1,2,..,m
则条件熵H(Y|X)表示在已知随机变量X的条件下随机变量Y的不确定性, 其定义为X在给定条件下Y的条件概率分布的熵对X的数学期望
$H(X|Y) = \sum_{i-1}^n p_iH(Y|X=x_i)$

5.3 决策树算法Hunt

在Hunt算法中, 通过递归的方式建立决策树.

如果数据集D种所有的数据都属于一个类, 那么将该节点标记为节点.
如果数据集D中包含属于多个类的训练数据, 那么选择一个属性将训练数据划分为较小的子集, 对于测试条件的每个输出, 创建一个子节点, 并根据测试结果将D种的记录分布到子节点中, 然后对每一个子节点重复1,2过程, 对子节点的子节点依然是递归地调用该算法, 直至最后停止.

5.4 决策树算法ID3

1.分类系统信息熵

$H(C) = \sum_{i=1}^n P(C_i)*log_2P(C_i)$

2.条件熵

3.信息增益Gain(S, A) 定义

4.属性选择度量

使用信息增益, 选择最高信息增益的属性作为当前节点的测试属性

5.算法不足

使用ID3算法构建决策树时, 若出现各属性值取值数分布偏差大的情况, 分类精度会大打折扣
ID3算法本身并未给出处理连续数据的方法
ID3算法不能处理带有缺失值的数据集, 故在算法挖掘之前需要对数据集中的缺失值进行预处理
ID3算法只有树的生成, 所以该算法生成的树容易产生过拟合

6.算法流程

ID3(Examples,Target_attribute,Attributes)

Examples即训练样例集. Target_attribute是这棵树要预测的目标属性. Attributes是除目标属性外供学习到的决策树测试的属性列表. 返回能正确分类给定Examples的决策树.

创建树的Root结点

如果Examples都为正, 那么返回label=+的单节点树Root
如果Examples都为负, 那么返回label=-的单节点树Root
如果Attributes为空, 那么返回单节点树Root, label=Examples中最普通的Target_attribute值
否则
- A ← Attributes中分类Examples能力最好*的属性

7.算法Python实现

Python实现熵的计算

from math import log
def calcShanNonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*log(prob,2)
    return shannonEnt

# example
dataset = [[1],[2],[3],[3],]
sne = calcShanNonEnt(dataset)

print(sne)

Sklearn.tree参数介绍及使用建议

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

# Examples
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)

from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(iris.data,iris.target,test_size=0.3) 

res = clf.fit(X_train,y_train)
pre = clf.predict(X_test)
sco = clf.score(X_test, y_test)

print(y_test)
print(pre)
print(sco)

clf.apply(X_train)
clf.apply(X_test)

clf.decision_path(X_train)

type(clf.decision_path(X_train))

X_train.shape

clf.feature_importances_

from sklearn.tree import DecisionTreeClassifier


clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

clf.feature_importances_

clf.get_params()

clf.predict_log_proba(X_test)

clf.predict_proba(X_test)

DecisionTreeClassifier实例

限制决策树层数为4的DecisionTreeClassifier实例

from itertools import product

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# 使用iris数据
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# 训练模型, 限制树的最大深度为4
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(X,y)

# Plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, .1),
                    np.arange(y_min, y_max, .1))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=.8)
plt.show()

output_12_0.png

Plot the decision surfaces of ensembles of trees on the iris dataset

This plot compares the decision surfaces learned by a dcision tree classifier(first column), by a random forest classifier(second column), by an extra-trees classifier(third column) and by an AdaBoost classifier(fouth column).

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                             AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
n_estimators = 30
cmap = plt.cm.RdYlBu
plot_step = 0.02
plot_step_coarser = 0.5
RANDOM_SEED = 13


# Load data
iris = load_iris()

plot_idx = 1

models = [DecisionTreeClassifier(max_depth=None),
         RandomForestClassifier(n_estimators=n_estimators),
         ExtraTreesClassifier(n_estimators=n_estimators),
         AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators)]

for pair in ([0,1], [0,2], [2,3]):
    for model in models:
        # print(pair, model)
        # only take the two correspoding features
        X = iris.data[:, pair]
        y = iris.target
        
        # Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]
        
        # Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std
        
        # Train
        clf = clone(model)
        clf = model.fit(X, y)
        
        scores = clf.score(X, y)
        # Create a title for each column and the console by using str() and
        # slicing away useless parts of the string
        model_title = str(type(model)).split(".")[-1][:-2][:-len('Classifier')]
        model_details = model_title
        if hasattr(model, "estimators_"):
            model_details +=  " with {} estimators".format(len(model.estimators_))
        print(model_details + " with features", pair,
             "has a score of", scores)
        
        plt.subplot(3, 4, plot_idx)
        if plot_idx <= len(models):
            # Add a title at the top of eeach column
            plt.title(model_title)
            
        # Now plot the decision boundary using a fine mesh as input to a filled contour plot
        x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1
        y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                            np.arange(y_min, y_max, plot_step))
        
        # Plot either a single DecisionTreeClassifier or alpha blend the
        # decision surfaces of the ensemble of classifiers
        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = plt.contourf(xx, yy, Z, cmap=cmap)
        else:
            # Choose alpha blend level with respect to the number of estimators
            # that are in use (nothing that AdaBoost can use fewer estimtors
            # than its maximum if it achieves a good enough fit early on)
            estimator_alpha = 1.0 / len(model.estimators_)
            print(len(model.estimators_))
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)
                
        # Build a coarser grid to plot a set of ensemble classifications
        # to show how these are different to what we see in the decision
        # surfaces. These points are regularly space and do not have a
        # black outline
        xx_coarser, yy_coarser = np.meshgrid(
            np.arange(x_min, x_max, plot_step_coarser),
            np.arange(y_min, y_max, plot_step_coarser))
        Z_points_coarser = model.predict(np.c_[xx_coarser.ravel(),
                                              yy_coarser.ravel()]
                                        ).reshape(xx_coarser.shape)
        cs_points = plt.scatter(xx_coarser, yy_coarser, s=15,
                               c=Z_points_coarser, cmap=cmap,
                               edgecolors="none")
        plt.scatter(X[:, 0], X[:, 1], c=y,
                cmap=ListedColormap(['r', 'y', 'b']),
                edgecolor='k', s=20)

        plot_idx += 1

Output:

Automatically created module for IPython interactive environment
DecisionTree with features [0, 1] has a score of 0.9266666666666666
RandomForest with 30 estimators with features [0, 1] has a score of 0.9266666666666666
30
ExtraTrees with 30 estimators with features [0, 1] has a score of 0.9266666666666666
30
AdaBoost with 30 estimators with features [0, 1] has a score of 0.84
30
DecisionTree with features [0, 2] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [0, 2] has a score of 0.9933333333333333
30
ExtraTrees with 30 estimators with features [0, 2] has a score of 0.9933333333333333
30
AdaBoost with 30 estimators with features [0, 2] has a score of 0.9933333333333333
30
DecisionTree with features [2, 3] has a score of 0.9933333333333333
RandomForest with 30 estimators with features [2, 3] has a score of 0.9933333333333333
30
ExtraTrees with 30 estimators with features [2, 3] has a score of 0.9933333333333333
30
AdaBoost with 30 estimators with features [2, 3] has a score of 0.9933333333333333
30

output_14_1.png

Classifier comparison

A comparison of a several classifiers in scikit-learn on synthetic datasets.
The point of this examples is to illustrate the nature of decision boundaries of different classifiers.

Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.

print(__doc__)

# Code source: Gaël Varoquaux
#              Andreas Müller
# Mmodified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
# classifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02 # step size in the mesh

names = ['Nearest Neighbors', 'Linear SVM', 'RBF SVM', 'Gaussian Process', 'Decision Tree', 'Random Forest', 'Neural Net', 'AdaBoost',' Naive Bayes','QDA']

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear",C=0.025), # C is pantly parameter
    SVC(gamma=2, C=1), # kernel: rbf(default), gamma: Kernel coefficient
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1), # MultiLayer Perceptron
    AdaBoostClassifier(),
    GaussianNB(), # Gaussian Naive Bayes
    QuadraticDiscriminantAnalysis()    
]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, 
                          random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
           make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
           ]

figure = plt.figure(figsize=(27,9))

i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess datasset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = \
        train_test_split(X,y,test_size=.4,random_state=42)
    
    x_min, x_max = X[:,0].min() - .5, X[:,0].max() + .5
    y_min, y_max = X[:,1].min() - .5, X[:,0].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                        np.arange(y_min, y_max, h))
    
    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=cm_bright,
              edgecolors='k')
    # and testing points
    ax.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap=cm_bright, alpha=0.6,
              edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1
    
    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        
        # plot the decision boundary, For that, we will assign a color to each
        # point in the mesh [x_min, x_max]*[y_min, y_max].
        if hasattr(clf, 'decision_function'):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
        
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        
        # plot also the training points
        ax.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=cm_bright,
                  edgecolors='k')
        # and testing points
        ax.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap=cm_bright,
                  edgecolors='k', alpha=.6)
        
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt ==  0:
            ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
               size=15, horizontalalignment='right')
        i += 1
        
plt.tight_layout()
plt.show()

Classifier comparison.png

Two-class AdaBoost

This example fits an AdaBoost decisin stump on a non-linearly separable classification dataset composed of two "Gaussian quantiles" clusters and plots the decision boundary and decision scores.

print(__doc__)

# Author: Noel Dawe <noel.dawe@gmail.com>
#
# License; BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles


# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
                                n_samples=200, n_features=2,
                                n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3,3), cov=1.5,
                                n_samples=300, n_features=2,
                                n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))


# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                        algorithm='SAMME',
                        n_estimators=200)
bdt.fit(X, y)


plot_colors = 'br'
plot_step = .02
class_names = 'AB'

plt.figure(figsize=(10,5))

# plot the decision boundaries
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                    np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
               c=c, cmap=plt.cm.Paired, 
               s=20, edgecolor='k',
               label=("Class %s" % n))
    
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')

# Plot the two-class decision scors
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
            bins=10,
            range=plot_range,
            facecolor=c,
            label=('Class %s' % n),
            alpha=.5,
            edgecolor='k')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.ylabel('Samples')
plt.xlabel('Score')
plt.title('Decision Scores')

plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()