Dataset URL: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-2nd-edition/master/code/ch10/housing.data.txt
Data features:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise)
NOX: nitric oxide concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of [people of African American descent] by town
LSTAT: percentage of lower status of the population
MEDV: median value of owner-occupied homes in $1000s
We use the seaborn package to visualize the housing price data.
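The code below assumes the data has already been loaded into a pandas DataFrame named df. A minimal loading sketch (assuming the URL above and the column order from the feature table):
>>> import pandas as pd
>>> df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
...                  'python-machine-learning-book-2nd-edition/'
...                  'master/code/ch10/housing.data.txt',
...                  header=None, sep='\s+')  # whitespace-separated, no header row
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
...               'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']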
(1) Pairwise comparison plots of the variables: each variable against itself (the diagonal) is shown as a histogram, and each pair of different variables is shown as a scatter plot.
The code is as follows:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> sns.pairplot(df[cols], size=2.5)  # in seaborn >= 0.9 this argument is height=2.5
>>> plt.tight_layout()
>>> plt.show()
Analysis: we can see a linear relationship between RM and the house prices MEDV (fifth column, fourth row). In addition, the histogram in the lower-right subplot of the scatter plot matrix shows that MEDV appears to be roughly normally distributed but contains several outliers.
(2) Quantifying the correlation between features
Pearson's r measures the linear correlation between two features and takes values in [-1, 1]: 1 for a perfect positive correlation, 0 for no correlation, and -1 for a perfect negative correlation. It can be computed with NumPy's corrcoef function.
The code is as follows:
>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm,
... cbar=True,
... annot=True,
... square=True,
... fmt='.2f',
... annot_kws={'size': 15},
... yticklabels=cols,
... xticklabels=cols)
>>> plt.show()
To fit a linear regression model, we are interested in the features that have a high correlation with our target variable MEDV. Looking at the correlation matrix, the target shows its strongest correlation with the LSTAT variable (an absolute value of about 0.74; the relationship is negative); however, as the scatter plot matrix showed, the relationship between LSTAT and MEDV is clearly nonlinear. On the other hand, the correlation between RM and MEDV is also fairly high (0.70), and the scatter plot suggests this relationship is linear, which makes RM a reasonable choice for a simple linear model.
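As a quick sanity check on corrcoef, Pearson's r for a single pair of columns can also be computed by hand as the covariance divided by the product of the sample standard deviations (a minimal sketch for RM vs. MEDV):
>>> x, t = df['RM'].values, df['MEDV'].values
>>> r = np.cov(x, t)[0, 1] / (np.std(x, ddof=1) * np.std(t, ddof=1))
>>> print('Pearson r (RM, MEDV): %.2f' % r)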
(3) Defining the linear regression model
The LinearRegressionGD class below fits a linear model with batch gradient descent:
class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes (epochs) over the training set

    def fit(self, X, y):
        # one weight per feature plus a bias term w_[0]
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # batch gradient descent update on the sum-of-squared-errors cost
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)
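For reference, fit() minimizes the sum-of-squared-errors cost by batch gradient descent; written out, the update the code above performs each epoch is (a standard formulation, not quoted from the source):
J(\mathbf{w}) = \frac{1}{2} \sum_i \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2,
\qquad
\mathbf{w}_{1:} \leftarrow \mathbf{w}_{1:} + \eta\, X^{T}(\mathbf{y} - \hat{\mathbf{y}}),
\qquad
w_0 \leftarrow w_0 + \eta \sum_i \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)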
Training the model:
>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)
Next, we verify that the chosen number of epochs is enough for the GD algorithm to converge.
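The fit() method records the SSE cost of every epoch in lr.cost_, so convergence can be checked with a simple line plot (a minimal sketch):
>>> plt.plot(range(1, lr.n_iter + 1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()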
As the resulting plot shows, the algorithm has essentially converged after about five epochs.
(4) Visualizing how well the linear regression model fits the training data
>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='steelblue', edgecolor='white', s=70)
...     plt.plot(X, model.predict(X), color='black', lw=2)
...     return None
>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000s [MEDV] (standardized)')
>>> plt.show()
As the plot shows, the fit is clearly unsatisfactory in places; around three rooms, for example, the regression line is far from the observed prices.
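Because the model was trained on standardized data, predictions need to be mapped back to the original scale. A minimal sketch (predicting the price of a hypothetical five-room house; the value 5.0 is only an illustration):
>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> print('Price in $1000s: %.3f' %
...       sc_y.inverse_transform(price_std.reshape(-1, 1)).flatten()[0])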
(5) RANSAC algorithm
1. Select a random subset of the data and treat it as the inlier set.
2. Fit the model to this inlier set.
3. Test all remaining points against the fitted model and determine which of them are inliers.
4. Record the number of inliers.
5. Repeat the steps above several times.
6. The model fitted in the iteration with the largest number of inliers is taken as the final solution.
>>> from sklearn.linear_model import LinearRegression, RANSACRegressor
>>> ransac = RANSACRegressor(LinearRegression(),
...                          max_trials=100,
...                          min_samples=50,
...                          loss='absolute_loss',  # 'absolute_error' in newer scikit-learn
...                          residual_threshold=5.0,
...                          random_state=0)
>>> ransac.fit(X, y)
With max_trials=100 we set the maximum number of iterations, and with min_samples=50 we require each randomly chosen subset to contain at least 50 samples. Passing 'absolute_loss' to the loss parameter makes the algorithm compute the absolute vertical distance between the fitted line and each sample point. Setting residual_threshold=5.0 means a sample is counted as an inlier only if its vertical distance to the fitted line is within 5 units, which happens to work well on this particular dataset.
>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask],
... c='steelblue', edgecolor='white',
... marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
... c='limegreen', edgecolor='white',
... marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='black', lw=2)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()
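The line parameters fitted on the inlier set can be inspected through the underlying estimator (a fitted RANSACRegressor exposes it as estimator_), for example:
>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)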
(6) Evaluating the model on a held-out test set
>>> from sklearn.model_selection import train_test_split
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)
>>> plt.scatter(y_train_pred, y_train_pred - y_train,
... c='steelblue', marker='o', edgecolor='white',
... label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
... c='limegreen', marker='s', edgecolor='white',
... label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, color='black', lw=2)
>>> plt.xlim([-10, 50])
>>> plt.show()
In the case of a perfect prediction the residuals would be exactly zero, which we will probably never encounter in realistic, practical applications. For a good regression model, however, we expect the errors to be random and the residuals to be scattered randomly around the centerline.
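The fit can also be quantified with the mean squared error and the coefficient of determination R^2 on both splits (a short sketch; these sklearn.metrics imports are also assumed by the r2_score and mean_squared_error calls in sections (8) and (10) below):
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> print('MSE train: %.3f, test: %.3f' % (
...       mean_squared_error(y_train, y_train_pred),
...       mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...       r2_score(y_train, y_train_pred),
...       r2_score(y_test, y_test_pred)))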
(7) Polynomial terms
1. Add a second-degree (quadratic) polynomial term:
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.array([ 258.0, 270.0, 294.0, 320.0, 342.0,
... 368.0, 396.0, 446.0, 480.0, 586.0])\
... [:, np.newaxis]
>>> y = np.array([ 236.4, 234.4, 252.8, 298.6, 314.2,
... 342.2, 360.8, 368.0, 391.2, 390.8])
>>> lr = LinearRegression()
>>> pr = LinearRegression()
>>> quadratic = PolynomialFeatures(degree=2)
>>> X_quad = quadratic.fit_transform(X)
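As an aside, each row of X_quad contains the bias term, the original feature value, and its square; the first row can be inspected directly (the expected values follow from X[0] = 258.0):
>>> print(X_quad[0])   # -> [1.0, 258.0, 66564.0], i.e. [1, x, x^2]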
2. Fit a simple linear regression model for comparison:
>>> lr.fit(X, y)
>>> X_fit = np.arange(250,600,10)[:, np.newaxis]
>>> y_lin_fit = lr.predict(X_fit)
3. Fit a multiple regression model on the polynomially transformed features:
>>> pr.fit(X_quad, y)
>>> y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))
4. Plot the results:
>>> plt.scatter(X, y, label='training points')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear fit', linestyle='--')
>>> plt.plot(X_fit, y_quad_fit,
... label='quadratic fit')
>>> plt.legend(loc='upper left')
>>> plt.show()
As the plot shows, the polynomial model captures the relationship in the training data better than the linear fit.
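The difference can also be quantified on the training data (a sketch, reusing the metrics imported earlier):
>>> y_lin_pred = lr.predict(X)
>>> y_quad_pred = pr.predict(X_quad)
>>> print('Training MSE linear: %.3f, quadratic: %.3f' % (
...       mean_squared_error(y, y_lin_pred),
...       mean_squared_error(y, y_quad_pred)))
>>> print('Training R^2 linear: %.3f, quadratic: %.3f' % (
...       r2_score(y, y_lin_pred),
...       r2_score(y, y_quad_pred)))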
(8) Quadratic and cubic polynomials vs. a linear model for predicting house prices
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()
# create quadratic features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)
# fit features
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))
# plot results
>>> plt.scatter(X, y, label='training points', color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
... label='linear (d=1), $R^2=%.2f$' % linear_r2,
... color='blue',
... lw=2,
... linestyle=':')
>>> plt.plot(X_fit, y_quad_fit,
... label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
... color='red',
... lw=2,
... linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit,
... label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
... color='green',
... lw=2,
... linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()
As we can see, the cubic fit captures the relationship between house prices and LSTAT better than the linear and quadratic fits. However, we should be aware that adding more and more polynomial features increases the complexity of the model and therefore the risk of overfitting. In practice, it is therefore always recommended to evaluate model performance on a separate test dataset.
(9) Decision tree
The code trains a decision tree regressor with a maximum depth of 3:
>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
>>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.show()
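The maximum depth controls how closely the resulting step function follows the data; its in-sample fit can be checked quickly (a sketch, assuming r2_score from sklearn.metrics has been imported as above):
>>> print('R^2 (training): %.3f' % r2_score(y, tree.predict(X)))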
(10) Random forest
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
... train_test_split(X, y,
... test_size=0.4,
... random_state=1)
>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(n_estimators=1000,
...                                criterion='mse',  # 'squared_error' in newer scikit-learn
...                                random_state=1,
...                                n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
... mean_squared_error(y_train, y_train_pred),
... mean_squared_error(y_test, y_test_pred)))
MSE train: 1.642, test: 11.052
>>> print('R^2 train: %.3f, test: %.3f' % (
... r2_score(y_train, y_train_pred),
... r2_score(y_test, y_test_pred)))
R^2 train: 0.979, test: 0.878
>>> plt.scatter(y_train_pred,
... y_train_pred - y_train,
... c='steelblue',
... edgecolor='white',
... marker='o',
... s=35,
... alpha=0.9,
... label='Training data')
>>> plt.scatter(y_test_pred,
... y_test_pred - y_test,
... c='limegreen',
... edgecolor='white',
... marker='s',
... s=35,
... alpha=0.9,
... label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='black')
>>> plt.xlim([-10, 50])
>>> plt.show()
Analysis: although the residuals are not perfectly randomly distributed, the random forest clearly performs better than the earlier linear regression model.