Building a Regression Model with Spark (Part 2)

  • Training and applying regression models
    Python gives us convenient access to all of the model parameters; we simply use the relevant methods. Import the relevant modules and call help on the train method to see the details of these parameters:

    from pyspark.mllib.regression import LinearRegressionWithSGD
    from pyspark.mllib.tree import DecisionTree
    help(LinearRegressionWithSGD.train)
    
  • Training a regression model on the bike sharing data
    First, train a linear model and inspect its predictions on the training data:

    linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
    true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
    >>> print "Linear Model predictions: " + str(true_vs_predicted.take(5))
    Linear Model predictions: [(16.0, 117.89250386724846), (40.0, 116.2249612319211), (32.0, 116.02369145779235), (13.0, 115.67088016754433), (1.0, 115.56315650834317)]
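    With intercept=False, the linear model's prediction is simply the dot product of the learned weight vector with the feature vector. A minimal numpy sketch of that prediction step (toy weights and features, not the actual learned model):

    ```python
    import numpy as np

    # Toy weight and feature vectors (hypothetical values, not the model above)
    weights = np.array([0.5, -1.0, 2.0])
    features = np.array([4.0, 1.0, 3.0])

    # With intercept=False, predict() reduces to the dot product w . x
    prediction = np.dot(weights, features)
    print(prediction)  # 0.5*4 - 1.0*1 + 2.0*3 = 7.0
    ```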
    

    Train a decision tree model using trainRegressor's default arguments (which gives a tree of depth 5):

    dt_model = DecisionTree.trainRegressor(data_dt, {})
    preds = dt_model.predict(data_dt.map(lambda p: p.features))
    actual = data.map(lambda p: p.label)
    true_vs_predicted_dt = actual.zip(preds)
    >>> print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(5))
    Decision Tree predictions: [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945), (13.0, 14.284023668639053), (1.0, 14.284023668639053)]
    >>> print "Decision Tree depth: " + str(dt_model.depth())
    Decision Tree depth: 5
    >>> print "Decision Tree number of nodes: " + str(dt_model.numNodes())
    Decision Tree number of nodes: 63
    
    
  • Evaluating regression model performance

    # numpy is needed by the metric functions below
    import numpy as np

    # squared error
    def squared_error(actual, pred):
        return (pred - actual)**2
    # absolute error, used for the mean absolute error (MAE)
    def abs_error(actual, pred):
        return np.abs(pred - actual)
    # squared log error, used for the root mean squared log error (RMSLE)
    def squared_log_error(pred, actual):
        return (np.log(pred + 1) - np.log(actual + 1))**2
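    These metric helpers can be sanity-checked on a toy pair of actual/predicted arrays with plain numpy, without Spark (the numbers below are illustrative only):

    ```python
    import numpy as np

    def squared_error(actual, pred):
        return (pred - actual)**2

    def abs_error(actual, pred):
        return np.abs(pred - actual)

    def squared_log_error(pred, actual):
        return (np.log(pred + 1) - np.log(actual + 1))**2

    # Toy actual/predicted values (illustrative only)
    actual = np.array([3.0, 5.0, 2.5])
    pred = np.array([2.5, 5.0, 4.0])

    mse = np.mean(squared_error(actual, pred))                 # 2.5/3 ~ 0.8333
    mae = np.mean(abs_error(actual, pred))                     # 2.0/3 ~ 0.6667
    rmsle = np.sqrt(np.mean(squared_log_error(pred, actual)))

    print(mse, mae, rmsle)
    ```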
    
  • Computing performance under different metrics

  • Linear model

    mse = true_vs_predicted.map(lambda (t, p): squared_error(t, p)).mean()
    mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()
    rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())
    >>> print "Linear Model - Mean Squared Error: %2.4f" % mse
    Linear Model - Mean Squared Error: 30679.4539
    >>> print "Linear Model - Mean Absolute Error: %2.4f" % mae
    Linear Model - Mean Absolute Error: 130.6429
    >>> print "Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle
    Linear Model - Root Mean Squared Log Error: 1.4653
    
    
  • Decision tree

    mse_dt = true_vs_predicted_dt.map(lambda (t, p): squared_error(t, p)).mean()
    mae_dt = true_vs_predicted_dt.map(lambda (t, p): abs_error(t, p)).mean()
    rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())
    >>> print "Decision Tree - Mean Squared Error: %2.4f" % mse_dt
    Decision Tree - Mean Squared Error: 11611.4860
    >>> print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt
    Decision Tree - Mean Absolute Error: 71.1502
    >>> print "Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt
    Decision Tree - Root Mean Squared Log Error: 0.6251
    
    
  • Improving model performance and tuning parameters

  • Transforming the target variable

  • The impact of a log transformation

    data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))
    model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)
    true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))
    mse_log = true_vs_predicted_log.map(lambda (t, p): squared_error(t, p)).mean()
    mae_log = true_vs_predicted_log.map(lambda (t, p): abs_error(t, p)).mean()
    rmsle_log = np.sqrt(true_vs_predicted_log.map(lambda (t, p): squared_log_error(t, p)).mean())
    >>> print "Mean Squared Error: %2.4f" % mse_log
    Mean Squared Error: 50685.5559
    >>> print "Mean Absolute Error: %2.4f" % mae_log
    Mean Absolute Error: 155.2955
    >>> print "Root Mean Squared Log Error: %2.4f" % rmsle_log
    Root Mean Squared Log Error: 1.5411
    >>> print "Non log-transformed predictions:\n" + str(true_vs_predicted.take(3))
    Non log-transformed predictions:
    [(16.0, 117.89250386724846), (40.0, 116.2249612319211), (32.0, 116.02369145779235)]
    >>> print "Log-transformed predictions:\n" + str(true_vs_predicted_log.take(3))
    Log-transformed predictions:
    [(15.999999999999998, 28.080291845456212), (40.0, 26.959480191001763), (32.0, 26.654725629458021)]
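    The round trip used above (train on log-transformed targets, then apply np.exp to the predictions to return to the original scale) can be illustrated on a few toy values:

    ```python
    import numpy as np

    # Toy skewed target values (hypothetical counts)
    y = np.array([1.0, 3.0, 10.0, 100.0])

    # The model is trained on log-transformed targets ...
    y_log = np.log(y)
    # ... and its log-space predictions are mapped back with exp
    y_back = np.exp(y_log)

    print(np.allclose(y, y_back))  # True: exp inverts log up to float error
    ```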
    
    
  • The same analysis for the decision tree model:

    data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))
    dt_model_log = DecisionTree.trainRegressor(data_dt_log, {})
    preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))
    actual_log = data_dt_log.map(lambda p: p.label)
    true_vs_predicted_dt_log = actual_log.zip(preds_log).map(lambda (t, p): (np.exp(t), np.exp(p)))
    mse_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): squared_error(t, p)).mean()
    mae_log_dt = true_vs_predicted_dt_log.map(lambda (t, p): abs_error(t, p)).mean()
    rmsle_log_dt = np.sqrt(true_vs_predicted_dt_log.map(lambda (t, p): squared_log_error(t, p)).mean())
    >>> print "Mean Squared Error: %2.4f" % mse_log_dt
    Mean Squared Error: 14781.5760
    >>> print "Mean Absolute Error: %2.4f" % mae_log_dt
    Mean Absolute Error: 76.4131
    >>> print "Root Mean Squared Log Error: %2.4f" % rmsle_log_dt
    Root Mean Squared Log Error: 0.6406
    >>> print "Non log-transformed predictions:\n" + str(true_vs_predicted_dt.take(3))
    Non log-transformed predictions:
    [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945)]
    >>> print "Log-transformed predictions:\n" + str(true_vs_predicted_dt_log.take(3))
    Log-transformed predictions:
    [(15.999999999999998, 37.530779787154508), (40.0, 37.530779787154508), (32.0, 7.2797070993907287)]
    
    
  • Tuning model parameters

  • Creating training and test sets to evaluate parameters

    data_with_idx = data.zipWithIndex().map(lambda (k, v): (v, k))
    test = data_with_idx.sample(False, 0.2, 42)
    train = data_with_idx.subtractByKey(test)
    
    train_data = train.map(lambda (idx, p): p)
    test_data = test.map(lambda (idx, p) : p)
    train_size = train_data.count()
    test_size = test_data.count()
    >>> print "Training data size: %d" % train_size
    Training data size: 13934
    >>> print "Test data size: %d" % test_size
    Test data size: 3445
    >>> print "Total data size: %d " % num_data
    Total data size: 17379 
    >>> print "Train + Test size : %d" % (train_size + test_size)
    Train + Test size : 17379
    
    data_with_idx_dt = data_dt.zipWithIndex().map(lambda (k, v): (v, k))
    test_dt = data_with_idx_dt.sample(False, 0.2, 42)
    train_dt = data_with_idx_dt.subtractByKey(test_dt)
    train_data_dt = train_dt.map(lambda (idx, p): p)
    test_data_dt = test_dt.map(lambda (idx, p) : p)
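    The split idiom above (key each record with zipWithIndex, sample roughly 20% for the test set, then subtractByKey for the training set) can be mimicked without Spark. A plain-Python sketch of the same idea, on hypothetical records:

    ```python
    import random

    # Toy stand-in for the RDD split above: key each record by its index,
    # put roughly 20% of the keys in the test set, and keep the rest for
    # training -- the same idea as sample(False, 0.2, 42) + subtractByKey.
    data = ['record_%d' % i for i in range(100)]
    indexed = list(enumerate(data))  # like zipWithIndex, as (index, record)

    rng = random.Random(42)
    test_keys = set(idx for idx, _ in indexed if rng.random() < 0.2)

    train = [rec for idx, rec in indexed if idx not in test_keys]
    test = [rec for idx, rec in indexed if idx in test_keys]

    # train and test partition the data with no overlap
    assert len(train) + len(test) == len(data)
    assert not set(train) & set(test)
    ```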
    
  • The impact of parameter settings on the linear model

    def evaluate(train, test, iterations, step, regParam, regType, intercept):
        model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)
        tp = test.map(lambda p: (p.label, model.predict(p.features)))
        rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
        return rmsle
    
  • Iterations

    params = [1, 5, 10, 20, 50, 100]
    metrics = [evaluate(train_data, test_data, param, 0.01, 0.0, 'l2', False) for param in params]
    >>> print params
    [1, 5, 10, 20, 50, 100]
    >>> print metrics
    [2.8779465130028195, 2.0390187660391499, 1.7761565324837876, 1.5828778102209107, 1.4382263191764473, 1.4050638054019446]
    
    
  • Step size

    params = [0.01, 0.025, 0.05, 0.1, 1.0]
    metrics = [evaluate(train_data, test_data, 10, param, 0.0, 'l2', False) for param in params]
    >>> print params
    [0.01, 0.025, 0.05, 0.1, 1.0]
    >>> print metrics
    [1.7761565324837874, 1.4379348243997032, 1.4189071944747715, 1.5027293911925559, nan]
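    The nan at step size 1.0 is SGD diverging: too large a step overshoots the minimum and the error grows without bound. A one-dimensional gradient-descent toy makes this concrete (minimizing f(w) = w**2; the step sizes are illustrative, not the ones above):

    ```python
    # Gradient descent on f(w) = w**2, whose gradient is 2w. The update is
    # w <- w - step * 2w = (1 - 2*step) * w, so it converges only while
    # |1 - 2*step| < 1 and diverges for larger steps.
    def descend(step, iterations=10, w=1.0):
        for _ in range(iterations):
            w -= step * 2 * w
        return w

    print(abs(descend(0.1)))  # about 0.107: shrinking toward the minimum at 0
    print(abs(descend(1.5)))  # 1024.0: the iterates blow up, like the nan above
    ```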
    
    
  • L2 regularization

    params = [0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]
    metrics = [evaluate(train_data, test_data, 10, 0.1, param, 'l2', False) for param in params]
    >>> print params
    [0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]
    >>> print metrics
    [1.5027293911925559, 1.5020646031965639, 1.4961903335175231, 1.4479313176192781, 1.4113329999970989, 1.5379824584440471, 1.8279564444985841]
    
  • L1 regularization

    params = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    metrics = [evaluate(train_data, test_data, 10, 0.1, param, 'l1', False) for param in params]
    >>> print params
    [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    >>> print metrics
    [1.5027293911925559, 1.5026938950690176, 1.5023761634555697, 1.499412856617814, 1.4713669769550108, 1.7596682962964314, 4.7551250073268614]
    
    model_l1 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=1.0, regType='l1', intercept=False)
    model_l1_10 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=10.0, regType='l1', intercept=False)
    model_l1_100 = LinearRegressionWithSGD.train(train_data, 10, 0.1, regParam=100.0, regType='l1', intercept=False)
    >>> print "L1 (1.0) number of zero weights: " + str(sum(model_l1.weights.array == 0))
    L1 (1.0) number of zero weights: 4
    >>> print "L1 (10.0) number of zero weights: " + str(sum(model_l1_10.weights.array == 0))
    L1 (10.0) number of zero weights: 33
    >>> print "L1 (100.0) number of zero weights: " + str(sum(model_l1_100.weights.array == 0))
    L1 (100.0) number of zero weights: 58
    
    
  • Intercept
    The final parameter of the linear model controls whether to fit an intercept. The intercept is a constant term added on top of the weighted features, and it effectively shifts predictions toward the central tendency of the target variable. If the data has already been normalized the intercept is unnecessary, but in theory including one should do no harm.

    params = [False, True]
    metrics = [evaluate(train_data, test_data, 10, 0.1, 1.0, 'l2', param) for param in params]
    >>> print params
    [False, True]
    >>> print metrics
    [1.4479313176192781, 1.4798261513419801]
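    The effect of the intercept can be seen with closed-form least squares on toy data: without a constant column the fit is forced through the origin, so the slope has to absorb any offset in the target. A sketch using np.linalg.lstsq (the data is hypothetical):

    ```python
    import numpy as np

    # Toy data generated as y = 2x + 5 (slope 2, offset 5; hypothetical numbers).
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = 2 * x + 5

    # No intercept: fit y ~ w*x, i.e. the design matrix has only the x column,
    # so the line is forced through the origin.
    w_no_int = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

    # With intercept: append a constant column of ones to fit y ~ w*x + b.
    X = np.column_stack([x, np.ones_like(x)])
    w, b = np.linalg.lstsq(X, y, rcond=None)[0]

    print(w_no_int)  # biased slope (~4.14): it has to absorb the offset
    print(w, b)      # recovers slope ~2.0 and intercept ~5.0
    ```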
    
    
  • The impact of parameter settings on decision tree performance

    def evaluate_dt(train, test, maxDepth, maxBins):
        model = DecisionTree.trainRegressor(train, {}, impurity='variance', maxDepth=maxDepth, maxBins=maxBins)
        preds = model.predict(test.map(lambda p: p.features))
        actual = test.map(lambda p: p.label)
        tp = actual.zip(preds)
        rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
        return rmsle
    
  • Tree depth

    params = [1, 2, 3, 4, 5, 10, 20]
    metrics = [evaluate_dt(train_data_dt, test_data_dt, param, 32) for param in params]
    >>> print params
    [1, 2, 3, 4, 5, 10, 20]
    >>> print metrics
    [1.0280339660196287, 0.92686672078778276, 0.81807794023407532, 0.74060228537329209, 0.63583503599563096, 0.42729311886162807, 0.45160118771289642]
    
    
  • Maximum number of bins

    params = [2, 4, 8, 16, 32, 64, 100]
    metrics = [evaluate_dt(train_data_dt, test_data_dt, 5, param) for param in params]
    >>> print params
    [2, 4, 8, 16, 32, 64, 100]
    >>> print metrics
    [1.3053120532822782, 0.81696140983649768, 0.75745322513058744, 0.61905245875374304, 0.63583503599563096, 0.63583503599563096, 0.63583503599563096]
    
    