【Risk Control】Financial Loan Default Prediction

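The company is a lending platform that provides loans to small and micro business owners, self-employed individuals and ordinary salaried workers. It builds risk profiles for different groups of borrowers from multi-dimensional data so that their risk can be fully assessed, and uses machine learning and big data techniques to speed up the iteration of its risk control models and make risk control more efficient. Because risk control has to balance risk against revenue, the models are evaluated by AUC: the larger the AUC, the better the model.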
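As a quick illustration of the metric, AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it by brute force on made-up labels and scores (not the competition data); the helper name `pairwise_auc` is hypothetical. `sklearn.metrics.roc_auc_score`, which this post imports later, computes the same quantity more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up): 1 = default, 0 = no default
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 (defaulter, non-defaulter) pairs are ordered correctly -> 0.75
```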


Mind map of the loan default prediction project

风控 金融违约预测.png

【1】Understanding the Data

1. Importing modules
import pandas as pd
import numpy as np

#Common Model Helpers
from sklearn import metrics
from sklearn.model_selection import train_test_split

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif']=['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False   # keep the minus sign rendering correctly with this font

import gc
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
2. Loading the data
df=pd.read_csv(r'D:/BaiduNetdiskDownload/天池/train.csv')
3. Inspecting the dataset
image.png

image.png
(1) Data size
  • Training set: 800,000 rows × 47 columns (including the label column)
  • Test set: 200,000 rows × 46 columns


    image.png

    image.png
(2) Mean, maximum, minimum and quartiles of each column
# Overall description of the data
df.describe()
image.png
# Box plots of the numeric features (numerical_feature is defined in step (5) below)
f3=pd.melt(df,value_vars=numerical_feature)
g=sns.FacetGrid(f3,col="variable",col_wrap=4,sharex=False,sharey=False)
g=g.map(sns.boxplot,'value')
image.png
(3) Number of missing values and data types
df.info()
image.png
import toad
toad.detector.detect(df).sort_values('missing',ascending=False)
image.png

The missing column is the proportion of missing values.

df_isnull_sum=pd.DataFrame(df.isnull().sum())
df_isnull_sum=df_isnull_sum[df_isnull_sum[0]>0]
plt.figure(figsize=(20,5))
sns.set(style='darkgrid')
sns.barplot(y=df_isnull_sum.sort_values(0,ascending=False).index,x=df_isnull_sum.sort_values(0,ascending=False)[0],palette="Blues_r")
plt.title('Empty data summary')
image.png

The figure above shows the number of missing values per column.

(4) Unique values
for i in df.columns:
    if df[i].nunique()<=1:
        print(i,'is a single-valued column; number of unique values:',df[i].nunique())

policyCode is a single-valued column; number of unique values: 1

toad.detector.detect(df).sort_values('unique',ascending=False)
image.png
(5) Splitting the numeric features into continuous and discrete variables
  • Distribution of the continuous variables
def get_numercial_serial_features(data,feas):
    numerical_serial_feature=[]
    numerical_noserial_feature=[]
    for fea in feas:
        temp=data[fea].nunique()
        if temp<=10:
            numerical_noserial_feature.append(fea)
        else:
            numerical_serial_feature.append(fea)
    return numerical_serial_feature,numerical_noserial_feature

numerical_feature=list(df.select_dtypes(exclude='object').columns)
category_feature=list(df.select_dtypes(include='object').columns)
label='isDefault'
numerical_feature.remove(label)
numerical_serial_feature,numerical_noserial_feature=get_numercial_serial_features(df,numerical_feature)

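# Visualize the distributions of the continuous numeric features
plt.rcParams['font.sans-serif']=['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False  
f = pd.melt(df, value_vars=numerical_serial_feature)
g = sns.FacetGrid(f, col="variable",  col_wrap=4, sharex=False, sharey=False)
# The KDE bandwidth is normally set in the plotting function; with FacetGrid it is passed through g.map.
# Note: sns.distplot is deprecated in newer seaborn releases.
g = g.map(sns.distplot,"value", kde_kws = {'bw' : 1})
plt.xticks(rotation=90)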
image.png
  • Distribution of the discrete variables
f1=pd.melt(df.loc[:,numerical_noserial_feature],value_vars=numerical_noserial_feature)
g=sns.FacetGrid(f1,col='variable',hue="variable",col_wrap=4, sharex=False, sharey=False)
g=g.map(sns.countplot,'value')
image.png
  • Text variables
#f2,((ax1,ax2,ax3),(ax4),(ax5,ax6,ax7))=plt.subplots(3,3,figsize=(20,10))
plt.figure(figsize=(30,30))
sns.set(style='darkgrid')

ax1=plt.subplot(331)
sns.countplot(data=df,x='grade',order=sorted(df['grade'].unique()),ax=ax1)
ax1.set_xticklabels(sorted(df['grade'].unique()),fontsize=15)
ax1.set_title('Amount of Grades',fontsize=20)

ax2=plt.subplot(332)
sns.countplot(data=df,x='subGrade',order=sorted(df['subGrade'].unique()),ax=ax2)
ax2.set_xticklabels(sorted(df['subGrade'].unique()),rotation=90,fontsize=15)
ax2.set_title('Amount of subGrades',fontsize=20)

ax3=plt.subplot(333)
employmentLength_order=['< 1 year','1 year','2 years','3 years', '4 years','5 years','6 years','7 years','8 years','9 years','10+ years']
sns.countplot(data=df,x='employmentLength',order=employmentLength_order,ax=ax3)
ax3.set_xticklabels(employmentLength_order,fontsize=15,rotation=45)
ax3.set_title('Amount of employmentLengths',fontsize=20)

ax4=plt.subplot(312)
sns.countplot(x=df['issueDate'],order=sorted(df['issueDate'].unique()),ax=ax4)
plt.xticks(range(1,len(df['issueDate'].unique()),3),rotation=45,fontsize=15)
ax4.set_title('Amount of issueDate',fontsize=20,ha='center')

# Year part of earliesCreditLine (the values look like 'Mon-YYYY')
a=df['earliesCreditLine'].apply(lambda x:x.split('-')[1])
ax5=plt.subplot(313)
sns.countplot(x=a,order=sorted(a.unique()),ax=ax5)
plt.xticks(range(1,len(a.unique()),1),rotation=45,fontsize=15)
ax5.set_title('Amount of earliesCreditLine',fontsize=20,ha='center')

# Adjust the spacing before showing; layout calls made after plt.show() do not affect the figure
plt.subplots_adjust(wspace =0.2, hspace =0.3)
plt.show()
image.png
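(6) Differences between defaulting and non-defaulting customers in the distributions of interest rate, installment amount and annual income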
image.png
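The post shows this comparison only as a figure. As a numeric complement, here is a minimal sketch that compares the per-class medians of the three features. The DataFrame `toy` and its values are made up for illustration; on the real data it would be replaced by `df`, and per-class histograms or density plots would give the visual counterpart.

```python
import pandas as pd

# Toy stand-in for the training set (values are made up)
toy = pd.DataFrame({
    'isDefault':    [0, 0, 0, 1, 1, 1],
    'interestRate': [7.5, 9.0, 11.0, 13.5, 15.0, 18.0],
    'installment':  [200.0, 310.0, 450.0, 380.0, 520.0, 610.0],
    'annualIncome': [85000.0, 60000.0, 72000.0, 40000.0, 55000.0, 48000.0],
})

cols = ['interestRate', 'installment', 'annualIncome']
# One row per class: 0 = non-default, 1 = default
summary = toy.groupby('isDefault')[cols].median()
print(summary)
```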

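【2】Feature Engineering

1. Data preprocessing

Based on the visualizations, the text features are first converted to numeric values. Note that the loan grade and the loan sub-grade must be mapped to numbers that preserve the order of the grades. A simple automatic encoding must not be used, otherwise the learning algorithm would later misread what the values of these two features mean.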

# Mapping dictionary for the loan grade
grade_dict = {'A':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':6}
# Mapping dictionary for the employment length (years)
employmentLength_dict = {'1 year':1,'10+ years':10,'2 years':2,'3 years':3,'4 years':4,
                         '5 years':5,'6 years':6,'7 years':7,'8 years':8,'9 years':9,'< 1 year':0}
# Transformation function for the loan sub-grade (grade is the already-mapped integer)
def get_sub_grade(grade, sub):
    return grade*10+int(sub[1])
# Transformation function for the loan issue date (converted to a month index)
def trans_issueDate(issueDate):
    year,month,day = issueDate.split('-')
    return int(year)*12+int(month)-1
# Transformation function for the month in which the borrower's earliest reported credit line was opened
def trans_earliesCreditLine(earliesCreditLine):
    month_dict = {"Jan":1, "Feb":2, "Mar":3, "Apr":4, "May":5, "Jun":6, "Jul":7, "Aug":8, "Sep":9, "Oct":10, "Nov":11, "Dec":12}
    month,year = earliesCreditLine.split('-')
    month = month_dict[month]
    return int(year)*12+month-1
# dfs = [train_data, test_data], as defined in the feature construction step below
for df in dfs:
    print(df.shape)
    df['grade'] = df['grade'].apply(lambda x: x if x not in grade_dict else grade_dict[x])
    df['subGrade'] = df.apply(lambda row: get_sub_grade(row['grade'],row['subGrade']), axis=1)
    df['employmentLength'] = df['employmentLength'].apply(lambda x: x if x not in employmentLength_dict else employmentLength_dict[x])
    #df['issueYear'] = df['issueDate'].apply(lambda x: int(x.split('-')[0]))
    df['issueDate'] = df['issueDate'].apply(lambda x: trans_issueDate(x))
    df['earliesCreditLine'] = df['earliesCreditLine'].apply(lambda x: trans_earliesCreditLine(x))
    df['dti'] = np.abs(df['dti'].fillna(1000))
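2. Feature construction
(1) Business features
  • date_Diff:
    time between opening the earliest credit line and taking the loan = loan issue month - month the earliest credit line was opened
  • installment_term_revolBal:
    ratio of total repayment to revolving balance = installment amount * 12 * loan term / total revolving credit balance
  • revolUtil_revolBal:
    ratio of revolving utilization to revolving balance = revolving line utilization rate (the amount of credit the borrower uses relative to all available revolving credit) / total revolving credit balance
  • openAcc_totalAcc:
    ratio of open credit lines to total credit lines = number of open credit lines in the borrower's credit file / total number of credit lines currently in the borrower's credit file
  • loanAmnt_dti_annualIncome:
    share of this loan in the borrower's total debt = loan amount / (debt-to-income ratio * annual income, i.e. total debt)
  • annualIncome_loanAmnt:
    income-to-loan ratio = annual income / loan amount
  • revolBal_loanAmnt:
    total revolving credit balance / loan amount
  • revolBal_installment:
    total revolving credit balance / installment amount
  • annualIncome_installment:
    annual income / installment amount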
dfs=[train_data, test_data]
concated_df = pd.concat(dfs)
for df in dfs:
    df['date_Diff'] = df['issueDate'] - df['earliesCreditLine']
    df['installment_term_revolBal'] = df['installment']*12*df['term']/(df['revolBal']+0.1)
    df['revolUtil_revolBal'] = df['revolUtil']/(df['revolBal']+0.1)
    df['openAcc_totalAcc'] = df['openAcc']/df['totalAcc']
    df['loanAmnt_dti_annualIncome'] = df['loanAmnt']/(np.abs(df['dti'])*df['annualIncome']+0.1)
    df['employmentLength_bin'] = df['employmentLength']
    df['issueDate_bin'] = df['issueDate']
    df['earliesCreditLine_bin'] = df['earliesCreditLine']
    df['term_bin'] = df['term']
    df['homeOwnership_bin'] = df['homeOwnership']
    df['annualIncome_loanAmnt'] = df['annualIncome']/(df['loanAmnt']+0.1)
    df['revolBal_loanAmnt'] = df['revolBal']/(df['loanAmnt']+0.1)
    df['revolBal_installment'] = df['revolBal']/(df['installment']+0.1)
    df['annualIncome_installment'] = df['annualIncome']/(df['installment']+0.1)
(2) Binning the continuous features
label_lst = []
# The binned features are treated as categorical features
# Split annualIncome and loanAmnt into 10 quantile bins
bin_number = 10
for i in range(bin_number):
    label_lst.append(i)
dfs[0]['annualIncome_bin'] = pd.qcut(concated_df['annualIncome'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['loanAmnt_bin'] = pd.qcut(concated_df['loanAmnt'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[1]['annualIncome_bin'] = pd.qcut(concated_df['annualIncome'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['loanAmnt_bin'] = pd.qcut(concated_df['loanAmnt'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
# Split interestRate, dti, installment, revolBal and revolUtil into 100 quantile bins
label_lst = []
bin_number = 100
for i in range(bin_number):
    label_lst.append(i)
dfs[0]['interestRate_bin'] = pd.qcut(concated_df['interestRate'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['dti_bin'] = pd.qcut(concated_df['dti'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['installment_bin'] = pd.qcut(concated_df['installment'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['revolBal_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['revolUtil_bin'] = pd.qcut(concated_df['revolUtil'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]

dfs[1]['interestRate_bin'] = pd.qcut(concated_df['interestRate'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['dti_bin'] = pd.qcut(concated_df['dti'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['installment_bin'] = pd.qcut(concated_df['installment'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['revolBal_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['revolUtil_bin'] = pd.qcut(concated_df['revolUtil'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
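The binning pattern above (quantile bins computed on the concatenated train and test data, then split back by position) can be sketched on toy data as follows. This is a sketch under assumptions, not the author's exact code: the frames `train` and `test` are made up, and it uses `labels=False` so that `pd.qcut` returns integer bin codes. When `duplicates='drop'` merges repeated bin edges, fewer bins remain than requested, and an explicit label list of the original length can then raise a `ValueError`.

```python
import pandas as pd

# Made-up data with a heavily repeated value, so some quantile edges coincide
train = pd.DataFrame({'annualIncome': [10.0, 20.0, 20.0, 20.0, 50.0, 80.0]})
test = pd.DataFrame({'annualIncome': [15.0, 20.0, 90.0]})

# Compute the bins on train + test together so both share the same edges
combined = pd.concat([train['annualIncome'], test['annualIncome']], ignore_index=True)
codes = pd.qcut(combined, q=4, labels=False, duplicates='drop')

# Split the codes back by position: first the training rows, then the test rows
n_train = len(train)
train['annualIncome_bin'] = codes.iloc[:n_train].to_numpy()
test['annualIncome_bin'] = codes.iloc[n_train:].to_numpy()

print(train)
print(test)
```

Here four bins were requested, but two of the quantile edges coincide at 20, so only three bins (codes 0 to 2) remain.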
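(3) Constructing logical features
  • New features are built by crossing the continuous features ['loanAmnt', 'installment', 'interestRate', 'annualIncome', 'dti', 'openAcc', 'revolBal', 'revolUtil', 'totalAcc'] with the categorical features ['issueDate','employmentLength','purpose','homeOwnership'].
  • Take the loan amount loanAmnt and the employment length employmentLength as an example. Iterate over the values of employmentLength; when the employment length is 2, say, the median loan amount of that group is 12000, so a customer's 'loanAmnt_employmentLength_ratio' is that customer's loanAmnt / 12000.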
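The ratio-to-group-median idea can be sketched compactly with pandas `groupby(...).transform('median')`. This is a swapped-in, vectorized alternative to the dictionary-plus-`apply` approach the post uses, shown on made-up data; the frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up loans: three borrowers with 2 years of employment, two with 5 years
toy = pd.DataFrame({
    'employmentLength': [2, 2, 2, 5, 5],
    'loanAmnt': [8000.0, 12000.0, 20000.0, 10000.0, 30000.0],
})

# Median loan amount of each employment-length group, aligned back to every row
group_median = toy.groupby('employmentLength')['loanAmnt'].transform('median')

# Each loan relative to the median of its own group
toy['loanAmnt_employmentLength_ratio'] = toy['loanAmnt'] / group_median
print(toy)
```

With these numbers the median for employment length 2 is 12000, so the borrower with a 12000 loan gets a ratio of exactly 1.0.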
# cate_features is the categorical feature list defined in the model-training step below
for df in[train_data, test_data]:
    for cate in cate_features:
        df[cate] = df[cate].fillna(0).astype('int')
ratio_feat_lst = ['loanAmnt', 'installment', 'interestRate', 'annualIncome', 'dti', 'openAcc', \
                  'revolBal', 'revolUtil', 'totalAcc']
issueDate_lst = list(set(concated_df['issueDate']))
employmentLength_lst = list(set(concated_df['employmentLength']))
purpose_lst = list(set(concated_df['purpose']))
homeOwnership_lst = list(set(concated_df['homeOwnership']))
for feat in ratio_feat_lst:
    issueDate_median = {}
    issueDate_item_rank = {}
    issueDate_label_mean = {}
    for dt in issueDate_lst:
        # Window of 3 months before to 3 months after the current month (7 months in total)
        mask = (concated_df['issueDate'] >= dt-3)&(concated_df['issueDate'] <= dt+3)
        # Same window, excluding the current month
        mask_1 = (concated_df['issueDate'] >= dt-3)&(concated_df['issueDate'] <= dt+3)&(concated_df['issueDate'] != dt)
        item_series = concated_df.loc[mask, feat]
        label_series = concated_df.loc[mask_1, 'isDefault']
        # Median of the feature within the window
        issueDate_median[dt] = item_series.median()
        issueDate_label_mean[dt] = label_series.mean()
        item_rank = item_series.rank()/len(item_series)
        issueDate_item_rank[dt] = {}
        for item,rank in zip(item_series, item_rank):
            issueDate_item_rank[dt][item] = rank
    employmentLength_median = {}
    for et in employmentLength_lst:
        mask = concated_df['employmentLength'] == et
        item_series = concated_df.loc[mask, feat]
        employmentLength_median[et] = item_series.median()
    purpose_median = {}
    for pp in purpose_lst:
        mask = concated_df['purpose'] == pp
        item_series = concated_df.loc[mask, feat]
        purpose_median[pp] = item_series.median()
    homeOwnership_median = {}
    for ho in homeOwnership_lst:
        mask = concated_df['homeOwnership'] == ho
        item_series = concated_df.loc[mask, feat]
        homeOwnership_median[ho] = item_series.median()
    for df in [train_data, test_data]:
        print(feat, df.shape)
        df['label_issueDate_mean'] = df['issueDate'].apply(lambda x: issueDate_label_mean[x])
        df[feat+'_issueDate_median'] = df['issueDate'].apply(lambda x: issueDate_median[x])
        #df['interestRate_ratio'] = df['interestRate']/df['interestRate_median']
        df[feat+'_issueDate_ratio'] = df.fillna(0).apply(lambda r: issueDate_item_rank[r['issueDate']][r[feat]], axis=1)
        df[feat+'_employmentLength_ratio'] = df.fillna(0).apply(lambda r: r[feat]/employmentLength_median[r['employmentLength']], axis=1)
        df[feat+'_purpose_ratio'] = df.fillna(0).apply(lambda r: r[feat]/purpose_median[r['purpose']], axis=1)
        df[feat+'_homeOwnership_ratio'] = df.fillna(0).apply(lambda r: r[feat]/homeOwnership_median[r['homeOwnership']], axis=1)
        print(feat, df.shape)
image.png

In total, 67 new features were constructed.

3. Feature selection

Use the PSI module of TOAD to compute the population stability index (PSI) of each feature.

feat_lst = list(test_data.columns[1:])  
psi_df = toad.metrics.PSI(train_data[feat_lst], test_data[feat_lst]).sort_values(0)  
psi_df
image.png
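  • PSI measures how stable the distribution of the validation sample across score bands is relative to that of the modelling sample. In modelling it is commonly used to screen feature variables and to assess model stability. An unstable model is an uncontrollable one; for the business this is an uncertainty risk that directly undermines the soundness of decisions, which is unacceptable.
  • Remove the features whose PSI is greater than 0.25.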
feat_lst.remove('installment_homeOwnership_ratio')  
feat_lst.remove('installment_purpose_ratio')  
feat_lst.remove('revolBal_issueDate_ratio')  
feat_lst.remove('revolBal_loanAmnt')  
feat_lst.remove('annualIncome_installment')  
feat_lst.remove('installment_issueDate_ratio')  
feat_lst.remove('installment_employmentLength_ratio')  
feat_lst.remove('revolUtil_issueDate_ratio')  
feat_lst.remove('revolBal_purpose_ratio')  
feat_lst.remove('revolBal_homeOwnership_ratio')  
feat_lst.remove('revolBal_employmentLength_ratio')  
feat_lst.remove('dti_issueDate_ratio')
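For reference, the PSI used for this filter can be sketched in a few lines of NumPy. For each bin it compares the share of the test sample with the share of the training sample and sums (actual - expected) * ln(actual / expected). This is a minimal sketch under assumptions, not TOAD's implementation: the function name `psi`, the 10 quantile bins, the `eps` floor and the synthetic data are all illustrative choices, and missing values are assumed to have been handled beforehand.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the expected (training) sample
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Use interior edges only, so out-of-range values land in the outermost bins
    inner = edges[1:-1]
    n = len(inner) + 1
    e_idx = np.searchsorted(inner, expected, side='right')
    a_idx = np.searchsorted(inner, actual, side='right')
    e_frac = np.bincount(e_idx, minlength=n) / len(expected)
    a_frac = np.bincount(a_idx, minlength=n) / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_feat = rng.normal(size=5000)   # synthetic "training" feature
test_same = train_feat.copy()        # identical distribution
test_shifted = train_feat + 2.0      # strongly shifted distribution

psi_same = psi(train_feat, test_same)
psi_shifted = psi(train_feat, test_shifted)
print(psi_same, psi_shifted)
```

An identical sample gives a PSI of zero, while the shifted sample lands far above the 0.25 cut-off used in this post.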

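【3】Building the Model

1. Model selection

a. Import the required modules

import xgboost as xgb
import lightgbm as lgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

b. Wrap the models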

def xgb_model(X_train, y_train, X_test, y_test):
    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
    train_matrix = xgb.DMatrix(X_train_split , label=y_train_split)
    valid_matrix = xgb.DMatrix(X_val , label=y_val)
    test_matrix = xgb.DMatrix(X_test)

    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'gamma': 1,
        'min_child_weight': 1.5,
        'max_depth': 5,
        'lambda': 10,
        'subsample': 0.7,
        'colsample_bytree': 0.7,
        'colsample_bylevel': 0.7,
        'eta': 0.04,
        'tree_method': 'exact',
        'seed': 2020,
        'n_jobs': -1,
        "silent": True,
    }
    watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
    model = xgb.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
    """Score on the validation set"""
    val_pred  = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
    fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print('AUC of the tuned xgboost single model on the validation set: {}'.format(roc_auc))
    """Predict on the test set"""
    test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
    fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
    roc_auc1 = metrics.auc(fpr, tpr)
    print('AUC of the tuned xgboost single model on the test set: {}'.format(roc_auc1))
    return test_pred

def lgb_model(X_train, y_train, X_test, y_test):
    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    
    # Best parameters found by tuning
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.01,
        'min_child_weight': 0.32,
        'num_leaves': 14,
        'max_depth': 4,
        'feature_fraction': 0.81,
        'bagging_fraction': 0.61,
        'bagging_freq': 9,
        'min_data_in_leaf': 13,
        'min_split_gain': 0.27,
        'reg_alpha': 9.58,
        'reg_lambda': 4.62,
        'seed': 2020,
        'n_jobs':-1,
        'silent': True,
        'verbose': -1,
    }
    
    model = lgb.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=500, early_stopping_rounds=500)
    """Score on the validation set"""
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print('AUC of the tuned lightgbm single model on the validation set: {}'.format(roc_auc))
    """Predict on the test set"""
    test_pred = model.predict(X_test, num_iteration=model.best_iteration)
    fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
    roc_auc1 = metrics.auc(fpr, tpr)
    print('AUC of the tuned lightgbm single model on the test set: {}'.format(roc_auc1))
    return test_pred

def cat_model(X_train, y_train, X_test, y_test):
    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
    model = CatBoostClassifier(iterations=2500,    cat_features=cate_features,eval_metric='AUC',logging_level='Verbose',
                               learning_rate=0.05, depth=6, l2_leaf_reg=5, loss_function='CrossEntropy')
    model.fit(X_train_split,y_train_split, eval_set=(X_val, y_val), plot=False)
    """Score on the validation set"""
    val_pred = model.predict_proba(X_val)[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print('AUC of the tuned catboost single model on the validation set: {}'.format(roc_auc))
    """Predict on the test set"""
    test_pred = model.predict_proba(X_test)[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
    roc_auc1 = metrics.auc(fpr, tpr)
    print('AUC of the tuned catboost single model on the test set: {}'.format(roc_auc1))
    return test_pred

c. Comparing the model scores

  • XGBoost
xgb_pred=xgb_model(Xtrain,Ytrain,Xtest, Ytest)
image.png
  • LightGBM
lgb_pred=lgb_model(Xtrain,Ytrain,Xtest, Ytest)
image.png
  • Catboost
cat_pred=cat_model(Xtrain,Ytrain,Xtest, Ytest)
image.png

Catboost, which scored 0.74+, was finally chosen for modelling.

2. Tuning the parameters
from sklearn.model_selection import GridSearchCV
params = {'depth': [2,5,8],
          'learning_rate' : [0.05,0.1,0.15],
          'l2_leaf_reg': [2,5,8],
          'iterations': [10000],
          'early_stopping_rounds':[300],
           'loss_function':['CrossEntropy','Logloss']  
         }
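# cate_features1 is not defined in this write-up; it is presumably a categorical feature list like cate_features in the training step
cb_estimator=CatBoostClassifier(cat_features=cate_features1,eval_metric='AUC',logging_level='Verbose')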
X_train, X_validation, y_train, y_validation = train_test_split(train_data.loc[:, feat_lst],  
train_data.loc[:, 'isDefault'],  test_size=0.125,random_state=2) 

cb_model = GridSearchCV(cb_estimator, param_grid = params, scoring="roc_auc", cv = 2)
cb_model.fit(X_train,y_train,eval_set=(X_validation,y_validation))

cb_model.best_params_
image.png

The model training that follows then uses the parameters in best_params_.

3. Training the model
model_lst = []  
# Categorical feature variables
cate_features = ['employmentTitle', 'employmentLength_bin', 'purpose', 'postCode', 'subGrade', 'earliesCreditLine_bin', \
'regionCode', 'title', 'issueDate_bin', 'term_bin',\
'interestRate_bin', 'annualIncome_bin', 'loanAmnt_bin','homeOwnership_bin',\
'revolBal_bin','dti_bin','installment_bin','revolUtil_bin']  
# Train three models on different train/validation splits; each column of pred_data holds one model's test predictions
pred_data=pd.DataFrame()
for i in range(3):  
    X_train, X_validation, y_train, y_validation = train_test_split(train_data.loc[:, feat_lst],  
train_data.loc[:, 'isDefault'],  
test_size=0.125 , random_state=i*1000)  
    model = CatBoostClassifier(iterations=10000,    cat_features=cate_features,eval_metric='AUC',logging_level='Verbose',  
learning_rate=0.1, depth=6, l2_leaf_reg=5, loss_function='CrossEntropy',early_stopping_rounds=500)  
    print(X_train.loc[:, feat_lst].shape,  
y_train.shape,  
X_validation.loc[:, feat_lst].shape,  
y_validation.shape)  
    model.fit(X_train.loc[:, feat_lst],y_train, eval_set=(X_validation.loc[:, feat_lst], y_validation), plot=False)  
    preds = model.predict_proba(test_data[feat_lst])[:, 1] 
    pred_data[i]=preds
pred_data
image.png

Assign suitable weights to the predictions

total_score=(0.7494391347+0.7497655984+0.7485861886)-0.74*3
first_weight=(0.7494391347-0.74)/total_score
second_weight=(0.7497655984-0.74)/total_score
third_weight=(0.7485861886-0.74)/total_score
pred_data['weight']=pred_data[0]*first_weight+pred_data[1]*second_weight+pred_data[2]*third_weight
pred_data
image.png
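The weighting above can be written as a small helper: each model's weight is its AUC in excess of a baseline, divided by the total excess, so that better models count more and the weights sum to 1. This is a minimal sketch; the function name `excess_auc_weights` is hypothetical, and it assumes the three scores are the validation AUCs of the three runs and that each exceeds the 0.74 baseline.

```python
def excess_auc_weights(aucs, baseline=0.74):
    """Weights proportional to how far each AUC exceeds the baseline."""
    excess = [a - baseline for a in aucs]
    if min(excess) <= 0:
        raise ValueError('every AUC must exceed the baseline')
    total = sum(excess)
    return [e / total for e in excess]

# The three scores used in the post
aucs = [0.7494391347, 0.7497655984, 0.7485861886]
weights = excess_auc_weights(aucs)
print(weights)

# Blend three (made-up) predicted probabilities for one loan
preds = [0.20, 0.30, 0.25]
blended = sum(w * p for w, p in zip(weights, preds))
print(blended)
```

With these scores the weights come out at roughly 0.34, 0.35 and 0.31, so the blend stays close to a plain average while favouring the best run slightly.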

Submitting the results

submit=pd.DataFrame(np.arange(800000,1000000,1),columns=['id'])
submit['isDefault']=pred_data['weight']
submit[['id','isDefault']].to_csv('submit.csv', index=False)

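In the end, the submission ranked 43rd out of 9,611 participants, roughly in the top 0.5%.
Limited computing power and time left plenty of room for improvement: for example, fine-tuning the model parameters to improve predictive performance, constructing more new features to improve accuracy, and stacking more models to improve generalization.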


image.png