该公司为借款服务平台,为广大小微型企业主、个体工商户、普通工薪阶层提供借款服务。采用多维数据对不同人群进行风险画像,实现对其风险进行充分评估,借助机器学习技术、大数据技术提升风险控制模型的迭代能力,提高风险控制效率。由于风控既要考虑风险,也要兼顾收益,风险控制模型要求用AUC评价模型的优劣,AUC越大,模型性能越优秀。
金融贷款违约预测思维导图
风控 金融违约预测.png
【一】理解数据
1.导入模块
import pandas as pd
import numpy as np
#Common Model Helpers
from sklearn import metrics
from sklearn.model_selection import train_test_split
#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif']=['SimHei'] #显示中文标签
plt.rcParams['axes.unicode_minus']=False
import gc
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
2.导入数据
df=pd.read_csv(r'D:/BaiduNetdiskDownload/天池/train.csv')
3.查看数据集信息
image.png
image.png
(1)数据量
- 训练集800,000行*47列(含标签列)
-
测试集200,000行*46列
image.png
image.png
(2)数据列均值、最大值、最小值及四分位值
#查看数据整体描述
df.describe()
image.png
f3=pd.melt(df,value_vars=numerical_feature)
g=sns.FacetGrid(f3,col="variable",col_wrap=4,sharex=False,sharey=False)
g=g.map(sns.boxplot,'value')
image.png
(3)数据空值数量和数据类型
df.info()
image.png
import toad
toad.detector.detect(df).sort_values('missing',ascending=False)
image.png
missing为缺失数据比例
df_isnull_sum=pd.DataFrame(df.isnull().sum())
df_isnull_sum=df_isnull_sum[df_isnull_sum[0]>0]
plt.figure(figsize=(20,5))
sns.set(style='darkgrid')
sns.barplot(y=df_isnull_sum.sort_values(0,ascending=False).index,x=df_isnull_sum.sort_values(0,ascending=False)[0],palette="Blues_r")
plt.title('Empty data summary')
image.png
上图为数据缺失量
(4)数据唯一值
for i in df.columns:
if df[i].nunique()<=1:
print(i,'列是唯一值列,只有数据:',df[i].nunique())
policyCode 列是唯一值列,只有数据: 1
toad.detector.detect(df).sort_values('unique',ascending=False)
image.png
(5)划分数字连续变量和离散变量
- 连续变量分布
def get_numercial_serial_features(data,feas):
numerical_serial_feature=[]
numerical_noserial_feature=[]
for fea in feas:
temp=data[fea].nunique()
if temp<=10:
numerical_noserial_feature.append(fea)
else:
numerical_serial_feature.append(fea)
return numerical_serial_feature,numerical_noserial_feature
numerical_feature=list(df.select_dtypes(exclude='object').columns)
category_feature=list(df.select_dtypes(include='object').columns)
label='isDefault'
numerical_feature.remove(label)
numerical_serial_feature,numerical_noserial_feature=get_numercial_serial_features(df,numerical_feature)
#数字特征分布可视化
plt.rcParams['font.sans-serif']=['SimHei'] #显示中文标签
plt.rcParams['axes.unicode_minus']=False
f = pd.melt(df, value_vars=numerical_serial_feature)
g = sns.FacetGrid(f, col="variable", col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot,"value", kde_kws = {'bw' : 1}) #如不够bandwith,一般在画图函数中设置,运用Facegrid的情况在g.map中设置bandwith
plt.xticks(rotation=90)
image.png
- 离散变量分布
f1=pd.melt(df.loc[:,numerical_noserial_feature],value_vars=numerical_noserial_feature)
g=sns.FacetGrid(f1,col='variable',hue="variable",col_wrap=4, sharex=False, sharey=False)
g=g.map(sns.countplot,'value')
image.png
- 文本变量
#f2,((ax1,ax2,ax3),(ax4),(ax5,ax6,ax7))=plt.subplots(3,3,figsize=(20,10))
plt.figure(figsize=(30,30))
sns.set(style='darkgrid')
ax1=plt.subplot(331)
sns.countplot(data=df,x='grade',order=sorted(df['grade'].unique()),ax=ax1)
ax1.set_xticklabels(sorted(df['grade'].unique()),fontsize=15)
ax1.set_title('Amount of Grades',fontsize=20)
ax2=plt.subplot(332)
sns.countplot(data=df,x='subGrade',order=sorted(df['subGrade'].unique()),ax=ax2)
ax2.set_xticklabels(sorted(df['subGrade'].unique()),rotation=90,fontsize=15)
ax2.set_title('Amount of subGrades',fontsize=20)
ax3=plt.subplot(333)
employmentLength_order=['< 1 year','1 year','2 years','3 years', '4 years','5 years','6 years','7 years','8 years','9 years','10+ years']
sns.countplot(data=df,x='employmentLength',order=employmentLength_order,ax=ax3)
ax3.set_xticklabels(employmentLength_order,fontsize=15,rotation=45)
ax3.set_title('Amount of employmentLengths',fontsize=20)
ax4=plt.subplot(312)
sns.countplot(df['issueDate'],order=sorted(df['issueDate'].unique()),ax=ax4)
plt.xticks(range(1,len(df['issueDate'].unique()),3),rotation=45,fontsize=15)
ax4.set_title('Amount of issueDate',fontsize=20,ha='center')
a=df['earliesCreditLine'].apply(lambda x:x.split('-')[1])
ax5=plt.subplot(313)
sns.countplot(a,order=sorted(a.unique()),ax=ax5)
plt.xticks(range(1,len(a.unique()),1),rotation=45,fontsize=15)
ax5.set_title('Amount of earliesCreditLine',fontsize=20,ha='center')
plt.subplots_adjust(wspace =0.2, hspace =0.3)
plt.show()
plt.tight_layout()
image.png
(6)违约和非违约客户在贷款利率、分期付款金额、年收入上的分布差异
image.png
【二】特征工程
1.数据预处理
根据数据可视化的情况,先对文本特征进行数值转换。值得注意的是,贷款等级和贷款等级之子级的转换时,需根据等级有序对应数值,不能用简单编码自动转换,否则在后续进行机器学习的时候,算法会误解上述两个特征的值的意义。
#贷款等级 映射字典
grade_dict = {'A':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':6}
#工作年限映射字典
employmentLength_dict = {'1 year':1,'10+ years':10,'2 years':2,'3 years':3,'4 years':4,
'5 years':5,'6 years':6,'7 years':7,'8 years':8,'9 years':9,'< 1 year':0}
#定义贷款等级之子级等级 变换函数
def get_sub_grade(grade, sub):
return grade*10+int(sub[1])
#贷款发放的月份 变换函数
def trans_issueDate(issueDate):
year,month,day = issueDate.split('-')
return int(year)*12+int(month)-1
#借款人最早报告的信用额度开立的月份 变换函数
def trans_earliesCreditLine(earliesCreditLine):
month_dict = {"Jan":1, "Feb":2, "Mar":3, "Apr":4, "May":5, "Jun":6, "Jul":7, "Aug":8, "Sep":9, "Oct":10, "Nov":11, "Dec":12}
month,year = earliesCreditLine.split('-')
month = month_dict[month]
return int(year)*12+month-1
for df in dfs:
print(df.shape)
df['grade'] = df['grade'].apply(lambda x: x if x not in grade_dict else grade_dict[x])
df['subGrade'] = df.apply(lambda row: get_sub_grade(row['grade'],row['subGrade']), axis=1)
df['employmentLength'] = df['employmentLength'].apply(lambda x: x if x not in employmentLength_dict else employmentLength_dict[x])
#df['issueYear'] = df['issueDate'].apply(lambda x: int(x.split('-')[0]))
df['issueDate'] = df['issueDate'].apply(lambda x: trans_issueDate(x))
df['earliesCreditLine'] = df['earliesCreditLine'].apply(lambda x: trans_earliesCreditLine(x))
df['dti'] = np.abs(df['dti'].fillna(1000))
2.特征构造
(1)构建业务特征
- date_Diff:
开立信用额度后多久才借款=贷款发放的月份-信用额度开立的月份 - installment_term_revolBal:
还款总金额与周转余额比=分期付款金额 * 12 * 贷款期限/信贷周转余额合计 - revolUtil_revolBal:
可用信贷额度与周转余额比=循环额度利用率,或借款人使用的相对于所有可用循环信贷的信贷金额/信贷周转余额合计 - openAcc_totalAcc:
未结信用数量与当前信用案件数量=借款人信用档案中未结信用额度的数量/借款人信用档案中当前的信用额度总数 - loanAmnt_dti_annualIncome:
在本公司贷款金额占个人全部债务金额比例=贷款金额/(债务收入比*年收入=总债务) - annualIncome_loanAmnt:
收入贷款金额比=年收入/贷款金额 - revolBal_loanAmnt:
信贷周转余额合计/贷款金额 - revolBal_installment:
信贷周转余额合计/分期付款金额 - annualIncome_installment:
年收入/分期付款金额
dfs=[train_data, test_data]
concated_df = pd.concat(dfs)
for df in dfs:
df['date_Diff'] = df['issueDate'] - df['earliesCreditLine']
df['installment_term_revolBal'] = df['installment']*12*df['term']/(df['revolBal']+0.1)
df['revolUtil_revolBal'] = df['revolUtil']/(df['revolBal']+0.1)
df['openAcc_totalAcc'] = df['openAcc']/df['totalAcc']
df['loanAmnt_dti_annualIncome'] = df['loanAmnt']/(np.abs(df['dti'])*df['annualIncome']+0.1)
df['employmentLength_bin'] = df['employmentLength']
df['issueDate_bin'] = df['issueDate']
df['earliesCreditLine_bin'] = df['earliesCreditLine']
df['term_bin'] = df['term']
df['homeOwnership_bin'] = df['homeOwnership']
df['annualIncome_loanAmnt'] = df['annualIncome']/(df['loanAmnt']+0.1)
df['revolBal_loanAmnt'] = df['revolBal']/(df['loanAmnt']+0.1)
df['revolBal_installment'] = df['revolBal']/(df['installment']+0.1)
df['annualIncome_installment'] = df['annualIncome']/(df['installment']+0.1)
(2)对连续特征进行分箱操作
label_lst = []
# 把分箱后的特征做为类别特征处理
#annualIncome、loanAmnt分成10份
bin_number = 10
for i in range(bin_number):
label_lst.append(i)
dfs[0]['annualIncome_bin'] = pd.qcut(concated_df['annualIncome'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['loanAmnt_bin'] = pd.qcut(concated_df['loanAmnt'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[1]['annualIncome_bin'] = pd.qcut(concated_df['annualIncome'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['loanAmnt_bin'] = pd.qcut(concated_df['loanAmnt'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
#interestRate、dti、installment、revolBal、revolUtil分成100份
label_lst = []
bin_number = 100
for i in range(bin_number):
label_lst.append(i)
dfs[0]['interestRate_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['dti_bin'] = pd.qcut(concated_df['dti'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['installment_bin'] = pd.qcut(concated_df['installment'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['revolBal_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[0]['revolUtil_bin'] = pd.qcut(concated_df['revolUtil'], bin_number, labels=label_lst,duplicates='drop')[:dfs[0].shape[0]]
dfs[1]['interestRate_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['dti_bin'] = pd.qcut(concated_df['dti'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['installment_bin'] = pd.qcut(concated_df['installment'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['revolBal_bin'] = pd.qcut(concated_df['revolBal'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
dfs[1]['revolUtil_bin'] = pd.qcut(concated_df['revolUtil'], bin_number, labels=label_lst,duplicates='drop')[dfs[0].shape[0]:]
(3)构建逻辑特征
- 通过连续特征['loanAmnt', 'installment', 'interestRate', 'annualIncome', 'dti', 'openAcc', 'revolBal', 'revolUtil', 'totalAcc']与类型特征['issueDate','employmentLength','purpose','homeOwnership']的交叉组合,构建新的特征。
- 以贷款金额loanAmnt和就业年限employmentLength为例,遍历employmentLength,例如当就业年限为2时,计算对应的贷款金额的中位数12000,则某客户的贷款金额和就业年限比率'loanAmnt_employmentLength_ratio'为某客户贷款金额loanAmnt/12000
for df in[train_data, test_data]:
for cate in cate_features:
df[cate] = df[cate].fillna(0).astype('int')
issueDate_lst = list(set(concated_df['issueDate']))
ratio_feat_lst = ['loanAmnt', 'installment', 'interestRate', 'annualIncome', 'dti', 'openAcc', \
'revolBal', 'revolUtil', 'totalAcc']
issueDate_lst = list(set(concated_df['issueDate']))
employmentLength_lst = list(set(concated_df['employmentLength']))
purpose_lst = list(set(concated_df['purpose']))
homeOwnership_lst = list(set(concated_df['homeOwnership']))
for feat in ratio_feat_lst:
issueDate_median = {}
issueDate_item_rank = {}
issueDate_label_mean = {}
for dt in issueDate_lst:
# 取最近6个月
mask = (concated_df['issueDate'] >= dt-3)&(concated_df['issueDate'] <= dt+3)
# 取最近6个月除去当前月份
mask_1 = (concated_df['issueDate'] >= dt-3)&(concated_df['issueDate'] <= dt+3)&(concated_df['issueDate'] != dt)
item_series = concated_df.loc[mask, feat]
label_series = concated_df.loc[mask_1, 'isDefault']
# 取最近6个月的中位数
issueDate_median[dt] = item_series.median()
issueDate_label_mean[dt] = label_series.mean()
item_rank = item_series.rank()/len(item_series)
issueDate_item_rank[dt] = {}
for item,rank in zip(item_series, item_rank):
issueDate_item_rank[dt][item] = rank
employmentLength_median = {}
for et in employmentLength_lst:
mask = concated_df['employmentLength'] == et
item_series = concated_df.loc[mask, feat]
employmentLength_median[et] = item_series.median()
purpose_median = {}
for pp in purpose_lst:
mask = concated_df['purpose'] == pp
item_series = concated_df.loc[mask, feat]
purpose_median[pp] = item_series.median()
homeOwnership_median = {}
for ho in homeOwnership_lst:
mask = concated_df['homeOwnership'] == ho
item_series = concated_df.loc[mask, feat]
homeOwnership_median[ho] = item_series.median()
for df in [train_data, test_data]:
print(feat, df.shape)
df['label_issueDate_mean'] = df['issueDate'].apply(lambda x: issueDate_label_mean[x])
df[feat+'_issueDate_median'] = df['issueDate'].apply(lambda x: issueDate_median[x])
#df['interestRate_ratio'] = df['interestRate']/df['interestRate_median']
df[feat+'_issueDate_ratio'] = df.fillna(0).apply(lambda r: issueDate_item_rank[r['issueDate']][r[feat]], axis=1)
df[feat+'_employmentLength_ratio'] = df.fillna(0).apply(lambda r: r[feat]/employmentLength_median[r['employmentLength']], axis=1)
df[feat+'_purpose_ratio'] = df.fillna(0).apply(lambda r: r[feat]/purpose_median[r['purpose']], axis=1)
df[feat+'_homeOwnership_ratio'] = df.fillna(0).apply(lambda r: r[feat]/homeOwnership_median[r['homeOwnership']], axis=1)
print(feat, df.shape)
image.png
共构造了67个新的特征
3.特征选择
利用TOAD的PSI模块寻找稳定性指标
feat_lst = list(test_data.columns[1:])
psi_df = toad.metrics.PSI(train_data[feat_lst], test_data[feat_lst]).sort_values(0)
psi_df
image.png
- PSI反映了验证样本在各分数段的分布与建模样本分布的稳定性。在建模中,我们常用来筛选特征变量、评估模型稳定性。如果模型不稳定,意味着模型不可控,对于业务本身而言就是一种不确定性风险,直接影响决策的合理性。这是不可接受的。
- 删除PSI大于0.25的特征。
feat_lst.remove('installment_homeOwnership_ratio')
feat_lst.remove('installment_purpose_ratio')
feat_lst.remove('revolBal_issueDate_ratio')
feat_lst.remove('revolBal_loanAmnt')
feat_lst.remove('annualIncome_installment')
feat_lst.remove('installment_issueDate_ratio')
feat_lst.remove('installment_employmentLength_ratio')
feat_lst.remove('revolUtil_issueDate_ratio')
feat_lst.remove('revolBal_purpose_ratio')
feat_lst.remove('revolBal_homeOwnership_ratio')
feat_lst.remove('revolBal_employmentLength_ratio')
feat_lst.remove('dti_issueDate_ratio')
【三】构建模型
1.模型选择
a.导入所需模块
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
b.封装模型
def xgb_model(X_train, y_train, X_test, y_test):
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = xgb.DMatrix(X_train_split , label=y_train_split)
valid_matrix = xgb.DMatrix(X_val , label=y_val)
test_matrix = xgb.DMatrix(X_test)
params = {
'booster': 'gbtree',
'objective': 'binary:logistic',
'eval_metric': 'auc',
'gamma': 1,
'min_child_weight': 1.5,
'max_depth': 5,
'lambda': 10,
'subsample': 0.7,
'colsample_bytree': 0.7,
'colsample_bylevel': 0.7,
'eta': 0.04,
'tree_method': 'exact',
'seed': 2020,
'n_jobs': -1,
"silent": True,
}
watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
model = xgb.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
"""计算在验证集上的得分"""
val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
roc_auc = metrics.auc(fpr, tpr)
print('调参后xgboost单模型在验证集上的AUC:{}'.format(roc_auc))
"""对测试集进行预测"""
test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
roc_auc1 = metrics.auc(fpr, tpr)
print('调参后xgboost单模型在测试集上的AUC:{}'.format(roc_auc1))
return test_pred
def lgb_model(X_train, y_train, X_test, y_test):
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)
# 调参后的最优参数
params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'auc',
'learning_rate': 0.01,
'min_child_weight': 0.32,
'num_leaves': 14,
'max_depth': 4,
'feature_fraction': 0.81,
'bagging_fraction': 0.61,
'bagging_freq': 9,
'min_data_in_leaf': 13,
'min_split_gain': 0.27,
'reg_alpha': 9.58,
'reg_lambda': 4.62,
'seed': 2020,
'n_jobs':-1,
'silent': True,
'verbose': -1,
}
model = lgb.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=500, early_stopping_rounds=500)
"""计算在验证集上的得分"""
val_pred = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
roc_auc = metrics.auc(fpr, tpr)
print('调参后lightgbm单模型在验证集上的AUC:{}'.format(roc_auc))
"""对测试集进行预测"""
test_pred = model.predict(X_test, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
roc_auc1 = metrics.auc(fpr, tpr)
print('调参后lightgbm单模型在测试集上的AUC:{}'.format(roc_auc1))
return test_pred
def cat_model(X_train, y_train, X_test, y_test):
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
model = CatBoostClassifier(iterations=2500, cat_features=cate_features,eval_metric='AUC',logging_level='Verbose',
learning_rate=0.05, depth=6, l2_leaf_reg=5, loss_function='CrossEntropy')
model.fit(X_train_split,y_train_split, eval_set=(X_val, y_val), plot=False)
"""计算在验证集上的得分"""
val_pred = model.predict_proba(X_val)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pred)
roc_auc = metrics.auc(fpr, tpr)
print('调参后lightgbm单模型在验证集上的AUC:{}'.format(roc_auc))
"""对测试集进行预测"""
test_pred = model.predict_proba(X_test)[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test,test_pred)
roc_auc1 = metrics.auc(fpr, tpr)
print('调参后lightgbm单模型在测试集上的AUC:{}'.format(roc_auc1))
return test_pred
c.模型分数对比
- XGBoost
xgb_pred=xgb_model(Xtrain,Ytrain,Xtest, Ytest)
image.png
- LightGBM
lgb_pred=lgb_model(Xtrain,Ytrain,Xtest, Ytest)
image.png
- Catboost
cat_pred=cat_model(Xtrain,Ytrain,Xtest, Ytest)
image.png
最终选择分数0.74+的Catboost建模。
2.调整参数
from sklearn.model_selection import GridSearchCV
params = {'depth': [2,5,8],
'learning_rate' : [0.05,0.1,0.15],
'l2_leaf_reg': [2,5,8],
'iterations': [10000],
'early_stopping_rounds':[300],
'loss_function':['CrossEntropy','Logloss']
}
cb_estimator=CatBoostClassifier(cat_features=cate_features1,eval_metric='AUC',logging_level='Verbose')
X_train, X_validation, y_train, y_validation = train_test_split(train_data.loc[:, feat_lst],
train_data.loc[:, 'isDefault'], test_size=0.125,random_state=2)
cb_model = GridSearchCV(cb_estimator, param_grid = params, scoring="roc_auc", cv = 2)
cb_model.fit(X_train,y_train,eval_set=(X_validation,y_validation))
cb_model.best_params_
image.png
在接下来训练模型的时候,则按照best_params_的参数进行训练。
3.训练模型
model_lst = []
#离散特征变量
cate_features = ['employmentTitle', 'employmentLength_bin', 'purpose', 'postCode', 'subGrade', 'earliesCreditLine_bin', \
'regionCode', 'title', 'issueDate_bin', 'term_bin',\
'interestRate_bin', 'annualIncome_bin', 'loanAmnt_bin','homeOwnership_bin',\
'revolBal_bin','dti_bin','installment_bin','revolBal_bin','revolUtil_bin']
#
pred_data=pd.DataFrame()
for i in range(3):
X_train, X_validation, y_train, y_validation = train_test_split(train_data.loc[:, feat_lst],
train_data.loc[:, 'isDefault'],
test_size=0.125 , random_state=i*1000)
model = CatBoostClassifier(iterations=10000, cat_features=cate_features,eval_metric='AUC',logging_level='Verbose',
learning_rate=0.1, depth=6, l2_leaf_reg=5, loss_function='CrossEntropy',early_stopping_rounds=500)
print(X_train.loc[:, feat_lst].shape,
y_train.shape,
X_validation.loc[:, feat_lst].shape,
y_validation.shape)
model.fit(X_train.loc[:, feat_lst],y_train, eval_set=(X_validation.loc[:, feat_lst], y_validation), plot=False)
preds = model.predict_proba(test_data[feat_lst])[:, 1]
pred_data[i]=preds
pred_data
image.png
给预测结果赋予合适的权重
total_score=(0.7494391347+0.7497655984+0.7485861886)-0.74*3
first_weight=(0.7494391347-0.74)/total_score
second_weight=(0.7497655984-0.74)/total_score
third_weight=(0.7485861886-0.74)/total_score
pred_data['weight']=pred_data[0]*first_weight+pred_data[1]*second_weight+pred_data[2]*third_weight
pred_data
image.png
上传结果
submit=pd.DataFrame(np.arange(800000,1000000,1),columns=['id'])
submit['isDefault']=pred_data['weight']
submit[['id','isDefault']].to_csv('submit.csv', index=False)
最终,在9611名参赛者中排名43名,成绩在top 0.5%左右。
由于计算机性能不足和时间有限,还有很多可以改进的空间,例如通过微小调整模型参数的手段,使模型预测性能更为优秀;通过建立更多的新特征,提升模型正确率;通过更多的模型stacking,提高模型泛化能力。
image.png