Analysis and Modeling: Notes on Everyday Problems (26)
2019.6.26~2019.7.30
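- Variable distribution stability
Adversarial validation: check whether the variable distributions of the training set and the test set are similar.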
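A minimal sketch of adversarial validation, assuming scikit-learn is available. Rows are labeled 0 (training set) or 1 (test set) and a classifier tries to tell them apart. An AUC near 0.5 suggests similar distributions, while an AUC near 1.0 suggests a shift. The helper name `adversarial_auc`, the random forest settings, and the synthetic data are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def adversarial_auc(train, test, features, seed=0):
    """Cross-validated AUC of a classifier that separates train rows from test rows."""
    X = pd.concat([train[features], test[features]], ignore_index=True)
    # Label 0 = row came from the training set, 1 = row came from the test set.
    y = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()


# Synthetic data (hypothetical): one test set matches the training set, one is shifted in V1.
rng = np.random.default_rng(0)
train = pd.DataFrame({"V1": rng.normal(0, 1, 500), "V2": rng.normal(0, 1, 500)})
test_same = pd.DataFrame({"V1": rng.normal(0, 1, 500), "V2": rng.normal(0, 1, 500)})
test_shift = pd.DataFrame({"V1": rng.normal(3, 1, 500), "V2": rng.normal(0, 1, 500)})

auc_same = adversarial_auc(train, test_same, ["V1", "V2"])
auc_shift = adversarial_auc(train, test_shift, ["V1", "V2"])
print("same distribution AUC:", round(auc_same, 3))
print("shifted distribution AUC:", round(auc_shift, 3))
```

When the AUC is high, the classifier's feature importances point to the variables whose distributions differ most between the two sets.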
- Variable interpretability
Shapley values (GitHub link).
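To make the idea concrete without depending on any particular library, here is a small sketch that computes exact Shapley values by enumerating every coalition of features. The toy model, the all-zero baseline, and the names `shapley_values`, `model`, and `value` are assumptions made for illustration. Exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, players):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            # Weight of a coalition of size r that does not contain p.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for coalition in combinations(others, r):
                s = set(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical toy model with one interaction term.
def model(x):
    return 2 * x["x1"] + 3 * x["x2"] + x["x1"] * x["x3"]


instance = {"x1": 1.0, "x2": 2.0, "x3": 4.0}


def value(coalition):
    # Features outside the coalition are replaced by the baseline value 0.
    masked = {k: (v if k in coalition else 0.0) for k, v in instance.items()}
    return model(masked)


phi = shapley_values(value, list(instance))
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction at the baseline, and the interaction term `x1 * x3` is split equally between `x1` and `x3`.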
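- Spurious correlation of variables
Run the analysis over multiple iterations and count how often each variable shows up among the most important ones. The few variables with the highest counts may be spuriously correlated with y.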
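One way this counting could look, as a sketch under assumptions: a random forest from scikit-learn is refit on a different 80% subsample in each round, and the variables in the top `k` by importance are tallied. The function name `top_k_counts`, the model choice, and the synthetic `leak` variable are all hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_k_counts(X, y, names, n_rounds=10, k=2, seed=0):
    """Count how often each variable lands in the top-k importances across rounds."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for i in range(n_rounds):
        # Each round uses a different 80% subsample and a different model seed.
        idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
        model = RandomForestClassifier(n_estimators=50, random_state=i)
        model.fit(X[idx], y[idx])
        top = np.argsort(model.feature_importances_)[::-1][:k]
        counts.update(names[j] for j in top)
    return counts


# Synthetic data (hypothetical): "leak" is almost a copy of y.
rng = np.random.default_rng(42)
n = 400
y = rng.integers(0, 2, n)
X = np.column_stack([
    y + rng.normal(0, 0.1, n),      # leak
    0.5 * y + rng.normal(0, 1, n),  # weak but genuine signal
    rng.normal(0, 1, n),            # noise
    rng.normal(0, 1, n),            # noise
])
names = ["leak", "signal", "noise1", "noise2"]

counts = top_k_counts(X, y, names)
print(counts.most_common())
```

A variable that tops the list in every round is a candidate for a closer look. The count alone does not prove that the relationship is spurious.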
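- Constructing derived variables
Derive new variables by slicing on categorical variables or on time.
Borrowing the idea of clustering, compute group-level statistics over multiple variables.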
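import pandas as pd
from tqdm import tqdm

sparse_feature = ['X1', 'X2', 'X3']  # categorical (discrete) variables
dense_feature = ['V1', 'V2', 'V3']   # continuous variables

# dat_all is the training DataFrame, assumed to be loaded already
dat_all[dense_feature] = dat_all[dense_feature].astype(float)

def get_new_columns(name, aggs):
    # Build flat column names such as X1_V1_mean, in the same order
    # that groupby(name).agg(aggs) produces its columns.
    l = []
    for k in aggs.keys():
        for agg in aggs[k]:
            if callable(agg):
                # custom aggregation functions get a generic suffix
                l.append(name + '_' + k + '_' + 'other')
            else:
                l.append(name + '_' + k + '_' + agg)
    return l

for d in tqdm(sparse_feature):
    aggs = {}
    for s in sparse_feature:
        aggs[s] = ['count', 'nunique']
    for den in dense_feature:
        aggs[den] = ['mean', 'max', 'min', 'std']
    aggs.pop(d)  # do not aggregate the grouping key itself
    temp = dat_all.groupby(d).agg(aggs).reset_index()
    temp.columns = [d] + get_new_columns(d, aggs)
    # attach the group-level statistics to every row of that group
    dat_all = pd.merge(dat_all, temp, on=d, how='left')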

# Same procedure for the test set dat_te
dat_te[dense_feature] = dat_te[dense_feature].astype(float)
for d in tqdm(sparse_feature):
    aggs = {}
    for s in sparse_feature:
        aggs[s] = ['count', 'nunique']
    for den in dense_feature:
        aggs[den] = ['mean', 'max', 'min', 'std']
    aggs.pop(d)
    temp = dat_te.groupby(d).agg(aggs).reset_index()
    temp.columns = [d] + get_new_columns(d, aggs)
    dat_te = pd.merge(dat_te, temp, on=d, how='left')
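The code above slices on categorical variables. Slicing on time could follow the same group-then-merge pattern. The sketch below assumes a hypothetical `event_time` column and uses calendar month as the slice. The column names and toy values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one timestamp column and one continuous variable.
df = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2019-06-26", "2019-06-30", "2019-07-01", "2019-07-15", "2019-07-30"]
    ),
    "V1": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Slice by calendar month, e.g. "2019-06".
df["month"] = df["event_time"].dt.to_period("M").astype(str)

# Per-month statistics of V1, with flat column names such as month_V1_mean.
monthly = (
    df.groupby("month")["V1"]
    .agg(["mean", "max", "min"])
    .add_prefix("month_V1_")
    .reset_index()
)

# Attach the monthly statistics to every row of that month.
df = df.merge(monthly, on="month", how="left")
print(df)
```

Other time slices (week, day of week, hour) can be swapped in by changing how the `month` column is derived.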
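- Solving a system of two linear equations in two unknowns
from sympy import Symbol, solve

x, y = Symbol('x'), Symbol('y')
a = 7
b = 8
# Solve a*x + b*y = 120 and x + y = 20; each expression is taken to equal 0
solve_ = solve([a * x + b * y - 120, x + y - 20], [x, y])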
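As a quick sanity check, the same system can be written with `Eq` and the solution substituted back into both equations. This is a small self-contained sketch, and the variable names `eqs`, `solution`, and `residuals` are illustrative.

```python
from sympy import Eq, solve, symbols

x, y = symbols("x y")
a, b = 7, 8

# The same system, written as explicit equations.
eqs = [Eq(a * x + b * y, 120), Eq(x + y, 20)]
solution = solve(eqs, [x, y])
print(solution)  # {x: 40, y: -20}

# Substitute the solution back in: every residual should be 0.
residuals = [eq.lhs.subs(solution) - eq.rhs for eq in eqs]
print(residuals)  # [0, 0]
```

By hand: substituting y = 20 - x into 7x + 8y = 120 gives 160 - x = 120, so x = 40 and y = -20.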