为什么使用one-hot编码来处理离散型特征?
1、Why do we binarize categorical features?
We binarize the categorical input so that they can be thought of as a vector from the Euclidean space (we call this as embedding the vector in the Euclidean space).使用one-hot编码,将离散特征的取值扩展到了欧式空间,离散特征的某个取值就对应欧式空间的某个点。
2、Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification/regression/clustering etc. requires computing distances between features or similarities between features. And many definitions of distances and similarities are defined over features in Euclidean space. So, we would like our features to lie in the Euclidean space as well.将离散特征通过one-hot编码映射到欧式空间,是因为,在回归,分类,聚类等机器学习算法中,特征之间距离的计算或相似度的计算是非常重要的,而我们常用的距离或相似度的计算都是在欧式空间的相似度计算,计算余弦相似性,基于的就是欧式空间。
3、Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1,2,3.
Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the euclidean distance between x_1 and x_2, x_2 and x_3 & x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This shows that distance between job type 1 and job type 2 is smaller than job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we can properly define distance between different values that the categorical feature takes. In such cases, isn't it fair to assume that all categorical features are equally far away from each other?
Now, let us see what happens when we binary the same feature vectors. Then, x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). Now, what are the distances between them? They are sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical features are equally away from each other.
将离散型特征使用one-hot编码,确实会让特征之间的距离计算更加合理。比如,有一个离散型特征,代表工作类型,该离散型特征,共有三个取值,不使用one-hot编码,其表示分别是x_1 = (1), x_2 = (2), x_3 = (3)。两个工作之间的距离是,(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2。那么x_1和x_3工作之间就越不相似吗?显然这样的表示,计算出来的特征的距离是不合理。那如果使用one-hot编码,则得到x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1),那么两个工作之间的距离就都是sqrt(2).即每两个工作之间的距离是一样的,显得更合理。
将离散型特征进行one-hot编码的作用,是为了让距离计算更合理,但如果特征是离散的,并且不用one-hot编码就可以很合理的计算出距离,那么就没必要进行one-hot编码,比如,该离散特征共有1000个取值,我们分成两组,分别是400和600,两个小组之间的距离有合适的定义,组内的距离也有合适的定义,那就没必要用one-hot 编码。
离散特征进行one-hot编码后,编码后的特征,其实每一维度的特征都可以看做是连续的特征。就可以跟对连续型特征的归一化方法一样,对每一维特征进行归一化。比如归一化到[-1,1]或归一化到均值为0,方差为1。
有些情况不需要进行特征的归一化:
It depends on your ML algorithms, some methods requires almost no efforts to normalize features or handle both continuous and discrete features, like tree based methods: c4.5, Cart, random Forrest, bagging or boosting. But most of parametric models (generalized linear models, neural network, SVM,etc) or methods using distance metrics (KNN, kernels, etc) will require careful work to achieve good results. Standard approaches including binary all features, 0 mean unit variance all continuous features, etc。
基于树的方法是不需要进行特征的归一化,例如随机森林,bagging 和 boosting等。基于参数的模型或基于距离的模型,都是要进行特征的归一化。
Tree Model不太需要one-hot编码
对于决策树来说,one-hot的本质是增加树的深度
tree-model是在动态的过程中生成类似 One-Hot + Feature Crossing 的机制
- 一个特征或者多个特征最终转换成一个叶子节点作为编码 ,one-hot可以理解成三个独立事件
- 决策树是没有特征大小的概念的,只有特征处于他分布的哪一部分的概念
one-hot可以解决线性可分问题 但是比不上label econding
one-hot降维后的缺点:
降维前可以交叉的降维后可能变得不能交叉
树模型的训练过程:
从根节点到叶子节点整条路中有多少个节点相当于交叉了多少次,所以树的模型是自行交叉
eg:是否是长的 { 否(是→ 柚子,否 → 苹果) ,是 → 香蕉 } 园 cross 黄 → 形状 (圆,长) 颜色 (黄,红) one-hot度为4的样本
使用树模型的叶子节点作为特征集交叉结果可以减少不必要的特征交叉的操作 或者减少维度和degree候选集
eg 2 degree → 8的特征向量 树 → 3个叶子节点
树模型:Ont-Hot + 高degree笛卡尔积 + lasso 要消耗更少的计算量和计算资源
这就是为什么树模型之后可以stack线性模型
nm的输入样本 → 决策树训练之后可以知道在哪一个叶子节点上 → 输出叶子节点的index → 变成一个n1的矩阵 → one-hot编码 → 可以得到一个n*o的矩阵(o是叶子节点的个数) → 训练一个线性模型
典型的使用: GBDT + RF
优点 : 节省做特征交叉的时间和空间
如果只使用one-hot训练模型,特征之间是独立的
对于现有模型的理解:(G(l(张量))):
其中:l(·)为节点的模型
G(·)为节点的拓扑方式
神经网络:l(·)取逻辑回归模型
G(·)取全连接的方式
决策树: l(·)取LR
G(·)取树形链接方式
创新点: l(·)取 NB,SVM 单层NN ,等
G(·)取怎样的信息传递方式