Reference: *Text Data Mining* (《文本数据挖掘》)
1. Similarity Computation
p_load(stringdist)
# Values closer to 1 indicate higher similarity
stringsim("hello", "Hello", method = "lv")
## [1] 0.8
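stringsim() supports several other distance measures through its method argument. A short sketch comparing a few of them on the same pair (method names as documented in the stringdist package):

```r
library(stringdist)

# The same pair of strings under different similarity measures
stringsim("hello", "Hello", method = "lv")   # Levenshtein: 1 - 1/5 = 0.8
stringsim("hello", "Hello", method = "jw")   # Jaro-Winkler
stringsim("hello", "Hello", method = "osa")  # optimal string alignment
```

Jaro-Winkler tends to score strings that share a common prefix more generously, which can matter when comparing product codes.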
# Use dist() to compute pairwise distances between the first four columns of iris
# The default is Euclidean distance
dist(t(iris[, 1:4]))
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Width 36.15785
## Petal.Length 28.96619 25.77809
## Petal.Width 57.18304 25.86407 33.86473
# Use the Pearson correlation coefficient to measure similarity between variables
p_load(apcluster)
corSimMat(t(iris[, 1:4]))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
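Since corSimMat() here reduces to the Pearson correlation matrix, base R's cor() reproduces the same values without the apcluster package:

```r
# Pearson correlation of the four iris measurements,
# matching the corSimMat output above
round(cor(iris[, 1:4], method = "pearson"), 7)
```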
2. Clustering Methods
Partitioning methods: k-means clustering, k-medoids clustering, etc.
Hierarchical methods: agglomerative clustering and divisive clustering
p_load(tokenizers)
str_vec <- df$sku_name[1:5] %>%
paste0(collapse = " ") %>%
tokenize_words(strip_punct = T, strip_numeric = T,
simplify = T)
# Compute the pairwise Levenshtein distance matrix
d <- adist(str_vec)
# If the variables are on different scales or have large ranges, center and standardize first
# d_scale <- scale(d)
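The data frame df is not defined in this excerpt, so as a self-contained sketch, two made-up product names (hypothetical data) run the same tokenize-then-adist pipeline:

```r
library(tokenizers)

# Hypothetical product names standing in for df$sku_name
skus <- c("50ml plastic centrifuge tube", "clear test vial")
toks <- tokenize_words(paste(skus, collapse = " "),
                       strip_punct = TRUE, simplify = TRUE)
# Pairwise Levenshtein distances between the tokens
d_demo <- adist(toks)
dim(d_demo)              # square matrix: one row/column per token
adist("tube", "tubing")  # 3 edits: substitute e->i, insert n, insert g
```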
2.1 K-means Clustering
# Determine the number of clusters
p_load(factoextra)
# method sets the selection criterion; "wss" is the total within-cluster sum of squares
fviz_nbclust(d, kmeans, method = "wss")
The within-cluster sum of squares decreases only slightly beyond k = 3, so we set k = 3.
km <- kmeans(d, centers = 3)
# View cluster assignments
km$cluster
## [1] 1 2 2 1 3 2 2 1 2 2 1 3 2 1 1 1 1 2 2 1 1 3 1 3 3 1 2 2 3 2 2 2 1 2 2 1 2 2 2 3 2 2 2 2
## [45] 2 2 2 2 2 1 2 2 1 2 2 2 2 3 2 1 1 3 1 2 2 2 2 1 1 2 1 2 2 1 3 2 1 1 2
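kmeans() starts from randomly chosen centers, so the labels can differ between runs; fixing the seed and using several restarts (nstart) makes the assignment reproducible. A toy sketch on a few hypothetical strings:

```r
set.seed(1234)
words <- c("tube", "tubes", "tubing", "vial", "vials", "box")
d_toy <- adist(words)
# nstart = 25 tries 25 random initializations and keeps the best solution
km_toy <- kmeans(d_toy, centers = 2, nstart = 25)
split(words, km_toy$cluster)  # group the strings by assigned cluster
```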
# Show which cluster each string belongs to
cbind(class = km$cluster, string = str_vec)
# Present the clustering result with a PCA-based visualization
fviz_cluster(km,
data = d,
# ellipse type
ellipse.type = "euclid",
# avoid overlapping labels
repel = T,
# plot theme
ggtheme = theme_minimal())
2.2 The PAM Algorithm
PAM (Partitioning Around Medoids) mitigates some of the drawbacks of k-means clustering.
p_load(cluster)
# Use the PAM algorithm to determine the optimal number of clusters
fviz_nbclust(d, pam, method = "silhouette")
The optimal number of clusters is 2.
pam <- pam(d, 2)
cbind(class = pam$clustering, string = str_vec)
## class string
## [1,] "1" "1000pcs"
## [2,] "1" "32mm"
## [3,] "1" "0.5ml"
## [4,] "1" "plastic"
## [5,] "2" "centrifuge"
## [6,] "1" "tube"
## [7,] "1" "test"
## [8,] "1" "tubing"
## [9,] "1" "vial"
## [10,] "1" "clear"
## [11,] (remaining rows omitted)
# Visualize the clustering result
fviz_cluster(pam, ellipse.type = "euclid",
repel = T,
ggtheme = theme_classic())
# The fpc package offers an efficient implementation of the k-medoids method
p_load(fpc)
# Search k from 1 to 10 and return the best clustering
pam2 <- pamk(d, krange = 1:10)
pam2$nc
## [1] 2
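Unlike k-means centroids, PAM centers are actual observations (medoids), which is what makes it less sensitive to outliers. A self-contained sketch on toy strings shows how to inspect them:

```r
library(cluster)

words <- c("tube", "tubes", "tubing", "vial", "vials", "box")
d_toy <- adist(words)
rownames(d_toy) <- words
pam_toy <- pam(d_toy, k = 2)
pam_toy$medoids              # the observations chosen as cluster centers
pam_toy$silinfo$avg.width    # average silhouette width (the criterion used above)
```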
Although PAM addresses many of the shortcomings of k-means, it demands a lot of memory and running time on large data sets. The CLARA (Clustering Large Applications) algorithm was proposed to solve this problem.
2.3 The CLARA Algorithm
# Determine the optimal number of clusters
fviz_nbclust(d, clara, method = "silhouette") +
theme_classic()
# Cluster analysis
clara_res <- clara(d, 2,
# set the size of each sampled subset
samples = 50, pamLike = T)
# Show the clustering result
cbind(class = clara_res$clustering, string = str_vec)
## class string
## [1,] "1" "1000pcs"
## [2,] "2" "32mm"
## [3,] "2" "0.5ml"
## [4,] "1" "plastic"
## [5,] "1" "centrifuge"
## [6,] "2" "tube"
## [7,] "2" "test"
## [8,] "2" "tubing"
## [9,] "2" "vial"
## (remaining rows omitted)
# Visualization
fviz_cluster(clara_res, ellipse.type = "euclid",
repel = T, ggtheme = theme_classic())
2.4 Hierarchical Clustering
2.4.1 Agglomerative Method
p_load(cluster)
# Name the rows of the distance matrix so the results are easier to read
rownames(d) <- str_vec
# Clustering
# stand controls standardization (default FALSE); metric sets the sample distance measure (default "euclidean", i.e. Euclidean distance)
# method sets the linkage method (default "average")
res_agnes <- agnes(d)
# Inspect the clustering result
res_agnes
## Call: agnes(x = d)
## Agglomerative coefficient: 0.8408839
## Order of objects:
## [1] 1000pcs 1000pcs plastic plastic plastic plastic nonstick
## [8] 32mm 22mm 33mm 50ml 26ml 0.5ml 0.2ml
## [15] tube tube style size hinge with mini
## [22] tin test case pcr dab jars jars
## [29] home cork gold lot x box box
## [36] box boxes vial vials vials small small
## [43] glass glass clear candy zakka empty metal
## [50] 12pcs 10pcs tubing design wedding garden casket
## [57] silver novelty newest bottles bottles bottles storage
## [64] storage storage storage protable centrifuge centrifuge container
## [71] container containers organizer gardening capacity silicone households
## [78] transparent transparent
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 8.570 8.552 12.258 31.678
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method"
## [8] "order.lab" "data"
agnes() derives the hierarchical relationships among the samples but does not assign cluster labels directly. To obtain labels, specify the number of clusters and use cutree():
group_info <- cutree(res_agnes, k = 2)
group_info
## [1] 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1
## [45] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1
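cutree() returns one integer label per observation; tabulating it gives the cluster sizes, and the same labels can split the strings into groups. A toy sketch mirroring the steps above:

```r
library(cluster)

words <- c("tube", "tubes", "tubing", "vial", "vials", "box")
d_toy <- adist(words)
rownames(d_toy) <- words
res <- agnes(d_toy)
groups <- cutree(res, k = 2)  # cut the tree into two clusters
table(groups)                 # cluster sizes
split(words, groups)          # members of each cluster
```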
# Plot the dendrogram without cutting into groups
fviz_dend(res_agnes)
# Cut into groups
fviz_dend(res_agnes, k = 2,
# label size
cex = 0.5,
# cluster colors
k_colors = c("#FC4E07", "#00AFBB"),
# color labels by cluster
color_labels_by_k = T,
# draw rectangles around the clusters
rect = T)
# Visualize the result with PCA
fviz_cluster(list(data = as_tibble(d), cluster = group_info),
palette = c("#FC4E07", "#00AFBB"),
ellipse.type = "convex",
repel = T,
show.clust.cent = F,
ggtheme = theme_minimal())
2.4.2 Divisive Method
Simply replace the agnes() call with diana().
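A minimal divisive sketch on the same toy strings as before; diana() reports a divisive coefficient dc, analogous to the agglomerative coefficient from agnes():

```r
library(cluster)

words <- c("tube", "tubes", "tubing", "vial", "vials", "box")
d_toy <- adist(words)
rownames(d_toy) <- words
res_diana <- diana(d_toy)
res_diana$dc                        # divisive coefficient
groups <- cutree(res_diana, k = 2)  # cut into two clusters, as with agnes
split(words, groups)
```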