K-means cluster analysis
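The basic idea is similar to hierarchical clustering, but k-means needs to know the number of groups in advance, before it analyzes the data. The usual approach is therefore to run a hierarchical cluster analysis first to get a visual sense of how many clusters might exist, and then run k-means for each candidate number of clusters.
Using the earlier example, generate a set of data: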
> set.seed(1234)
> par(mar = c(0,0,0,0))
> x <- rnorm(12, mean=rep(1:3, each=4), sd=0.2)
> y <- rnorm(12, mean=rep(c(1,2,1), each=4), sd=0.2)
> plot(x, y, col="blue", pch=19, cex=2)
> text(x+0.05, y+0.05, labels = as.character(1:12))
Looking at the data, we can see that there are roughly three clusters.
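The visual impression can be checked with a hierarchical clustering before moving on to k-means. The following is a minimal sketch, assuming Euclidean distances and the default complete linkage of `hclust`; the object names `hc` and `groups` are introduced here for illustration only.

```r
# Recreate the 12 simulated points from the example above
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataframe <- data.frame(x = x, y = y)

# Hierarchical clustering on Euclidean distances (complete linkage by default)
hc <- hclust(dist(dataframe))
plot(hc)                      # dendrogram: long vertical gaps suggest where to cut

# Cut the tree into 3 groups and count the members of each group
groups <- cutree(hc, k = 3)
table(groups)
```

If the dendrogram suggests more than one plausible cut, each candidate number of groups can be passed to `kmeans` in turn.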
Now group the points with the k-means method:
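> # combine the simulated coordinates into the data frame that kmeans() works on
> dataframe <- data.frame(x = x, y = y)
> kcluster = kmeans(dataframe, 3)
> plot(dataframe, col=kcluster$cluster)
> # mark the three cluster centers, one color per cluster
> points(kcluster$centers, col = 1:3, pch = 8)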
The resulting plot shows how the points were actually assigned to clusters.
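Besides the plot, the fitted object itself can be inspected, and several candidate numbers of clusters can be compared. The sketch below assumes the same simulated data; `nstart = 10` and the range `1:5` are arbitrary choices made here, and `wss` is a name introduced for illustration.

```r
# Recreate the simulated data and fit k-means with 3 clusters
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataframe <- data.frame(x = x, y = y)
kcluster <- kmeans(dataframe, centers = 3, nstart = 10)

kcluster$cluster   # cluster label assigned to each of the 12 points
kcluster$centers   # coordinates of the 3 cluster centers
kcluster$size      # number of points in each cluster

# Total within-cluster sum of squares for k = 1..5
wss <- sapply(1:5, function(k) {
  kmeans(dataframe, centers = k, nstart = 10)$tot.withinss
})
plot(1:5, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

A clear bend in this curve is one common, informal hint for a reasonable number of clusters.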
Using the car data from the previous section, we run another cluster analysis:
> cars = read.delim(file = "http://www.stat.berkeley.edu/~s133/data/cars.tab", stringsAsFactors=FALSE)
> cars.use = cars[, -c(1,2)]
> medians = apply(cars.use, 2, median)
> mads = apply(cars.use, 2, mad)
> cars.use = scale(cars.use, center = medians, scale = mads)
> cars.kcluster = kmeans(cars.use, 3)
> cars.kcluster
The result is somewhat different from the one originally obtained with the hierarchical analysis.
> library(cluster)
> clusplot(cars.use, cars.kcluster$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
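The difference between the k-means and hierarchical results can be examined with a cross-tabulation of the two sets of labels. The sketch below is only an illustration: it uses a few numeric columns of R's built-in `mtcars` data as a stand-in for the car data (an assumption, so that it runs without a download), and the names `cars.demo`, `hc.groups`, `km` and `agreement` are introduced here. Since k-means starts from random centers, `set.seed` and `nstart` are used to make the result more stable.

```r
# Stand-in data (assumption): numeric columns of the built-in mtcars set
cars.demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Same robust scaling as above: center by median, scale by MAD
medians <- apply(cars.demo, 2, median)
mads <- apply(cars.demo, 2, mad)
cars.demo <- scale(cars.demo, center = medians, scale = mads)

# Hierarchical clustering cut into 3 groups
hc.groups <- cutree(hclust(dist(cars.demo)), k = 3)

# k-means with 3 clusters; nstart = 25 keeps the best of 25 random starts
set.seed(1234)
km <- kmeans(cars.demo, centers = 3, nstart = 25)

# Rows: hierarchical groups, columns: k-means clusters
agreement <- table(hierarchical = hc.groups, kmeans = km$cluster)
agreement
```

Cluster numbers are arbitrary labels, so the two methods agree when each row has most of its counts in a single column, not necessarily on the diagonal.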