13. Clustering

Clustering

Unsupervised learning introduction

K-means algorithm

Input:

  • K (number of clusters)
  • Training set \{x^{(1)},x^{(2)},...,x^{(m)}\}\quad x^{(i)}\in R^{n}

K-means algorithm

Randomly initialize K cluster centroids \mu_1,\mu_2,...,\mu_K \in R^{n}

Repeat {
  for i = 1 to m
    c^{(i)} := index (from 1 to K) of the cluster centroid closest to x^{(i)}
  for k = 1 to K
    \mu_k := average (mean) of the points assigned to cluster k
}
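A minimal NumPy sketch of this loop, assuming X is an (m, n) matrix of training examples and the centroids are passed in as a (K, n) float array (an illustrative sketch, not the course's own code):

```python
import numpy as np

def k_means(X, centroids, num_iters=10):
    """Alternate the cluster-assignment and move-centroid steps."""
    for _ in range(num_iters):
        # Cluster assignment: c[i] = index of the centroid closest to x^(i).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroids: mu_k = mean of the points assigned to cluster k.
        for k in range(len(centroids)):
            if np.any(c == k):  # leave a centroid in place if its cluster is empty
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```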

K-means for non-separated clusters

Optimization Objective

K-means optimization objective

c^{(i)} = index of cluster (1,2,...,K) to which example x^{(i)} is currently assigned.
\mu_k = cluster centroid k (\mu_k\in R^{n})
\mu_{c^{(i)}} = cluster centroid of the cluster to which example x^{(i)} has been assigned

Distortion:
J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K) = \frac{1}{m}\sum_{i=1}^m \|x^{(i)}-\mu_{c^{(i)}}\|^2
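Under the same assumptions as the earlier sketch, the distortion can be computed directly; distortion is a hypothetical helper name:

```python
def distortion(X, c, centroids):
    # Mean squared distance from each example to its assigned centroid:
    # J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```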

Random Initalization

  • K < m
  • Randomly pick K training examples.
  • Set \mu_1,...,\mu_K equal to these K examples (see the sketch after this list).
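A sketch of this initialization, again assuming X is an (m, n) NumPy array:

```python
def random_init(X, K, rng=None):
    # Pick K distinct training examples as the initial centroids.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].astype(float)  # float copy so centroid means can be stored
```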

Run K-means many times with different random initializations:

For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get c^{(1)},...,c^{(m)},\mu_1,...,\mu_K.
  Compute the cost function (distortion):
  J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)
}

Pick clustering that gave lowest cost J.
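Putting the pieces together, a sketch of this multi-restart loop, reusing the hypothetical k_means, distortion, and random_init helpers from the earlier sketches:

```python
def k_means_best_of(X, K, restarts=100):
    best_J, best_c, best_centroids = np.inf, None, None
    for _ in range(restarts):
        # Randomly initialize and run K-means, then score it by distortion.
        c, centroids = k_means(X, random_init(X, K))
        J = distortion(X, c, centroids)
        if J < best_J:  # keep the clustering with the lowest cost J
            best_J, best_c, best_centroids = J, c, centroids
    return best_J, best_c, best_centroids
```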

This works well for roughly K = 2 to 10; when K is very large, running many random initializations tends to help less.

Choosing the number of clusters

  • By hand, by visualizing the data set.
  • Elbow method: compute the cost J for K = 1, 2, 3, ..., plot J against K, and choose the K at the "elbow" of the curve (see the sketch after this list).
  • Evaluate K-means by how well it serves a later, downstream purpose, and choose the K that best serves that purpose.
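A sketch of the elbow plot, assuming the helpers above, some data matrix X, and that matplotlib is available:

```python
import matplotlib.pyplot as plt

Ks = range(1, 11)
Js = [k_means_best_of(X, K)[0] for K in Ks]  # distortion J for each K
plt.plot(Ks, Js, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("distortion J")
plt.show()  # look for the "elbow" where J stops dropping quickly
```

Note that J decreases as K grows (given a good enough optimization), so the elbow is not always clear-cut.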