13. Clustering

Clustering

Unsupervised learning introduction

K-means algorithm

Input:

  • K (number of clusters)
  • Training set \{x^{(1)},x^{(2)},...,x^{(m)}\}\quad x^{(i)}\in R^{n}

K-means algorithm

Randomly initialize K cluster centroids \mu_1,\mu_2,...,\mu_K \in R^{n}

Repeat {
  for i = 1 to m
    c^{(i)} := index (from 1 to K) of the cluster centroid closest to x^{(i)}
  for k = 1 to K
    \mu_k := average (mean) of the points assigned to cluster k
}
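A minimal NumPy sketch of this loop, assuming X is an (m, n) matrix of training examples and the centroids are passed in as a (K, n) float array (an illustrative sketch, not the course's own code):

```python
import numpy as np

def k_means(X, centroids, num_iters=10):
    """Alternate the cluster-assignment and move-centroid steps."""
    for _ in range(num_iters):
        # Cluster assignment: c[i] = index of the centroid closest to x^(i).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroids: mu_k = mean of the points assigned to cluster k.
        for k in range(len(centroids)):
            if np.any(c == k):  # leave a centroid in place if its cluster is empty
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```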

K-means for non-separated clusters

Optimization Objective

K-means optimization objective

c^{(i)} = index of cluster (1,2,...,K) to which example x^{(i)} is currently assigned.
\mu_k = cluster centroid k (\mu_k\in R^{n})
\mu_{c^{(i)}} = cluster centroid of the cluster to which example x^{(i)} has been assigned

Distortion:
J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K) = \frac{1}{m}\sum_{i=1}^m \|x^{(i)}-\mu_{c^{(i)}}\|^2
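Under the same assumptions as the earlier sketch, the distortion can be computed directly; distortion is a hypothetical helper name:

```python
def distortion(X, c, centroids):
    # Mean squared distance from each example to its assigned centroid:
    # J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```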

Random Initalization

  • K < m
  • Randomly pick K training examples.
  • Set \mu_1,...,\mu_K equal to these K examples (see the sketch after this list).
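A sketch of this initialization, again assuming X is an (m, n) NumPy array:

```python
def random_init(X, K, rng=None):
    # Pick K distinct training examples as the initial centroids.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].astype(float)  # float copy so centroid means can be stored
```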

Run K-means many times with different random initializations:

For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get c^{(1)},...,c^{(m)},\mu_1,...,\mu_K.
  Compute the cost function (distortion):
  J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K)
}

Pick clustering that gave lowest cost J.
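Putting the pieces together, a sketch of this multi-restart loop, reusing the hypothetical k_means, distortion, and random_init helpers from the earlier sketches:

```python
def k_means_best_of(X, K, restarts=100):
    best_J, best_c, best_centroids = np.inf, None, None
    for _ in range(restarts):
        # Randomly initialize and run K-means, then score it by distortion.
        c, centroids = k_means(X, random_init(X, K))
        J = distortion(X, c, centroids)
        if J < best_J:  # keep the clustering with the lowest cost J
            best_J, best_c, best_centroids = J, c, centroids
    return best_J, best_c, best_centroids
```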

This works well for roughly K = 2 to 10; when K is very large, running many random initializations tends to help less.

Choosing the number of clusters

  • By hand, by visualizing the data set.
  • Elbow method: compute the cost J for K = 1, 2, 3, ..., plot J against K, and choose the K at the "elbow" of the curve (see the sketch after this list).
  • Evaluate K-means by how well it serves a later, downstream purpose, and choose the K that best serves that purpose.
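A sketch of the elbow plot, assuming the helpers above, some data matrix X, and that matplotlib is available:

```python
import matplotlib.pyplot as plt

Ks = range(1, 11)
Js = [k_means_best_of(X, K)[0] for K in Ks]  # distortion J for each K
plt.plot(Ks, Js, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("distortion J")
plt.show()  # look for the "elbow" where J stops dropping quickly
```

Note that J decreases as K grows (given a good enough optimization), so the elbow is not always clear-cut.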