2024-03-27 k-means Clustering

Introduction

Clustering algorithms are unsupervised machine learning models: they learn an optimal partition of the data, or a set of discrete labels, directly from the intrinsic properties of the data themselves.
The simplest clustering algorithm, k-means, rests on two assumptions:

  • cluster center: the arithmetic mean of all points belonging to that cluster
  • each point is closer to its own cluster center than to any other cluster center
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

k-means identifies the four clusters automatically. Under the hood it uses the expectation–maximization (EM) algorithm:

  1. Guess some initial cluster centers
  2. Repeat until convergence:
    • E-step (expectation): assign each point to its nearest cluster center
    • M-step (maximization): set each cluster center to the mean of its assigned points
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        
        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
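Because each EM run converges only to a local optimum, the final clustering can depend on the initial guess. The following quick check is an addition (not from the original text): it uses scikit-learn's KMeans restricted to a single random initialization per run and prints the best and worst inertia (within-cluster sum of squares) over ten seeds. On well-separated blobs like these the spread is often small, but nothing guarantees any single run finds the global optimum.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# n_init=1 disables the usual best-of-several-restarts behavior, so each
# run keeps whatever local optimum its random initialization leads to.
inertias = [KMeans(n_clusters=4, n_init=1, init='random', random_state=s).fit(X).inertia_
            for s in range(10)]
print(min(inertias), max(inertias))
```

In practice scikit-learn mitigates this by default: n_init restarts the algorithm several times and keeps the lowest-inertia solution.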

Drawbacks of k-means:

  • The result is not guaranteed to be the global optimum
  • The number of clusters must be specified in advance
  • It can only draw linear cluster boundaries
  • It can be slow on large data sets

For nonlinear boundaries, a kernel transform can project the data into a higher-dimensional space; SpectralClustering uses a nearest-neighbors graph to compute such a higher-dimensional representation of the data, then assigns labels with k-means:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)

labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis');
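To put a number on the difference, both labelings can be scored against the true moon labels with the adjusted Rand index (this comparison is an addition, not from the original text): k-means with its linear boundary splits each moon, while spectral clustering recovers them almost perfectly.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same two-moons data as above
X, y = make_moons(200, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               assign_labels='kmeans',
                               random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means a perfect match with the true labels,
# 0.0 means no better than random assignment
km_ari = adjusted_rand_score(y, km_labels)
sc_ari = adjusted_rand_score(y, sc_labels)
print(km_ari, sc_ari)
```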

Example: handwritten digits

Cluster the 1,797 64-dimensional samples into 10 groups, then inspect the cluster centers, the accuracy, and the confusion matrix.

from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

from scipy.stats import mode

# k-means knows nothing about digit identity, so map each cluster
# to the most common true digit found within it
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

from sklearn.metrics import accuracy_score
print(accuracy_score(digits.target, labels))
plt.figure()
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

Preprocessing with t-distributed stochastic neighbor embedding (t-SNE) to project the 64 dimensions down to 2 improves the accuracy:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print(accuracy_score(digits.target, labels))

Example: color compression of an image

The image is stored as a three-dimensional (height, width, RGB) array; each element is an integer from 0 to 255 encoding the red, green, and blue intensities. The exact shape is (427, 640, 3).
Applying k-means clustering to the pixel space (the feature matrix) reduces the 256^3 ≈ 16 million possible colors to just 16. The MiniBatchKMeans variant, which updates the centers from subsets of the data set, is used here because it is much faster.

from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
print(china.shape)

data = china / 255  # scale pixel values to [0, 1]
data = data.reshape(427 * 640, 3)
print(data.shape)

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
china_recolored = new_colors.reshape(china.shape)
fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);
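A rough storage estimate shows why this counts as compression. The arithmetic below is an addition and assumes raw bitmap storage with no entropy coding: the original needs 24 bits per pixel, while the 16-color version needs only a 4-bit palette index per pixel plus a tiny 16-entry RGB palette.

```python
# Rough storage estimate (assumption: raw bitmap, no entropy coding)
h, w = 427, 640
original_bits = h * w * 24            # 8 bits per RGB channel
compressed_bits = h * w * 4 + 16 * 24  # 4-bit index + 16-color palette
ratio = original_bits / compressed_bits
print(round(ratio, 2))  # → 6.0
```

Real image formats such as PNG or JPEG apply further entropy coding, so actual file sizes would differ, but the ~6x factor captures the per-pixel saving.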

Reference:
[1] VanderPlas, Jake. Python Data Science Handbook [M]. Posts & Telecom Press, 2018.
