Properties of Clusters
1. All the data points in a cluster should be similar to each other.

If the customers in a particular cluster are not similar to each other, then their requirements might vary, right? If the bank gives them the same offer, they might not like it, and their interest in the bank might reduce. Not ideal.
Having similar data points within the same cluster helps the bank to use targeted marketing. You can think of similar examples from your everyday life and consider how clustering will (or already does) impact the business strategy.
2. The data points from different clusters should be as different as possible.

Which of these cases do you think will give us the better clusters? If you look at case I:
Customers in the red and blue clusters are quite similar to each other. The top four points in the red cluster share properties similar to those of the top two customers in the blue cluster: both groups have high incomes and high debt values. Yet we have put them in different clusters.
Whereas, if you look at case II:
Points in the red cluster differ completely from the customers in the blue cluster. All the customers in the red cluster have high income and high debt, while the customers in the blue cluster have high income and low debt. Clearly, we have a better clustering of customers in this case.
Hence, data points from different clusters should be as different from each other as possible to have more meaningful clusters. The k-means algorithm uses an iterative approach to find the optimal cluster assignments by minimizing the sum of squared distances between data points and their assigned cluster centroid.
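This intuition can be checked numerically. Below is a small sketch (with made-up income and debt values on a 0–1 scale) that computes the within-cluster sum of squared distances for two candidate assignments of the same customers; the cleaner separation yields the smaller value:

```python
import numpy as np

# Hypothetical (income, debt) values, for illustration only
points = np.array([
    [0.90, 0.90], [0.85, 0.80], [0.80, 0.85], [0.90, 0.80],  # high income, high debt
    [0.90, 0.10], [0.85, 0.15],                              # high income, low debt
])

def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    sse = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        sse += ((members - members.mean(axis=0)) ** 2).sum()
    return sse

# Case I: similar customers end up split across the two clusters
case1 = np.array([0, 0, 0, 1, 0, 1])
# Case II: high-debt and low-debt customers are separated cleanly
case2 = np.array([0, 0, 0, 0, 1, 1])

print(within_cluster_sse(points, case1))  # larger value -> worse clusters
print(within_cluster_sse(points, case2))  # smaller value -> better clusters
```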
Applications of Clustering in Real-World Scenarios
1. Customer Segmentation
We covered this earlier – one of the most common applications of clustering is customer segmentation. And it isn’t just limited to banking. This strategy is used across industries, including telecom, e-commerce, sports, advertising, sales, etc.
2. Document Clustering
This is another common application of clustering. Let’s say you have multiple documents and you need to group similar documents together. Clustering helps us do exactly that, so that similar documents end up in the same clusters.
3. Image Segmentation
We can also use clustering to perform image segmentation. Here, we group similar pixels in the image together, so that pixels in the same cluster share similar properties such as color or intensity.
4. Recommendation Engines
Clustering can also be used in recommendation engines. Let’s say you want to recommend songs to your friends. You can look at the songs liked by a friend, use clustering to find similar songs, and finally recommend the most similar ones.

What Is K-Means Clustering?
K-Means is a distance-based unsupervised learning clustering algorithm that:
partitions the data into K clusters so that the sum of the distances from each point to the center of its assigned cluster is minimized.
Keep these three keywords in mind:
- K (specified in advance)
- Cluster center = Mean
- Minimize distance
K-means clustering is a method for grouping n observations into K clusters. It uses vector quantization and aims to assign each observation to the cluster with the nearest mean or centroid, which serves as a prototype for the cluster. Originally developed for signal processing, K-means clustering is now widely used in machine learning to partition data points into K clusters based on their similarity. The goal is to minimize the sum of squared distances between the data points and their corresponding cluster centroids, resulting in clusters that are internally homogeneous and distinct from each other.
What Objective Is K-Means Optimizing?
It minimizes the within-cluster sum of squared distances: the total, over all data points, of the squared distance between each point and the centroid of the cluster it is assigned to.
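This objective can be made concrete with a short sketch on synthetic data (the 3-cluster setting here is an assumption for the demo): scikit-learn reports exactly this sum of squared distances as the fitted model's `inertia_` attribute, which we can verify by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# J = sum over all points of the squared distance to the assigned centroid
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
J = ((X - assigned_centers) ** 2).sum()

# scikit-learn reports exactly this quantity as `inertia_`
print(np.isclose(J, kmeans.inertia_))
```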
How to Apply K-Means Clustering Algorithm?
Step 1️⃣: Randomly initialize the cluster centers
- Pick K random points in the feature space
- Use them as the initial cluster centers
📌 Note:
- The initial centers are usually not the final ones
- The initialization affects the result (one of the algorithm's important drawbacks)
Step 2️⃣: Assignment
For every data point:
compute its distance to each cluster center → assign it to the nearest one
- The distance is usually the Euclidean distance
- Result: every point now has a (temporary) cluster label
Step 3️⃣: Update
For every cluster:
recompute the cluster center as the mean of all points currently in that cluster
This step is where the “Means” in K-Means comes from
Step 4️⃣: Repeat Step 2 & Step 3
Keep looping:
- reassign the points
- recompute the centers
🛑 Stopping condition (Convergence)
The algorithm stops when:
- the cluster centers no longer move,
- or they move only very slightly,
- or the maximum number of iterations is reached.
📌 At this point the algorithm has converged.
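The steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation: it assumes no cluster ever becomes empty during the updates (library implementations handle that case), and it initializes centers by sampling data points.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means sketch (assumes no cluster ever becomes empty)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers barely move, otherwise repeat
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels

# Demo on two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```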
Elbow Method
In the context of K-Means clustering, the "elbow" refers to a method used to determine the optimal number of clusters (K) for a given dataset. The idea behind the elbow method is to plot the explained variation as a function of the number of clusters and look for an "elbow point" on the graph. This elbow point represents a point where adding more clusters doesn't significantly improve the model's performance, and it's often considered the optimal number of clusters.
Here's a step-by-step explanation of how the elbow method works:
- Cluster the Data: Start by applying the K-Means clustering algorithm to your dataset for a range of K values. You typically try a range of K values from 1 to some maximum value.
- Calculate Variance or Distortion: For each value of K, calculate a measure of variation or distortion within the clusters. Common measures include the sum of squared distances between data points and their cluster centers (inertia) or the average within-cluster variance.
- Plot the Results: Create a line plot or a scree plot where the x-axis represents the number of clusters (K), and the y-axis represents the corresponding measure of variation or distortion.
- Identify the Elbow: Look at the plot, and you'll often observe that the measure of variation decreases as the number of clusters increases. However, at some point, adding more clusters doesn't lead to a significant reduction in the measure. This point where the reduction in variation starts to level off is called the "elbow."
- Select the Elbow Point: The K value corresponding to the elbow point is considered the optimal number of clusters. It represents a trade-off between having enough clusters to capture the data's structure and avoiding excessive complexity by having too many clusters.
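As a sketch of the procedure on synthetic data with three well-separated blobs (the data and the cluster count are made up for illustration), the drop in inertia is large up to K = 3 and then levels off, so the elbow sits at the true number of blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

# Fit K-Means for a range of K values and record the inertia of each fit
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The drop from K=2 to K=3 is large; after K=3 it levels off,
# so the elbow sits at K=3, matching the number of blobs
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print(drops)
```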
Pros & Cons
Pros
- Results are easy to interpret, and conclusions can be highlighted visually.
- Flexible and fast, and it scales to large datasets.
- Always yields a result.
Cons
- Struggles with a high number of dimensions; dimensionality reduction such as PCA, or spectral clustering, can help fix this.
- The K value must be chosen manually or from domain knowledge of the problem; the elbow method helps assess the best K value.
- Sensitive to outliers.
- Only suits clusters with simple geometry (roughly spherical, convex shapes).
- Sensitive to initialization, and K must be specified in advance. The centroids are initially chosen at random (the red and green dots in the figure). Now imagine the points are more widely spread out: we would still start from the same centroids even though they are not the best option. If the initial centroids are not chosen properly, this can cause problems later.
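The sensitivity to initialization can be observed directly. This is a sketch on made-up blob data: single runs started from purely random centroids vary in quality from seed to seed, while k-means++ with several restarts reliably finds a low-inertia solution.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

# Single runs from purely random initial centroids: quality varies with the seed
random_runs = [KMeans(n_clusters=3, init='random', n_init=1,
                      random_state=s).fit(X).inertia_ for s in range(20)]

# k-means++ spreads the initial centroids apart and keeps the best of 10 restarts
plus_run = KMeans(n_clusters=3, init='k-means++', n_init=10,
                  random_state=0).fit(X).inertia_

print(min(random_runs), max(random_runs), plus_run)
```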
Python
Wholesale customer segmentation
Problem statement
Segment the clients of a wholesale distributor based on their annual spending on diverse product categories, like milk, grocery, etc.
Data Preview
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans

# reading the data and looking at the first five rows
data = pd.read_csv("Wholesale customers data.csv")
data.head()

Pull out some statistics related to the data
# statistics of the data
data.describe()

Here, we see that there is a lot of variation in the magnitude of the data. Variables like Channel and Region have low magnitude, whereas variables like Fresh, Milk, Grocery, etc., have a higher magnitude.
Since K-Means is a distance-based algorithm, this difference in magnitude can create a problem.
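A quick sketch (with made-up Channel and Fresh values standing in for the real columns) shows why: without scaling, the Euclidean distance between two customers is driven almost entirely by the high-magnitude column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: a low-magnitude Channel column, a high-magnitude Fresh column
X = np.array([[1.0, 12000.0],
              [2.0,  9000.0],
              [1.0,  3000.0],
              [2.0, 15000.0]])

# Raw distance between the first two rows is dominated by the Fresh column;
# the Channel difference of 1 is invisible next to the spend difference of 3000
raw = np.linalg.norm(X[0] - X[1])

# After standardization, both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(raw, scaled)
```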
Standardizing the data
# standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# statistics of the scaled data
pd.DataFrame(data_scaled).describe()

Create a KMeans model and fit it to the data
# defining the kmeans function with initialization as k-means++
kmeans = KMeans(n_clusters=2, init='k-means++')

# fitting the k-means algorithm on the scaled data
kmeans.fit(data_scaled)
We have initialized two clusters, and note that the initialization is not random here. We have used the k-means++ initialization, which generally produces better results, as discussed in the previous section.
Let’s evaluate how well-formed the clusters are. To do that, we will calculate the inertia of the clusters:
# inertia on the fitted data
kmeans.inertia_
Output: 2599.38555935614
We got an inertia value of almost 2600. Now, let’s see how we can use the elbow method to determine the optimum number of clusters in Python.
We will first fit multiple k-means models, and in each successive model, we will increase the number of clusters.
Store the inertia value of each model and then plot it to visualize the result
# fitting multiple k-means algorithms and storing the inertia values
# (the old n_jobs argument was removed from KMeans in scikit-learn 1.0)
SSE = []
for cluster in range(1, 20):
    kmeans = KMeans(n_clusters=cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them
frame = pd.DataFrame({'Cluster': range(1, 20), 'SSE': SSE})
plt.figure(figsize=(12, 6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

Can you tell the optimum cluster value from this plot? Looking at the above elbow curve, we can choose any number of clusters between 5 and 8.
Set the number of clusters to 5 and fit the model
# k-means using 5 clusters and k-means++ initialization
kmeans = KMeans(n_clusters=5, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)

# value counts of the points in each of the formed clusters
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()
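To interpret the segments, one option is to map the cluster centers back through the scaler so that each center reads in the original spending units. The sketch below uses synthetic stand-in data (the column names and values are assumptions, since the real CSV isn't reproduced here); with the objects from the cells above you would apply `scaler.inverse_transform(kmeans.cluster_centers_)` directly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wholesale data (columns are assumptions)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.gamma(2.0, 1000.0, size=(200, 3)),
                    columns=['Fresh', 'Milk', 'Grocery'])

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
kmeans.fit(data_scaled)

# Undo the standardization so each center reads in the original spending units
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                       columns=data.columns)
print(centers.round(1))
```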


