Overview
Hierarchical clustering is an unsupervised learning technique used in data analysis and machine learning. It organizes data points into a hierarchy of clusters, grouping similar points together and forming a tree-like structure known as a dendrogram.
There are two types of hierarchical clustering: Agglomerative and Divisive.
- Agglomerative clustering starts with each data point as its own cluster. Those clusters are then merged step by step as the hierarchy level goes up. This is known as the 'bottom-up' approach.
- Divisive clustering starts with all of the data in a single cluster. The model then splits clusters as it moves down the hierarchy. This is known as the 'top-down' approach.
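The bottom-up approach can be seen directly in SciPy's linkage matrix, where each row records one merge step. Below is a minimal sketch on five hypothetical 2-D points (the data and method choice are illustrative, not from the article's dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy 2-D points (hypothetical data, just for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])

# Agglomerative clustering: each point starts as its own cluster,
# and clusters are merged one pair at a time as we move up the hierarchy
Z = linkage(X, method='ward')

# Each row: [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
print(Z)
print(Z.shape)  # (n - 1, 4): n points need n - 1 merges to reach one cluster
```

Reading the rows top to bottom shows the hierarchy being built from the leaves upward: the closest pair merges first, and the merge distances grow as the clusters get larger.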
Linkage
In hierarchical clustering, we do not only measure the distance between individual data points.
We also need to measure the distance between two clusters. This measurement is known as linkage.
Several linkage methods exist, such as single linkage, complete linkage, average linkage, and Ward linkage.
- Single linkage defines the distance as the minimum distance between any two points in the two clusters.
- Complete linkage defines the distance as the maximum distance between any two points in the two clusters.
- Average linkage calculates the average distance between all pairs of points in the two clusters.
- Ward linkage merges the pair of clusters that leads to the minimum increase in within-cluster variance.
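To make the first three definitions concrete, here is a small sketch that scores the distance between the same two hand-picked 1-D clusters under each method (Ward is omitted because it works on variance increase rather than a raw pairwise distance):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0], [1.0]])   # cluster A (toy data)
B = np.array([[4.0], [6.0]])   # cluster B (toy data)

d = cdist(A, B)                # all pairwise distances between the two clusters

single = d.min()               # closest pair:   |1 - 4| = 3
complete = d.max()             # farthest pair:  |0 - 6| = 6
average = d.mean()             # mean of all pairs: (4 + 6 + 3 + 5) / 4 = 4.5

print(single, complete, average)
```

The same pair of clusters can therefore look close under single linkage and far apart under complete linkage, which is why the choice of linkage changes the resulting hierarchy.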

Dendrogram
To visualize the relationships between clusters, we can use a diagram called a dendrogram. What is a dendrogram?
A dendrogram is a tree-like chart that represents the hierarchical structure of the data. It consists of leaves and branches.
In hierarchical clustering, leaves are the data points, and branches represent the clusters.
From the branches, we can see the relationship between data points and how similar each of them is based on their features.
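The leaves and branches described above can also be inspected without drawing the chart. As a sketch, SciPy's dendrogram function has a no-plot mode that returns the leaf order and branch heights for a small hypothetical dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Four toy 1-D points: two tight pairs (hypothetical data)
X = np.array([[0.0], [0.2], [5.0], [5.1]])
Z = linkage(X, method='average')

# no_plot=True returns the dendrogram's structure instead of drawing it
info = dendrogram(Z, no_plot=True)

print(info['ivl'])     # leaf labels (data points) in left-to-right display order
print(info['dcoord'])  # heights at which branches join (cluster distances)
```

The leaves are the individual data points, and the heights in `dcoord` show how dissimilar two branches were when they merged: similar points join low on the tree, dissimilar ones join high.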

Pros and Cons
Pros
- There is no need to pre-specify the number of clusters. Instead, the dendrogram can be cut at the appropriate level to obtain the desired number of clusters.
- Data is easily summarized/organized into a hierarchy using dendrograms. Dendrograms make it easy to examine and interpret clusters.
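Cutting the dendrogram at a chosen level can be sketched with SciPy's fcluster on toy data (the points below are illustrative; the full workflow on a real dataset appears in the Python section):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two tight pairs plus one outlier (hypothetical data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1], [9.0, 0.0]])
Z = linkage(X, method='ward')

# "Cut" the dendrogram so that exactly 3 clusters remain;
# no number of clusters had to be specified before building the hierarchy
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```

The same linkage matrix can be cut again at a different level (e.g. `t=2`) without re-running the clustering, which is exactly the flexibility the first point above describes.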
Cons
- Hierarchical clustering does not scale well to very large datasets.
- It does not handle missing data well.
- The algorithm can never undo a merge or split once it has been made.
- A time complexity of at least O(n² log n) is required, where n is the number of data points.
- Depending on the linkage method chosen for merging, the algorithm can suffer from one or more of the following:
  i) Sensitivity to noise and outliers
  ii) Breaking large clusters
  iii) Difficulty handling different sized clusters and convex shapes
Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Load the dataset and keep the Annual Income and Spending Score columns
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the dendrogram to find the optimal number of clusters
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

# Training the hierarchical clustering model on the dataset
# (in scikit-learn >= 1.2 the 'affinity' parameter was renamed to 'metric';
#  Ward linkage always uses Euclidean distances, so it can simply be omitted)
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)

# Visualising the clusters
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for i, color in enumerate(colors):
    plt.scatter(X[y_hc == i, 0], X[y_hc == i, 1],
                s=100, c=color, label=f'Cluster {i + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()


