KNN

Introduction

K-nearest neighbors (kNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. I think of kNN as an algorithm drawn from real life: people tend to be influenced by the people around them.
A kNN classifier determines the class of a data point by majority vote. If k is set to 5, the classes of the 5 nearest points are examined, and the prediction is the most common class among them. Similarly, kNN regression predicts the mean value of the 5 nearest points.
kNN relies on measuring the distance between data points, and there are several ways to do so. Euclidean distance (Minkowski distance with p=2) is one of the most commonly used distance metrics. For two points in 2-dimensional space, it is the square root of the sum of the squared differences between their x and y coordinates.
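As a quick sanity check, the formula can be computed directly with NumPy (the two points here are made up for illustration):

```python
import numpy as np

# Two hypothetical points in 2-D space
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)  # 5.0
```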
 

How KNN works

  1. For an unseen data point, the algorithm calculates the distance between that point and every observation in the training dataset, across all features.
  2. It sorts those distances in ascending order.
  3. It selects the K observations with the smallest distances. These K observations are the K-nearest neighbors of the unseen data point.
      • Note that the dataset must contain at least K ≥ 1 observations.
  4. It determines which label is most common among those neighbors and assigns that label to the unseen data point.
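The steps above can be sketched as a minimal from-scratch classifier (the toy training data and the `knn_predict` helper are illustrative, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from x_new to every training observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2-3. Indices of the k smallest distances (the k nearest neighbors)
    nearest = np.argsort(dists)[:k]
    # 4. Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two small clusters of toy points
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```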

Explaining KNN to a non-technical audience

1. Look at Data Points:
  • Instead of neighbors, k-NN looks at data points or examples in a dataset. Each data point represents something, like the characteristics of a house, a person, or a product.
2. Measure Their Similarity:
  • k-NN measures how similar these data points are to each other. It looks at features like size, color, or other attributes to compare them.
3. Make a Prediction:
  • When you want to know something about a new data point, k-NN finds the k data points in the dataset that are most similar to the new one.
  • It then looks at what happened with those similar data points to make a prediction about the new data point. For example, if most of the similar houses were sold for a high price, k-NN might guess that the new house will also sell for a high price.
 

How to find the optimal value of K in KNN?

Finding the optimal value of K in k-Nearest Neighbors (k-NN) involves a process called hyperparameter tuning. The goal is to determine the value of K that results in the best performance for your specific dataset and problem. Here's a systematic approach to find the optimal K:
  1. Divide the Data: Split your dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to tune hyperparameters (including K), and the test set is used for final evaluation.
  2. Choose a Range of K Values: Decide on a range of K values to explore. This range should cover a reasonable span, typically from very small values (e.g., 1 or 2) to larger values (e.g., 20 or 30). The actual range depends on your dataset and problem.
  3. Cross-Validation: Implement a cross-validation loop on the training data. This loop involves the following steps:
      • For each K value in your chosen range, do the following:
        • Split the training data into multiple folds (e.g., 5 or 10).
        • For each fold, train a k-NN model with the current K value on the remaining folds and evaluate its performance on the held-out fold (the validation fold).
        • Calculate the average performance (e.g., accuracy, F1-score) of the k-NN models across all folds for the current K value.
      • Repeat this process for each K value in your range.
  4. Select the Best K: Based on the cross-validation results, choose the K value that gives the best average performance on the validation folds. This K value is considered the optimal K for your model.
  5. Final Evaluation: Train a k-NN model using the chosen optimal K on the entire training set (including the validation set). Then, evaluate its performance on the test set to get an unbiased estimate of its accuracy or other relevant metrics.
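The tuning procedure above might look like the following sketch using scikit-learn's `cross_val_score` (the Iris dataset and the 1-30 search range are stand-in choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here purely as a stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set for each candidate K
k_values = list(range(1, 31))
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
             for k in k_values]

# Pick the K with the best average validation score,
# then evaluate once on the held-out test set
best_k = k_values[int(np.argmax(cv_scores))]
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best K:", best_k)
print("test accuracy:", knn.score(X_test, y_test))
```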
 

Pros and Cons

Pros

  • A lazy-learning algorithm: there is no explicit training step; a new data point is simply assigned the majority class of its neighbors in the historical data. Very easy to understand and implement.
  • Can be used for both classification and regression.
  • Only one hyper-parameter: k value.
  • Non-parametric, makes no assumptions about the data/parameters
    • Parametric: Assume statistical distributions in the data, several conditions need to be met (Ex: Linear Regression)
    • Non-Parametric: Makes no assumptions about data distribution/conditions for the data to meet.

Cons

  • Struggles in high dimensions: the greater the number of dimensions, the harder it is for the algorithm to calculate distances efficiently (the curse of dimensionality).
    • Often requires dimensionality reduction techniques, especially in regression tasks with noisy data.
  • Very sensitive to outliers & noise.
  • Can become very computationally expensive as the dataset grows: it needs a lot of memory and can become very slow on a large dataset.
  • K value selection: often need to estimate ranges or combine with cross-validation techniques to obtain the optimal k value.
    • A frequent technique is to plot an elbow graph to find the optimal k value.

Python Implementation

Python Example 1

```python
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# getting data
df = pd.read_csv('KNN_Project_Data')

# EDA
# THIS IS GOING TO BE A VERY LARGE PLOT
sns.pairplot(df, hue='TARGET CLASS', palette='coolwarm')

# Standardize variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))

# transform to scaled version
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))

# convert df to scaled features
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_feat.head()

# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, df['TARGET CLASS'], test_size=0.30)

# Using KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Predictions and evaluations
pred = knn.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# Choosing a K value
error_rate = []
# Will take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

# plot elbow
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

# Retrain with new K value
# NOW WITH K=30
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print('WITH K=30')
print('\n')
print(confusion_matrix(y_test, pred))
print('\n')
print(classification_report(y_test, pred))
```
 

Python Example 2

 
```python
# 1. Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 2. Normalizing & splitting the data
# df is assumed to be a DataFrame of transactions with a binary 'fraud' column

# Split the data into features (X) and target (y)
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Fitting and evaluating the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Accuracy: 0.875
```
Unfortunately, there is no magic way to find the best value for k. We have to loop through many different values, then use our best judgment.
In the below code, we select a range of values for k and create an empty list to store our results. We use cross-validation to find the accuracy scores, which means we don’t need to create a training and test split, but we do need to scale our data. We then loop over the values and add the scores to our list.
```python
# 4. Using cross-validation to get the best value of k
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score

k_values = [i for i in range(1, 31)]
scores = []

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))

sns.lineplot(x=k_values, y=scores, marker='o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")

# 5. More evaluation metrics
# We can now train our model using the best k value.
best_index = np.argmax(scores)
best_k = k_values[best_index]

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
# Accuracy: 0.875
# Precision: 0.75
# Recall: 1.0
```
We can see from our chart that k = 9, 10, 11, 12, and 13 all have an accuracy score of just under 95%. As these are tied for the best score, it is advisable to use the smaller value of k, because higher values of k draw on data points that are further from the original point. Another option would be to explore other evaluation metrics.